Building a robust data analytics pipeline is a critical step in determining the success (or failure) of your development efforts. Analytics pipelines let you analyze data trends from multiple sources, giving you as complete a picture of your product’s success as your data allows. The problem with data analytics work, though, is that it is often not suited to the constant-availability model of traditional software.
This article will serve as a central touchpoint for those looking to build a data pipeline on top of serverless services. We’ll define the problem, provide a few examples of the ways in which serverless technology can help, and aggregate useful resources to help you on your next step of the implementation journey.
What Are Data Analytics?
Data analytics is a practice focused on answering questions about the state of your business. It draws these answers from the application and user behavior data generated by the activity taking place on your applications every day. The work centers on defining a set of key metrics and tracing those metrics back to a source of truth in the data maintained by your application. These metrics can be as simple as a download count for a specific file, or as complex as a multi-quarter revenue projection model.
Typically, your analytics are spread across multiple data sources. Understanding the flow of data in your application and how it moves between these sources allows you to build a pipeline that transforms your data from a single source of truth and distributes it to each interested subscriber. As it grows more complex, this pipeline can not only boost your visibility into your company’s performance but also help you respond in real time to outages and negative user impacts.
Data Analytics in a Serverless Context
Serverless services, when integrated into your data analytics pipeline, can offer a large array of benefits to your application’s metrics reporting. First and foremost is the ability to implement your analytics using an on-demand execution model. Because of the large amount of processing involved, most analytics work lends itself naturally to batching. Batch work is a natural fit for a serverless architecture, since you run the code, and incur charges, only when you are actually using computing resources. Because serverless functions are also available whenever an event arrives, you can then bolster your pipeline with real-time responses to data changes, adjusting control flow and application logic based on the trends that emerge in your data as they happen.
Challenges & Considerations
While serverless technology has a lot to offer data analytics, there are some elements of a robust data pipeline that will be more challenging in an on-demand execution context. The first is the maintainability of the pipeline itself. In a serverless application, each function call happens in isolation. A failed Lambda function has no concept of a call stack unless you build one into it. As such, your developers will need to spend extra time focusing on debuggability, request tracing, and monitoring of your serverless application’s behavior to a degree above and beyond that needed for a more traditional web application. Additionally, disaster recovery can in many ways be complicated due to the need to build reporting on third-party communication failures into nearly every API touch point used by your application.
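One common way to approximate the missing call stack is to pass a correlation ID through every event and include it in structured log lines. The following is a minimal sketch under stated assumptions: the event fields `correlation_id` and `trace`, and the helper names, are hypothetical and do not come from any particular library.

```python
import json
import uuid


def with_trace(event, stage):
    """Return a copy of the event carrying a correlation ID and the stages seen so far."""
    traced = dict(event)
    # Reuse the upstream ID when present so every stage shares one trace.
    traced["correlation_id"] = event.get("correlation_id") or str(uuid.uuid4())
    # The trace list stands in for a call stack across isolated functions.
    traced["trace"] = list(event.get("trace", [])) + [stage]
    return traced


def log_line(traced, message, level="INFO"):
    """Build one structured (JSON) log line that can be filtered by correlation ID."""
    return json.dumps(
        {
            "level": level,
            "correlation_id": traced["correlation_id"],
            "trace": traced["trace"],
            "message": message,
        },
        sort_keys=True,
    )


def handler(event, context=None):
    """Hypothetical function entry point: tag the event, log, and pass it on."""
    traced = with_trace(event, stage="transform")
    print(log_line(traced, "transform started"))
    return traced
```

Because each function forwards the same `correlation_id`, a log search on that single value reconstructs the path a record took through the pipeline, even though no function shares memory with another.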
Setting aside operational concerns, the disparate and ephemeral nature of serverless functions means you also need to work around a lack of support for transactional call patterns in your data (such as rolling back a call stack when a purchase fails). Also, common actions like user verification and security challenges need to be re-implemented across a number of different serverless functions.
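One way to work around the lack of transactions is to pair every step with a compensating action that undoes it. The sketch below illustrates the idea in a single process; in a real pipeline each step would be a separate function invocation, and the purchase steps shown (`reserve`, `charge`, and so on) are hypothetical names chosen for illustration.

```python
def run_with_compensation(steps):
    """Run (action, undo) pairs in order; on failure, undo completed steps in reverse."""
    completed = []
    for action, undo in steps:
        try:
            action()
        except Exception:
            # Roll back only the steps that actually finished, newest first.
            for undo_step in reversed(completed):
                undo_step()
            return False
        completed.append(undo)
    return True


def purchase_flow(log, charge_fails):
    """Hypothetical purchase: reserve stock, then charge the customer."""

    def reserve():
        log.append("reserve")

    def release():
        log.append("release")

    def charge():
        if charge_fails:
            raise RuntimeError("payment declined")
        log.append("charge")

    def refund():
        log.append("refund")

    return run_with_compensation([(reserve, release), (charge, refund)])
```

When the charge fails, the log reads `reserve` then `release`: the reservation is undone, and no refund is issued because the charge never completed.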
Concepts in Building a Serverless Analytics Pipeline
It is important to note that there is no “perfect” model of a data pipeline. The components, patterns, and behaviors of data analytics pipelines will vary as widely as the applications upon which they are built. That being said, there are some basic tenets you can keep in mind as you build your application to ease the later creation of the analytics that will determine your success or failure metrics.
Source of Truth
To begin, it’s absolutely vital that you understand the source of truth for every key metric in your system. Where does the data originate? Where does it go? At what point is the data “official,” and how is that determination made? Answering these questions for all of your metrics is a time-consuming process, but it is absolutely necessary if you want to draw robust, confident conclusions from your analytics suite.
Data Flow Perspective
When building your application’s data model, you’ll also want to keep a serverless context in mind. Separation of concerns and reduced coupling may call for two related objects to be spread across different endpoints, for example, but you’re then likely to incur higher maintenance costs from the frequent cross-chatter between the two API locations. Analyze your application from a data flow perspective, and get a feel for the minimal parameter set needed at each data interface. Tools like Amazon Kinesis can help you construct robust data flows and pipelines to power your application, and Amazon Athena can then improve your ability to query and analyze that data, giving you a more complete view of your application performance.
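As a rough illustration of a minimal parameter set at a data interface, the sketch below decodes a batch of Kinesis records as they are delivered to a function (base64-encoded under `Records[].kinesis.data`) and keeps only the fields the next stage needs. The field names in `REQUIRED_FIELDS` and the JSON payload format are assumptions made for this example.

```python
import base64
import json

# Hypothetical minimal parameter set for the downstream stage.
REQUIRED_FIELDS = ("user_id", "event_type", "timestamp")


def decode_kinesis_records(event):
    """Decode the base64 payload of each Kinesis record (assumed to be JSON)."""
    decoded = []
    for record in event.get("Records", []):
        payload = base64.b64decode(record["kinesis"]["data"])
        decoded.append(json.loads(payload))
    return decoded


def handler(event, context=None):
    """Keep only the required fields; count records that are missing any of them."""
    accepted, rejected = [], 0
    for record in decode_kinesis_records(event):
        if all(field in record for field in REQUIRED_FIELDS):
            accepted.append({field: record[field] for field in REQUIRED_FIELDS})
        else:
            rejected += 1
    return {"accepted": accepted, "rejected": rejected}
```

Dropping unneeded fields at the boundary keeps later stages decoupled from the full shape of the source data, and the `rejected` count gives you a simple data-quality signal to monitor.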
Finally, you’ll want to build a robust Extract-Transform-Load (ETL) platform that can not only report on its own health but also recover from the inevitable application error. Your ETL pipeline is going to be the linchpin of your analytics suite, as it is responsible for ingesting the data from all applicable sources of truth and converting it into metrics that your business partners can use to plan your organization’s future.
A traditional monolithic ETL application running on an always-available server is easy to monitor and to institute recovery policies for. However, upgrading that server often represents a prohibitive expense with arcane downtime requirements. A serverless approach fixes this, but the disparate nature of your ETL components introduces additional complexity: you need to do more work to achieve the same level of monitoring and disaster recoverability as a comparable traditional application.
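To make "reports on its own health and recovers from errors" more concrete, here is a small sketch of an ETL step that retries on failure and returns a health report. The function name `run_etl`, the report fields, and the retry policy are all assumptions for illustration; because the batch may be retried, the sketch also assumes the `load` step is safe to repeat.

```python
import time


def run_etl(extract, transform, load, max_attempts=3, delay_seconds=0.0):
    """Run one ETL batch, retrying on failure, and return a health report."""
    report = {"status": "failed", "attempts": 0, "rows_loaded": 0, "errors": []}
    for attempt in range(1, max_attempts + 1):
        report["attempts"] = attempt
        try:
            rows = [transform(row) for row in extract()]
            load(rows)  # assumed idempotent, since a retry may repeat it
        except Exception as exc:
            # Record the failure so the report explains what went wrong.
            report["errors"].append(f"attempt {attempt}: {exc}")
            time.sleep(delay_seconds)
            continue
        report["status"] = "ok"
        report["rows_loaded"] = len(rows)
        break
    return report
```

The returned report can be logged or published as a metric, so that a monitoring system sees not only whether the batch succeeded but how many attempts it took and why earlier attempts failed.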
Triggers for Data Analytics
The real power of serverless services for data analytics is in triggers. Triggers are events raised by the services of a serverless platform, such as AWS, that cause the execution of custom serverless functions. By analyzing the triggers available on your platform and understanding the way that data flows through your application, you can create an efficient pipeline that performs the minimal transformation necessary at each step, and only when the underlying data itself changes!
This is particularly useful when marrying chatty data (such as application data from your database) with relatively static data (such as monthly financial reports). By implementing a set of triggers on your database, you can update intermediary calculations with the latest behavior metrics of your users in preparation for later usage. Then, when your more significant static data in S3 changes, you can use a trigger to combine the two data sets into a more cohesive picture.
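The S3 half of this pattern might look like the sketch below. It reads the bucket and key from an S3 notification event (keys arrive URL-encoded) and merges the uploaded report with precomputed metrics. The `combine` logic, the revenue-per-user metric, and the injected `load_report` and `load_intermediate` functions are hypothetical stand-ins for your own storage access code.

```python
from urllib.parse import unquote_plus


def changed_objects(event):
    """Yield (bucket, key) for each object-created record in an S3 notification event."""
    for record in event.get("Records", []):
        if not record.get("eventName", "").startswith("ObjectCreated"):
            continue
        s3 = record["s3"]
        # Object keys in S3 events are URL-encoded, with spaces sent as '+'.
        yield s3["bucket"]["name"], unquote_plus(s3["object"]["key"])


def combine(report_revenue, active_users):
    """Hypothetical join: revenue per active user for products present in both sets."""
    return {
        product: revenue / active_users[product]
        for product, revenue in report_revenue.items()
        if active_users.get(product, 0) > 0
    }


def handle_report_upload(event, load_report, load_intermediate):
    """Combine each newly uploaded report with the latest intermediate metrics."""
    results = {}
    for bucket, key in changed_objects(event):
        results[key] = combine(load_report(bucket, key), load_intermediate())
    return results
```

A deployed function would receive only `(event, context)`; the loaders are passed in here so the sketch stays self-contained and easy to test without cloud access.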
While the exact usage of triggers will again vary widely from application to application, having a complete view of the flow of your data transformations will help you to make use of these powerful triggers. Storage tiers like Amazon S3 Glacier can be added to the pipeline as well, allowing you to match how each data set is stored to the velocity at which it changes.
Caveats and Warnings for Serverless Analytics
While serverless services can provide a massive boost to any data analytics pipeline, they are not without their own pitfalls. Foremost among these is the challenge of monitoring the behavior of your pipeline. As your pipeline grows, the connections between the serverless functions that hold it together grow geometrically, creating an intricate web of behavior that needs to be maintained. If any node of this graph fails, you’ll want to both discover the failure and recover as quickly as possible. This requires extra effort for monitoring and reporting the health of your serverless functions.
In support of this, you’ll also want to focus heavily on transactional behavior in your data pipeline. What inputs depend on the output from a prior stage? If one function fails to execute, how is the resulting data impacted? Being able to answer these questions, and to design around the answers, will be critical when determining the correctness of your data. It will also drive you to improve the documentation and logging of your analytics pipeline, as the disjoint serverless functions make creating a unified log of application behavior particularly challenging.
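If you record which stages consume the output of which other stages, the question "what is impacted when this function fails?" becomes a simple graph traversal. The sketch below assumes a hypothetical dependency map from each stage to the stages it reads from; the stage names in the usage note are invented for illustration.

```python
from collections import deque


def impacted_stages(dependencies, failed):
    """Return every stage that directly or indirectly consumes output of `failed`.

    `dependencies` maps each stage name to the list of stages it reads from.
    """
    # Invert the map: for each stage, which stages consume its output?
    consumers = {}
    for stage, inputs in dependencies.items():
        for upstream in inputs:
            consumers.setdefault(upstream, []).append(stage)

    impacted, queue = set(), deque([failed])
    while queue:
        current = queue.popleft()
        for consumer in consumers.get(current, []):
            if consumer not in impacted:
                impacted.add(consumer)
                queue.append(consumer)
    return impacted
```

For a map such as `{"clean": ["ingest"], "aggregate": ["clean"], "report": ["aggregate", "finance"], "finance": []}`, a failure in `clean` marks `aggregate` and `report` as suspect, which tells you exactly which outputs to recompute or flag as stale.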
The need for monitoring is especially evident when considering your serverless functions as a cohesive whole. While you can get statistics fairly quickly on individual Lambda function calls, aggregating those statistics requires creating an entirely separate analytics pipeline focused on the server-level performance of your analytics code. If this crucial element of a serverless application is neglected, infrastructure costs can balloon suddenly as your serverless application expands to accommodate load.
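A first step toward that aggregate view can be as small as rolling per-invocation records up by function. The sketch below assumes you have already collected invocation records with the hypothetical fields `function`, `billed_ms`, `memory_mb`, and `error`; it reports totals in GB-seconds, the unit Lambda compute is billed in, and deliberately leaves prices out.

```python
from collections import defaultdict


def summarize(invocations):
    """Aggregate per-invocation records into per-function totals."""
    summary = defaultdict(
        lambda: {"calls": 0, "errors": 0, "billed_ms": 0, "gb_seconds": 0.0}
    )
    for inv in invocations:
        stats = summary[inv["function"]]
        stats["calls"] += 1
        stats["errors"] += 1 if inv["error"] else 0
        stats["billed_ms"] += inv["billed_ms"]
        # GB-seconds = configured memory in GB x billed duration in seconds.
        stats["gb_seconds"] += (inv["memory_mb"] / 1024) * (inv["billed_ms"] / 1000)
    return dict(summary)
```

Tracking GB-seconds per function over time makes a sudden rise in consumption visible before it shows up on an invoice.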
Serverless services have a lot to offer companies looking to bolster their data analytics pipelines. Take, for example, Arçelik A.S. out of Istanbul, Turkey. Arçelik wanted to better understand its user behavior and customer data, and was interested in a pipeline that leveraged serverless technology to provide more accurate results. Thundra worked with Arçelik to build a highly performant, fault-tolerant data pipeline on serverless services, providing up-to-date metrics on user behavior instantaneously.
To read more about our work with Arçelik and how your data analytics can be bolstered with serverless services, check out the case study on the AWS Partner Success page.