

Building a serverless data pipeline using Kinesis > Lambda > S3 Glacier pt. 1

Nov 19, 2019

 


Getting Started

Before we can really get going, you’re going to need a source of streaming data. For the purpose of writing this article, I made a simple React webpage that fires off a random number every second. I wanted to do this for a few reasons:

  • I write my Lambdas in NodeJS, so getting familiar with the JS syntax for Kinesis in the AWS-SDK seemed prudent.
  • I wanted a dead-simple way to start and stop the data stream as opposed to using, say, a Twitter stream.

If you are building your own producer, there is a handy tool from AWS called Kinesis Producer Library or KPL.

“The KPL is an easy-to-use, highly configurable library that helps you write to a Kinesis data stream. It acts as an intermediary between your producer application code and the Kinesis Data Streams API actions.”

 The official documentation can be found here.

Project Architecture

[Image: project architecture diagram — producer → Kinesis stream → Lambda → S3 Glacier]

As you can see, the general outline of our project is very linear. Lambda functions can both be triggered by a data pipeline and trigger one themselves, which allows you to implement much more complex logic in terms of how you route your data through the various branches of your data pipeline.

Configuring a Producer

On either end of a Kinesis instance, there is a Producer and a Consumer. As the names might suggest, a Producer produces data for Kinesis, and a Consumer consumes it from Kinesis. Producers can be pretty much anything connected to the internet - phones, tablets, computers, refrigerators, mattresses, you name it. 

Given the nature of the beast, it’s somewhat difficult to advise you on how to make your producer. For the purpose of this tutorial, we’re going to assume that you have your data source already. The AWS-SDK node package includes methods to send data to Kinesis. If you are building your own producer, there is a tool from AWS called KPL, which I mentioned above.

For this tutorial, I’ve built a bare-bones producer that sends a random number between 1 and 100 to the stream every second. I’m leaving it intentionally simple so we don’t get bogged down with specifics about data handling. All you need to know is that the data is being sent to our Kinesis stream from an external source in a predictable format. 
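To make that concrete, here is a minimal sketch of such a producer in NodeJS using the AWS-SDK's putRecord call. This is not the exact code from my React page, just the general shape of it; the stream name random-number-stream and the region are placeholders you'd swap for your own:

    // producer.js — minimal sketch of a Kinesis producer (AWS SDK for JavaScript v2)
    const AWS = require('aws-sdk');

    const kinesis = new AWS.Kinesis({ region: 'us-east-1' }); // placeholder region

    // Send a random number between 1 and 100 to the stream every second.
    setInterval(() => {
      const payload = {
        value: Math.floor(Math.random() * 100) + 1,
        timestamp: Date.now(),
      };

      kinesis.putRecord(
        {
          StreamName: 'random-number-stream', // placeholder stream name
          Data: JSON.stringify(payload),      // Kinesis accepts strings or Buffers
          PartitionKey: 'demo',               // single shard, so any key will do
        },
        (err) => {
          if (err) console.error('putRecord failed:', err);
        }
      );
    }, 1000);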

Creating a Kinesis Stream using the Serverless Framework

Starting out with a new serverless project, we’re going to add a few lines to the aws-nodejs template:

  • profile
  • region
  • stage
  • iamRoleStatements

Except for the IAM role statements, these fields are largely dependent on your specific needs and goals for your project. It should look similar to this:

[Image: serverless.yml provider section]
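If you're following along, a minimal provider section might look something like the sketch below; the service name, profile, region, and stage are all placeholders to adjust for your own setup:

    # serverless.yml — provider section (values are placeholders)
    service: kinesis-pipeline

    provider:
      name: aws
      runtime: nodejs10.x
      profile: default       # your local AWS credentials profile
      region: us-east-1      # pick the region closest to your data source
      stage: dev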

Before we add to the iamRoleStatements, we’ll need to define a Kinesis instance for it to point to. Kinesis gives you two delivery options to choose from, Stream and Firehose. The main difference between the two is how data is stored and how throughput scales.

  • Stream
    • Made up of substreams called shards, each with a fixed throughput capacity (up to 1 MB or 1,000 records per second for writes, and 2 MB per second for reads).
    • There is no hard upper limit on the number of shards your stream can contain, though default per-region quotas apply. See the documentation for specifics.
    • A stream’s total throughput capacity is equal to the sum total of its shards.
    • Data is stored by default for 24 hours and is configurable to a maximum of 7 days.

  • Firehose
    • Accepts up to 500 records or 4 MiB per call, whichever limit is reached first.
    • Data is not stored except when its destination cannot be reached, in which case it is retained for up to 24 hours.

In this case, we’re going to use a Kinesis Stream with a single shard. 

Setting up a basic Kinesis Stream with Serverless Framework is very straightforward. The only required properties are a name and a shard count. There are several other options defined in CloudFormation, and you can check them out here.

With your new stream defined, your resources property should look something like this:

[Image: serverless.yml resources section defining the Kinesis stream]
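As a sketch, a single-shard stream in the resources section looks roughly like this; the logical ID KinesisStream and the stream name are placeholders:

    # serverless.yml — resources section (names are placeholders)
    resources:
      Resources:
        KinesisStream:
          Type: AWS::Kinesis::Stream
          Properties:
            Name: random-number-stream
            ShardCount: 1    # one shard is plenty for one record per second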

Now that we have our Stream resource defined, we can reference it in our iamRoleStatements. There are four permissions it needs:

  • GetRecords
  • GetShardIterator
  • DescribeStream
  • ListStreams

Using CloudFormation’s GetAtt function, we can dynamically reference the ARN of the Stream resource in the role statement. The final result should look like this:

[Image: serverless.yml iamRoleStatements referencing the stream ARN]
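Here’s a rough sketch of those statements, assuming the stream uses the KinesisStream logical ID from the resources section above:

    # serverless.yml — add to the provider section from earlier
    provider:
      # name, runtime, profile, region, stage as shown earlier
      iamRoleStatements:
        - Effect: Allow
          Action:
            - kinesis:GetRecords
            - kinesis:GetShardIterator
            - kinesis:DescribeStream
            - kinesis:ListStreams
          Resource:
            - Fn::GetAtt: [KinesisStream, Arn]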

Wrap Up

Resources we made:

  • Kinesis stream

How are they connected?

  • The Kinesis stream receives a data stream from an outside provider.
  • The records are sent on to our Lambda in batches of 50 (see the sketch after this list).
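That batching behavior comes from the Lambda’s stream event configuration, which we’ll build properly in part 2. As a preview, a hypothetical function subscribed to our stream might be wired up like this; the function and handler names are placeholders:

    # serverless.yml — function triggered by the stream (placeholder names)
    functions:
      processRecords:
        handler: handler.processRecords
        events:
          - stream:
              type: kinesis
              batchSize: 50    # Lambda receives records in batches of 50
              arn:
                Fn::GetAtt: [KinesisStream, Arn]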

To sum up, we started building a basic data pipeline that consumes streaming data with a Kinesis stream. In the second part of the series, we’ll set up a Lambda function that processes this data and saves the results to S3 Glacier.

If you want to chat about any stage of serverless with us, you can ping us on Twitter (@thundraio) or join our Slack and let’s chat. You can sign up for Thundra or see the live demo to see Thundra in action.