4 minutes read

POSTED Jan, 2021 dot IN DevOps

How and Why You Should Use Amazon Kinesis for Your Data Streams

Serkan Özal

Written by Serkan Özal


Founder and CTO of Thundra

 X

Creating live and real-time systems is an important skill in the world of cross-platform integration, instant notifications, and real-time data. A key component of creating a real-time system is steaming data from one application to another.

There are many great tools in the modern world that provide this ability, like Kafka, RabbitMQ, and Amazon Kinesis. All of these systems were developed with different goals and have their fair share of pros and cons. Today, we want to focus on Kinesis.

Kinesis is a managed streaming service on AWS. You can use Kinesis to ingest everything from videos, IoT telemetry data, application logs, and just about any other data format live. This means you can run various processes and machine learning models on the data live as it flows through your system, instead of having to go to a traditional database first.

The Publisher Subscriber Design Pattern

Before delving more deeply into how you can use Kinesis, it is important to understand publisher and subscriber design. This is often referred to as pub/sub, and this design pattern was developed to have a message sender, referred to as a “publisher,” push “events” to an “event bus” (such as Kinesis), which will distribute them to subscribers.

The key here is that the publishers actually have no idea the subscribers exist. Kinesis manages all the messaging.

Put differently, pub/sub is a system design pattern used to communicate messages without creating a highly coupled design, and instead, it focuses on utilizing independent components that allow for a distributed workflow.

Why Use Kinesis?

Kinesis as a streaming tool has some distinct advantages. In particular, it is a managed service, meaning that AWS, and not developers, handles much of the system administration. This lets developers focus more on their code and less on managing their system. Here are a few examples showing how companies utilize Kinesis.

Streaming Data Use Case

Kinesis is useful for both large and small companies managing and integrating their data across platforms. Here are two use cases where Amazon Kinesis proved to be the right solution for managing large amounts of data.

Netflix

Netflix uses Kinesis to process multiple terabytes of log data every day. For example, Netflix needed a centralized application that logs data in real-time. It developed Dredge, which enriches content with metadata in real-time, instantly processing the data as it streams through Kinesis. This makes it unnecessary to load data into a database to be picked up later and processed.

Veritone

Veritone, which provides AI and machine-learning services, uses Amazon Kinesis video streams to process customer data. Veritone can then apply machine learning models and AI to the content in realtime to further enrich it with metadata and metrics. With this additional metadata, Veritone makes Kinesis video streams easy to search by tagged information, like audio, facial recognition etc.

These are just two examples of how companies are taking advantage of streaming with Amazon Kinesis. Now let’s break down some key technical components you’ll need to decide on.

Streams vs. Firehose

Amazon Kinesis offers two main products to choose from, Kinesis Streams and Kinesis Firehose.

To work with Kinesis Streams, you’ll use the Kinesis Producer Library to put data into your stream. You can connect it to almost any application or process.

Kinesis Streams is not a fully managed service, which means your team will need to manually scale it as required. The data will only stay in the stream for seven days.

Kinesis Firehose is simpler to implement. Data from Firehose can get sent to S3, Redshift, or even Elasticsearch using the Kinesis Agent, and from there, you can process it. If the data is stored in S3 or another AWS data storage system, you can leave it there for much longer than seven days. In fact, you can decide how long it stays in the system.

Setting Up a Stream on Kinesis

Before you can start accessing Kinesis, you’ll need to set up a stream. An easy way to do this is to use the AWS CLI. In your command shell, use the command below to create a stream called YourGamerDataStream.

aws kinesis create-stream \
--stream-name YourGamerDataStream \
--shard-count 1 \
--region eu-west-1 

Creating a Streaming Pipeline with Python

With a stream set up, you can start to build a producer and consumer. These components will create an access layer where developers can integrate other systems.

We will be using the boto3 Python library to connect to the Kinesis instance. You will also need an AWS access key and secret to properly authenticate your actions.

Creating a Python Producer

Use the code below to create a Producer in python.

import boto3
import json
import logging

logging.basicConfig(level = logging.INFO)

session = boto3.Session(region_name='eu-west-1')
client = session.client('kinesis')

test_data = {'gamer_tag': 'JoeGamer', 'score': '100', 'character': 'Flame Warrior'}

response = client.put_record(
  StreamName='YourGamerDataStream',
  Data=json.dumps({
    "gamer_tag":  test_data['gamer_tag'],
    "score":      test_data['score'],
    "character":  test_data['character']
  }),
  PartitionKey='a01'
)

logging.info("Input New Gamer Score: %s", test_data)

In order to actually pull the data, we’ll need a script that listens for data being pushed to the producers. Developers can access the data getting pushed to AWS Kinesis using the ShardIterator. This object will allow you to access the current and future records that Kinesis contains. For subsequent reads, use the shard iterator that the GetRecords request in NextShardIterator returns.

Creating a Python Consumer

The code below displays how you can create a Python consumer.

import boto3
import json
import sys
import logging

logging.basicConfig(level = logging.INFO)

session = boto3.Session(region_name='eu-west-1')
client = session.client('kinesis')

aws_kinesis_stream = client.describe_stream(StreamName='YourGamerDataStream')

shard_id = aws_kinesis_stream['StreamDescription']['Shards'][0]['ShardId']

stream_response = client.get_shard_iterator(
    StreamName='YourGamerDataStream',
    ShardId=shard_id,
    ShardIteratorType='TRIM_HORIZON'
)

iterator = stream_response['ShardIterator']

while True:
  try:
    aws_kinesis_response = client.get_records(ShardIterator=iterator, Limit=5)
    iterator = aws_kinesis_response['NextShardIterator']
    for record in aws_kinesis_response['Records']:
        if 'Data' in record and len(record['Data']) > 0:
          logging.info("Received New Gamer Score: %s", json.loads(record['Data']))
  except KeyboardInterrupt:
    sys.exit()

In this example, we are only printing out the data. However, you might want to run a process or a machine-learning model over the data before storing it.

These examples will provide the framework for your future Amazon Kinesis development.

Problems with Kinesis Pipelines

For all the benefits Kinesis provides, you will face one major challenge when working with it:  observability.

This is because, as with many AWS components, you can create a very complex system. For example, you can use Lambda functions to act as the producer and consumer for your Kinesis stream that connects to several AWS data storage systems, like Redshift, S3, or DynamoDB. This can be very difficult to manage because tracking errors and dependencies can get masked by all of the serverless calls.

Although tracking where errors occur can be difficult, Thundra makes observability simple for AWS Data Pipelines on Kinesis. Thundra took many of the features provided by AWS X-Ray and enriched the service with increased logging and metrics, integrating its own tracing infrastructure and providing developers with nearly full observability.

Streaming Data: A Necessary Feature of the Modern World

Streaming data is a necessary component of real-time analytics and applications. If you need to get data fast and live, you’ll have to utilize some sort of streaming tool. Amazon Kinesis is a great option, since it’s fully managed and can easily be spun up. It also scales enough so that Netflix and other large companies rely on it to manage their massive amounts of data.

As you plan your next project, you may want to consider developing a real-time analytics tool or integrating a live data feed and implementing Amazon Kinesis as your streaming system.