Lambda Data Stores: Which One to Pick

Apr 9, 2020

 

AWS was the first cloud vendor to offer a way to run your code without servers when it introduced Lambda in 2014. This was a step change in cloud computing: customers could finally run their code without managing anything related to where it actually ran. However, you still need to store data somewhere, and that is the subject of this article.

Lambda functions have limitations. The two most important ones are a hard limit on execution duration (currently 15 minutes) and the fact that they must be stateless. Keep both in mind when designing an application based on Lambda functions. If you need to keep information from one Lambda invocation to the next, you have to store that state somewhere outside the function. The 15-minute limit also makes it hard to move large amounts of data within a single invocation, especially with slower data storage options such as S3.

Another limitation is concurrency: by default, an account can’t have more than 1,000 Lambda invocations running simultaneously in a region, although this limit can be raised on request. So, depending on the size of your application, it might be important to ensure that Lambda functions run as quickly as possible and spend as little time as possible waiting for the chosen data store to finish its operations.
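To see why data store latency matters for that limit, note that steady-state concurrency is roughly the request rate multiplied by the average invocation duration. The sketch below is only a back-of-the-envelope estimate; the function name and the workload numbers are made up for illustration.

```python
import math


def estimated_concurrency(requests_per_second: float, avg_duration_seconds: float) -> int:
    """Estimate how many invocations run at the same time under steady load."""
    return math.ceil(requests_per_second * avg_duration_seconds)


# Hypothetical workload: 200 requests per second.
slow = estimated_concurrency(200, 5.0)   # each call spends ~5 s, mostly waiting on storage
fast = estimated_concurrency(200, 0.25)  # each call finishes in ~250 ms
print(slow, fast)
```

With these assumed numbers, the slow variant sits right at 1,000 concurrent invocations, while the fast one needs only a small fraction of that.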

So, let’s go through some Lambda storage options.

S3

Amazon S3 stands for “Simple Storage Service,” and although it is simple to use (at least for the basic features), it offers a rich feature set. S3 is an object store with a very strong durability guarantee, but it’s a bit slow compared to other solutions.

S3 is ideal for storing large, unstructured data. So if your Lambda functions work on pictures, videos or PDF files, S3 might be a good option. It also provides you with a fine-grained permission system, all the way down to individual S3 objects (i.e., files). S3 works well as the input to your Lambda functions, the output, or both. A typical use case would be to read a file from S3, process the data, and write some transformed data back to S3.
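The read, process, write pattern can be sketched as follows. This is a minimal illustration rather than a production handler: the `transform` step and the `processed/` prefix are hypothetical, and the optional `s3` parameter exists only so the function can be exercised with a stand-in client. The `get_object` and `put_object` calls are the standard boto3 S3 client methods.

```python
import urllib.parse


def transform(data: bytes) -> bytes:
    """Placeholder processing step (hypothetical): upper-case the payload."""
    return data.upper()


def handler(event, context, s3=None):
    """Read the object named in an S3 event, transform it, write the result."""
    if s3 is None:
        import boto3  # imported lazily so the sketch can run without AWS access
        s3 = boto3.client("s3")

    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    # Object keys arrive URL-encoded in S3 event notifications.
    key = urllib.parse.unquote_plus(record["object"]["key"])

    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    out_key = f"processed/{key}"  # hypothetical output prefix
    s3.put_object(Bucket=bucket, Key=out_key, Body=transform(body))
    return {"bucket": bucket, "key": out_key}
```

If the output goes to the same bucket that triggers the function, as assumed here, the trigger should be filtered so that objects under `processed/` do not invoke the function again.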

S3 is not suitable for database-like requirements, such as searching inside the data or modifying parts of a file. Additionally, S3 can be a bit slow, so if you need to move large volumes of data around, your Lambda functions might run into the execution time limit.

RDS

RDS is a managed service that provides you with a variety of database engines, such as MySQL and PostgreSQL. These databases are extensively used as a backend to a variety of applications, such as websites, enterprise software, or ETL (Extract-Transform-Load). If you are implementing an application that would traditionally require such a database, using RDS for a similar application implemented as Lambda functions would be a good choice.

Relational databases are suitable for structured queries and fast searches using SQL. Just as with traditional applications, your Lambda functions would most likely access the database through an ORM (Object-Relational Mapping). There is a large support community for ORMs in all major programming languages and for all major database engines.

Another advantage of using RDS as a data store is that most developers, sysadmins, and DevOps engineers are familiar with SQL and relational databases.

Finally, it’s noteworthy that AWS offers serverless relational database engines, but these typically cost significantly more than RDS instances, although performance is better as well.

One point to keep in mind is timeouts: make sure that they are configured properly; otherwise, you might see your Lambda functions being forcefully terminated when they reach the execution time limit. RDS could also be used to store states and queues (and there are many libraries available to implement such things), but there are better solutions for that.

Additionally, you need to be careful with slow queries. With Lambda, you are billed for execution time, even if your code is just sitting idle waiting for the query to finish. Ideally, you want your Lambda functions to return as quickly as possible. If they’re slow, you might need to add some indexes to your schema or use some caching mechanism.
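One way to keep database timeouts below the Lambda time limit is to derive them from the time left in the current invocation, which the Lambda context object exposes through `get_remaining_time_in_millis()`. The helper below is a sketch under assumptions: its name, the two-second safety margin, and the three-second connect timeout are illustrative choices, and the dictionary keys are named after the `connect_timeout` and `read_timeout` keyword arguments of PyMySQL's `connect()`.

```python
def db_timeouts(context, safety_margin_ms: int = 2000, connect_timeout_s: int = 3) -> dict:
    """Derive database timeouts (in seconds) from the time left in this invocation."""
    remaining_ms = context.get_remaining_time_in_millis()
    # Leave a margin so the function can still log and return an error itself,
    # but never go below one second.
    budget_s = max(1, (remaining_ms - safety_margin_ms) // 1000)
    return {
        "connect_timeout": min(connect_timeout_s, budget_s),
        "read_timeout": budget_s,
    }
```

The result could then be passed along with the usual connection settings, for example `pymysql.connect(**connection_settings, **db_timeouts(context))`, where `connection_settings` is a hypothetical dictionary holding host, user, and password.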

NoSQL Databases

An alternative to the structured relational databases is to use NoSQL databases. Unfortunately, unlike SQL-based database engines, there’s no standard for NoSQL, so each NoSQL database engine implements something different. AWS offers many NoSQL options, so let’s go through them quickly in this section.

DynamoDB

DynamoDB is a home-grown database system offered by AWS. It has a very simple structure, where rows are indexed using a primary key, so it can be thought of as a key/value system or a giant hash table. For such simple needs, DynamoDB is probably the best choice on AWS, as it has a low cost and very high performance. Additionally, DynamoDB is serverless, so you don’t have to do anything for maintenance and very little to scale in or out.
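The key/value style of access might look like the sketch below. It assumes a table named `users` with a partition key called `user_id`; both names are hypothetical. `put_item` and `get_item` are the boto3 `Table` resource methods, and the optional `table` parameter is there only so the code can be tried with a stand-in object.

```python
def save_user(table, user_id: str, profile: dict) -> None:
    """Store (or overwrite) the item for this user."""
    table.put_item(Item={"user_id": user_id, **profile})


def load_user(table, user_id: str):
    """Return the stored item, or None if the key does not exist."""
    response = table.get_item(Key={"user_id": user_id})
    return response.get("Item")  # "Item" is absent when nothing matches the key


def handler(event, context, table=None):
    if table is None:
        import boto3  # imported lazily so the sketch can run without AWS access
        table = boto3.resource("dynamodb").Table("users")  # hypothetical table name

    user_id = event["user_id"]
    if "profile" in event:
        save_user(table, user_id, event["profile"])
    return load_user(table, user_id)
```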

DocumentDB

DocumentDB is a managed service from AWS that provides a MongoDB-like interface and is well suited to document-oriented (JSON-like) data. DocumentDB is also a highly available service, although not serverless. Typically, reads are fast (especially when using read replicas), but writes can become a bottleneck. So if your Lambda functions are write-heavy, DocumentDB might not be the best option for you. Also, a failover can take a bit of time (typically 10 to 40 seconds); failovers are rare, however, so this is perfectly reasonable to overlook for most applications.

Managed Apache Cassandra Service

AWS offers a managed Cassandra service as well: MCS. Cassandra is less flexible regarding data schema than MongoDB and has a less expressive object model. It uses a multi-master architecture, so the failure of a single node should have no impact on the availability of your app. Also, Cassandra is built from the ground up to work as a cluster and to handle large amounts of data.

Amazon Elasticsearch Service

Lastly, Amazon Elasticsearch Service is a managed offering from AWS that you can use to set up and manage an Elasticsearch cluster. Typically, Lambda functions are used to ingest data into Elasticsearch clusters, and certain Elasticsearch modules, such as ElastAlert, can be configured to be called on certain triggers.

Options to Store States

Lambda functions are stateless, so what are your options if you want or need to store states? You can store states in databases (SQL or NoSQL), but this is suboptimal because those are designed for long-term storage and thus involve a lot of disk input/output. To store states, you want a fast service where durability is not a concern. Because states can usually be recreated (for example, by having a user log in again), the best solution is to use an in-memory caching server.

AWS offers managed solutions for both Memcached and Redis (through Amazon ElastiCache). Both are very fast in-memory caching systems, which makes them ideal for storing states: reading and writing memory is much more efficient than going to the disk of a SQL or NoSQL database.
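As an illustration, state could be kept in Redis under a per-session key with an expiry, so that stale entries disappear on their own. This is a sketch under assumptions: the `session:` key prefix and the 15-minute expiry are arbitrary choices, and `cache` is expected to be a redis-py style client offering `set(name, value, ex=...)` and `get(name)`.

```python
import json

SESSION_TTL_SECONDS = 900  # hypothetical expiry: 15 minutes


def save_state(cache, session_id: str, state: dict) -> None:
    """Serialize the state to JSON and store it with an expiry."""
    cache.set(f"session:{session_id}", json.dumps(state), ex=SESSION_TTL_SECONDS)


def load_state(cache, session_id: str):
    """Return the stored state, or None if it is missing or has expired."""
    raw = cache.get(f"session:{session_id}")
    return json.loads(raw) if raw is not None else None
```

In a real function, the client would typically be created once at module level (outside the handler) so that warm invocations reuse the same connection instead of reconnecting each time.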

Conclusion

One final point to keep in mind regarding storage for Lambda functions is that it is actually possible to store data on the local filesystem (in `/tmp`). This is very fast and can be useful if temporary storage is required within the lifetime of a Lambda invocation. The limit is 512 MB though, and you need to make sure the Lambda function cleans up after itself, because subsequent invocations that reuse the same execution environment might be able to see the data.
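A simple way to get that cleanup is to confine scratch files to a temporary directory that is removed when the work is done. The handler below is a minimal sketch; the `payload` event field and the file name are hypothetical, and it assumes that `tempfile.gettempdir()` resolves to `/tmp` inside Lambda.

```python
import os
import tempfile


def handler(event, context):
    """Use scratch space for one invocation and remove it before returning."""
    # Assumption: on Lambda, tempfile.gettempdir() resolves to /tmp.
    with tempfile.TemporaryDirectory(dir=tempfile.gettempdir()) as workdir:
        path = os.path.join(workdir, "scratch.bin")
        with open(path, "wb") as fh:
            fh.write(event.get("payload", "").encode("utf-8"))
        size = os.path.getsize(path)
    # The directory and its contents are deleted when the with-block exits,
    # so a later invocation in the same environment will not see them.
    return {"bytes_written": size, "cleaned_up": not os.path.exists(workdir)}
```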

The overall conclusion is that there are many choices available for data storage in the context of your Lambda functions, and in fact, the vast majority are not Lambda-specific. So a choice you would have made for a traditional application would still probably be valid for a Lambda-based application. Or at least it would be a good starting point that can be revised based on Lambda limitations.

Which storage is good for you will depend on how your data is structured, how fast the access should be and whether you want a tight integration with Lambdas or are happy to do the plumbing yourself. Here are a few examples:

  • To store data related to your users, DynamoDB would be a good choice, indexed by whatever you use to uniquely identify the user (email address, username, user id, etc.).
  • To store states between Lambda invocations, use Memcached or Redis.
  • To store structured data with strong constraints, use RDS.
  • To store loosely structured data with loose constraints, use DocumentDB or Cassandra.
  • To store unstructured data such as audio, video, images, and files, use S3.

In any case, it’s worth spending some time on properly designing your Lambda-based application. Doing so will still save you time compared to running a traditional, server-based application, where maintenance and patching would be required.