When developing software that works with data, keeping your database schema in sync with the data being inserted into it can be challenging. Developers are constantly adding new features that often require changing an application’s underlying database. Because traditional relational databases often require you to alter tables and redesign schemas, they don’t support this constant change well. Several modern databases do, however. These tools either lack a rigid schema or have only a semi-structured design. Two such examples are CassandraDB and DynamoDB. Until recently, only DynamoDB was available as a managed AWS service. However, at AWS re:Invent in December 2019, AWS revealed a new managed version of CassandraDB.
Those who have experience managing CassandraDB clusters know that operating them is tricky. To address this problem, Amazon developed the Amazon Managed Apache Cassandra Service. CassandraDB is not the only semi-structured database offered by Amazon. Another popular option for NoSQL databases is DynamoDB. Both data storage systems offer similar functionality; however, they approach data storage differently. This leads to many differences in the way data is managed, stored, and distributed. After introducing both databases, this blog post will compare Amazon Managed Apache Cassandra Service to DynamoDB in an effort to help you choose the best tool for your development needs.
A Brief Introduction to CassandraDB and DynamoDB
CassandraDB is a NoSQL database developed at Facebook by Avinash Lakshman and Prashant Malik. Its original focus was powering inbox searches. Unlike your standard relational database, it was developed to be distributed and decentralized. This architecture allows Cassandra to quickly serve reads and writes across its wide column-based tables. Currently, CassandraDB is an open-source project under the Apache Software Foundation, which has helped make it a very popular NoSQL database option.
DynamoDB is what is known as a key-value and document-based database. This means that data is stored with a key that points to a specific value, which, in this case, is what we will call a “document.” A document is essentially an object stored in a JSON-like format. DynamoDB is a fully managed service, meaning that AWS manages the underlying clusters automatically. AWS also keeps redundant copies of the data in separate locations, which helps to reduce application downtime and increase performance. Because DynamoDB is serverless, there’s no need to manage any servers. All of these features make DynamoDB a great option when developing responsive apps.
Comparing DynamoDB and CassandraDB
From a high-level functionality standpoint, CassandraDB and DynamoDB are very similar. Both databases offer the ability to manage data without a specific column schema. The concept of tables still exists, but there’s no requirement to respect a specific set of columns as you might find with MySQL or SQL Server. That said, once you look under the hood, differences between the two databases in the areas of data storage, data indexing, security, and ownership become evident.
These two databases approach their unspecified columns differently, despite the fact that, technically, both can be considered wide column systems.
CassandraDB is what is known as a wide-column store. A wide-column store manages data in column families. Each column value is tied to a primary key, and all values that share a key together make up a single row. As a result, wide-column stores do not require a fixed table structure. Rows in a wide-column database don’t need to have the same columns, enabling developers to dynamically add and remove columns without impacting the underlying table.
An example of a column family can be seen in the image below, where the row key is connected to each of the various columns and the values they represent:
Figure 1: Wide-column store structure
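The wide-column idea can be sketched in a few lines of Python. This is only an illustrative in-memory model, not how Cassandra lays data out on disk; the `table`, `put`, and `columns_of` names and the station data are hypothetical.

```python
from collections import defaultdict

# Hypothetical in-memory model of a wide-column table:
# row key -> {column name -> value}. Purely illustrative.
table = defaultdict(dict)

def put(row_key, column, value):
    """Attach a single column value to the row identified by row_key."""
    table[row_key][column] = value

def columns_of(row_key):
    """Return the sorted column names present on one row."""
    return sorted(table[row_key])

# Two rows in the same table carry different sets of columns.
put("station-1", "name", "Central")
put("station-1", "docks", 12)
put("station-2", "name", "Harbor")
put("station-2", "last_inspected", "2020-01-15")

print(columns_of("station-1"))
print(columns_of("station-2"))
```

Adding `last_inspected` to one row required no change to the other row or to the table itself, which is the property described above.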
In comparison, DynamoDB stores dynamic data as JSON-like documents, an approach typically called document-based storage. Instead of storing columns separately, DynamoDB stores all of an item’s attributes together in one document.
The images below illustrate that there is still a key that connects everything, even though the way the data is stored is quite different:
Figure 2: How data is stored in DynamoDB as JSON (Source: AWS management console)
Figure 3: How each JSON object is attached to an id, in this case a station_id (Source: AWS management console)
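As a rough sketch of the document model, the snippet below keeps each whole document under its key. The `items` store, the helper names, and every attribute other than `station_id` are assumptions made for illustration; this is not the DynamoDB API.

```python
import json

# Hypothetical key -> document store; attribute names are illustrative.
items = {}

def put_item(document):
    """Store the whole document under its station_id key."""
    items[document["station_id"]] = document

def get_item(station_id):
    """Fetch the complete document with a single key lookup."""
    return items[station_id]

# Documents under the same table can have different, even nested, attributes.
put_item({"station_id": 101, "name": "Central", "docks": 12})
put_item({"station_id": 102, "name": "Harbor",
          "location": {"lat": 47.6, "lon": -122.3}})

print(json.dumps(get_item(102), indent=2))
```

Unlike the wide-column model, the unit of storage here is the entire document, so nested values travel with it.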
If you are used to working with databases, then you are probably accustomed to secondary indexes. Secondary indexes are indexes that aren’t on the primary key but help order and/or search data faster. The goal of a secondary index is to improve read performance by letting queries on non-key attributes avoid reading entire tables. DynamoDB and CassandraDB implement secondary indexes slightly differently.
DynamoDB offers two kinds of indexes: global secondary indexes and local secondary indexes. A global secondary index is based on a partition key and sort key that don’t necessarily match those of the base table. Since the index is stored separately from the base table, there is no need for its partition key to be the same. When a query reads the index, it’s essentially reading a separate table, since the index is stored in its own partition space. A local secondary index, on the other hand, has the same partition key as the base table but a different sort key. Having the same partition key allows the local index to be stored on the same partition as the base table.
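To make the difference concrete, here is a sketch of a table definition written as a Python dictionary in the shape that boto3’s `create_table` call accepts. The table name, index names, and attributes are all hypothetical, and the snippet only builds and inspects the dictionary; it does not contact AWS.

```python
# Hypothetical table definition; every name below is illustrative.
table_definition = {
    "TableName": "StationReadings",
    "BillingMode": "PAY_PER_REQUEST",
    "AttributeDefinitions": [
        {"AttributeName": "station_id", "AttributeType": "N"},
        {"AttributeName": "reading_time", "AttributeType": "S"},
        {"AttributeName": "temperature", "AttributeType": "N"},
        {"AttributeName": "city", "AttributeType": "S"},
    ],
    # Base table: partition (HASH) key plus sort (RANGE) key.
    "KeySchema": [
        {"AttributeName": "station_id", "KeyType": "HASH"},
        {"AttributeName": "reading_time", "KeyType": "RANGE"},
    ],
    # Local index: same partition key, different sort key.
    "LocalSecondaryIndexes": [{
        "IndexName": "by_temperature",
        "KeySchema": [
            {"AttributeName": "station_id", "KeyType": "HASH"},
            {"AttributeName": "temperature", "KeyType": "RANGE"},
        ],
        "Projection": {"ProjectionType": "ALL"},
    }],
    # Global index: free to use a different partition key.
    "GlobalSecondaryIndexes": [{
        "IndexName": "by_city",
        "KeySchema": [
            {"AttributeName": "city", "KeyType": "HASH"},
            {"AttributeName": "reading_time", "KeyType": "RANGE"},
        ],
        "Projection": {"ProjectionType": "ALL"},
    }],
}

def partition_key(key_schema):
    """Return the name of the HASH (partition) key in a key schema."""
    return next(k["AttributeName"] for k in key_schema if k["KeyType"] == "HASH")

base = partition_key(table_definition["KeySchema"])
local = partition_key(table_definition["LocalSecondaryIndexes"][0]["KeySchema"])
glob = partition_key(table_definition["GlobalSecondaryIndexes"][0]["KeySchema"])
print(base, local, glob)
```

With credentials configured, a dictionary like this could in principle be passed as `boto3.client("dynamodb").create_table(**table_definition)`, though that step is outside this sketch.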
CassandraDB also offers a secondary index. It works as what is called a distributed index, meaning each node indexes only the data it holds. This can cause problems down the line, because a query on the indexed column may have to contact many nodes, creating larger scans than you might want in your application environment. When using CassandraDB, implementing materialized views, which essentially act as a new table keyed by the column you want to query, is often suggested instead.
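The trade-off can be illustrated with a toy simulation. The three-node cluster, the user rows, and the function names below are hypothetical, and the model ignores replication and many other details of a real Cassandra cluster.

```python
# Hypothetical three-node cluster; each node holds rows for some user ids.
nodes = [
    {"u1": {"city": "Oslo"}, "u4": {"city": "Lima"}},
    {"u2": {"city": "Lima"}, "u5": {"city": "Oslo"}},
    {"u3": {"city": "Oslo"}},
]

def build_index(rows):
    """Index a set of rows by city: city -> list of user ids."""
    index = {}
    for user_id, row in rows.items():
        index.setdefault(row["city"], []).append(user_id)
    return index

# Distributed index: every node indexes only its own rows.
local_indexes = [build_index(node) for node in nodes]

def query_local_indexes(city):
    """Ask every node, because any of them might hold matching rows."""
    matches = []
    for index in local_indexes:
        matches.extend(index.get(city, []))
    return sorted(matches), len(local_indexes)

# View: behaves like a second table partitioned by city.
view = build_index({uid: row for node in nodes for uid, row in node.items()})

def query_view(city):
    """All rows for a city share one partition, so one lookup is enough."""
    return sorted(view.get(city, [])), 1

print(query_local_indexes("Oslo"))
print(query_view("Oslo"))
```

Both queries return the same users, but the indexed query touches every node while the view answers from a single partition, at the price of storing the data a second time.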
CassandraDB allows role-based user access. Some distributions, such as DataStax Enterprise, also allow administrators to lock down data at the row level. This means a user can be limited to searching only specific rows based on their role or authorization.
For the most part, access to DynamoDB is managed using AWS IAM. Data can be locked down at the attribute level, which is essentially the equivalent of locking data down at the column level. However, DynamoDB takes security one step further by allowing you to encrypt your data at rest, not just in transit. CassandraDB cannot do this without incorporating third-party solutions.
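As an illustration of attribute-level control, the sketch below builds an IAM policy document as a Python dictionary using the `dynamodb:Attributes` condition key. The account ID, region, table name, and attribute names are placeholders, and a real policy would need to be reviewed against your own access requirements.

```python
import json

# Hypothetical policy: the ARN and attribute names are placeholders.
policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["dynamodb:GetItem", "dynamodb:Query"],
        "Resource": "arn:aws:dynamodb:us-east-1:123456789012:table/StationReadings",
        "Condition": {
            # Only these attributes may be requested.
            "ForAllValues:StringEquals": {
                "dynamodb:Attributes": ["station_id", "reading_time", "temperature"]
            },
            # Require callers to name the attributes they want.
            "StringEqualsIfExists": {"dynamodb:Select": "SPECIFIC_ATTRIBUTES"},
        },
    }],
}

def allowed_attributes(policy_document):
    """Collect the attribute names the first statement permits."""
    condition = policy_document["Statement"][0]["Condition"]
    return condition["ForAllValues:StringEquals"]["dynamodb:Attributes"]

print(json.dumps(policy, indent=2))
```

A caller bound by such a policy could read `temperature` but not any attribute left off the list, which is the column-level restriction described above.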
AWS-Owned vs. Open-Source
DynamoDB was designed from the start as a proprietary AWS service, so it was built with the cloud in mind and is offered only as a commercial AWS product.
CassandraDB was originally developed by Facebook. It was later open-sourced and then adopted by the Apache Software Foundation. Since Cassandra can run on AWS or elsewhere, it is a more portable option that can also help you avoid getting locked into AWS.
Known Issues with CassandraDB and DynamoDB
There are limitations to every database solution, including CassandraDB and DynamoDB. This section explores some known drawbacks of each product.
Cassandra and Updates
Cassandra was designed for fast writes. As a result, it handles an update by writing a new version of a row with a fresh timestamp and flagging the old row as “to be deleted,” rather than modifying the old row in place. However, it doesn’t necessarily delete the old row right away. If too many stale versions accumulate, this process can bog down the system. You can manage this with tunable consistency, which lets you choose how many replicas must acknowledge each read or write and therefore how up to date your synchronized rows are. Left unchecked, an overabundance of stale data can severely hamper the efficiency of your system.
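The write path described above can be approximated with a small append-only log. This is a deliberately simplified sketch: the `log`, `write`, `read`, and `compact` names are hypothetical, the counter stands in for real timestamps, and the cleanup step is only a rough analogue of what Cassandra calls compaction.

```python
import itertools

_clock = itertools.count(1)  # stand-in for write timestamps
log = []                     # append-only list of (key, timestamp, value)

def write(key, value):
    """An update appends a new version instead of changing the old one."""
    log.append((key, next(_clock), value))

def read(key):
    """Return the value carrying the newest timestamp for this key."""
    versions = [(ts, value) for k, ts, value in log if k == key]
    return max(versions)[1] if versions else None

def compact():
    """Drop every version that a newer write has superseded."""
    newest = {}
    for key, ts, value in log:
        if key not in newest or ts > newest[key][0]:
            newest[key] = (ts, value)
    log[:] = [(key, ts, value) for key, (ts, value) in newest.items()]

write("u1", "a")
write("u1", "b")   # supersedes "a", which stays in the log for now
write("u2", "c")
size_before = len(log)
compact()
size_after = len(log)
print(size_before, size_after, read("u1"))
```

Writes stay cheap because they never search for the old version, but every read has to skip stale entries until the cleanup runs, which is why a backlog of old versions slows things down.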
DynamoDB and Table Scans
Many of the major DynamoDB issues, such as throttling, have been addressed over the past few years. However, because AWS charges for DynamoDB by usage, operations like table scans can become very costly, since you are charged for the data you read. This expense has often dissuaded users from adopting DynamoDB as their NoSQL database.
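A back-of-the-envelope estimate shows why scans add up. The sketch assumes the commonly documented read-unit model, in which one read unit covers a strongly consistent read of up to 4 KB and an eventually consistent read costs half as much, and it treats a scan as being charged for all data scanned. The function names and table size are illustrative, and actual prices are deliberately left out.

```python
import math

READ_UNIT_BYTES = 4 * 1024  # assumed amount of data covered by one read unit

def read_units(num_bytes, strongly_consistent=False):
    """Estimate read units for reading num_bytes under the assumed model."""
    units = math.ceil(num_bytes / READ_UNIT_BYTES)
    return units if strongly_consistent else units / 2

# Reading one 1 KB item by key versus scanning a 10 GiB table.
single_item = read_units(1024, strongly_consistent=True)
full_scan = read_units(10 * 1024**3, strongly_consistent=True)

print(single_item, full_scan)
```

Under these assumptions a full scan of the example table costs millions of times more than a keyed lookup, so a query pattern that scans regularly deserves an index or a different key design.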
DynamoDB and Serverless
DynamoDB and Amazon Managed Apache Cassandra Service both pair well with AWS Lambda. With DynamoDB, you can treat each function call like an API call. Responses will be returned as HTTP responses with JSON objects inside. Each JSON object can be stored as is and passed on, without having to parse it into columns first.
When using Cassandra, you will need to parse information. However, you won’t need to have a strictly defined table. This avoids the hassle of having to add each new column to the dev environment, a step that, when forgotten, often results in a broken endpoint.
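A minimal sketch of this pattern follows. It assumes an API-style event with a JSON string under `body` and a table object exposing `put_item(Item=...)`, as a boto3 DynamoDB Table resource does. The `make_handler` and `FakeTable` names are hypothetical, and the in-memory table exists only so the example runs without AWS credentials.

```python
import json
from decimal import Decimal

def make_handler(table):
    """Build a Lambda-style handler that writes to the given table object."""
    def handler(event, context):
        # Decode the body into a dict; Decimal is used because boto3
        # does not accept Python floats for DynamoDB numbers.
        item = json.loads(event["body"], parse_float=Decimal)
        # Store the document as is, with no per-column mapping.
        table.put_item(Item=item)
        return {
            "statusCode": 200,
            "body": json.dumps({"stored": item["station_id"]}),
        }
    return handler

class FakeTable:
    """In-memory stand-in for a DynamoDB table (illustration only)."""
    def __init__(self):
        self.items = []

    def put_item(self, Item):
        self.items.append(Item)

table = FakeTable()
handler = make_handler(table)
response = handler({"body": '{"station_id": 101, "temperature": 21.5}'}, None)
print(response)
```

The handler never names individual fields other than the key, so adding a new attribute to the incoming JSON requires no change to the function or the table.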
In the end, both of these databases provide a lot of advantages to developers wanting to design systems that perform well and are easy to base applications on.
Which DB Is Best for You?
Amazon Managed Apache Cassandra Service has made CassandraDB a more attractive option than DynamoDB. Prior to the managed version, it was a hassle to deal with all of CassandraDB’s different nodes. However, this new service spares you from having to manage clusters and enables you to avoid vendor lock-in. While DynamoDB does offer more effective table scans, those scans can become quite costly, and their efficiency may not outweigh their expense. Amazon Managed Apache Cassandra Service allows developers to focus less on the management of their database and spend more time on development, setting it up to replace DynamoDB as the go-to database.