Introduction

In my last article about Cassandra, I discussed the architecture of the Cassandra distributed database at a high level. We covered how data is distributed and replicated across the nodes of a Cassandra cluster to guarantee partition tolerance and availability of data.

In this article, we will take a deeper look inside the Cassandra database to understand how it writes and reads data. Each node in a Cassandra cluster is treated the same as any other node and has the same internal architecture, so we are going to look at the Cassandra architecture from the perspective of an individual node.

The Read/Write architecture

As mentioned in the introductory article, Cassandra is a peer-to-peer database where the nodes in the cluster constantly communicate with each other to share and receive information (node status, data ranges and so on). There is no concept of master or slave in a Cassandra cluster. Any node in the cluster can serve a read or write request at any given point of time, even if the data being written or read doesn't belong to the node serving the request.

The following diagram represents the read/write path within a Cassandra database.

Well, this diagram seems too complicated. Let's simplify it and look into the write path (i.e. how Cassandra writes incoming data) first, followed by the read path (i.e. how data is read from a Cassandra database).

The Write Path

When a client performs a write operation against a Cassandra database, the data is written into different parts of the database. We must be familiar with the following components of the Cassandra database to understand its write path.

  1. Partition Key
  2. Commit Log
  3. Memtable
  4. SSTable
  5. Compaction

Let's briefly go through the above-mentioned components of a Cassandra database before discussing the write path in more detail.

  • Partition Key: Each table (and in turn each record) in Cassandra has a partition key, which determines the locality of the data, i.e. which node in the cluster the data should be stored on.
  • Commit Log: In simple words, the commit log is the transaction log (like redo logs in Oracle), and it is used for recovery in case of system failures. It is an append-only file and provides durability in a Cassandra database.
  • Memtable: The memtable is the in-memory cache of a Cassandra database (like the buffer cache in an Oracle database), used to store the in-memory copy of the data. Each node in Cassandra has one memtable per CQL table. The memtable accumulates writes (sorted by the partition key) and serves reads for data that has not yet been flushed (stored) to disk.
  • SSTable: This is the final destination of data in a Cassandra database. SSTables are the actual files on disk which store data permanently in a Cassandra database. We can relate these to datafiles in an Oracle database. However, SSTables in Cassandra are immutable, i.e. once data is written to an SSTable it is never updated (edited) again. In Cassandra, any modification (update) to a record is maintained as a new version (based on timestamp) and may get stored in the same or a different SSTable. There are a few additional components associated with SSTables which we will cover when we discuss the read path.
  • Compaction: This is the periodic process of merging multiple SSTables into a single SSTable. Compaction is primarily done to optimize read operations.
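To make the role of the partition key concrete, here is a toy Python sketch, not Cassandra's real Murmur3 partitioner: it hashes a partition key to a token and picks the owning node from a sorted token ring. The node names and the 0-999 token range are made up for the example.

```python
import bisect
import hashlib

def token_for(partition_key: str) -> int:
    # Stand-in for Cassandra's Murmur3 partitioner: hash the key to a token.
    return int(hashlib.md5(partition_key.encode()).hexdigest(), 16) % 1000

def owning_node(partition_key: str, ring: list) -> str:
    # ring is a sorted list of (token, node) pairs; the record belongs to the
    # first node whose token is >= the key's token (wrapping around the ring).
    tokens = [t for t, _ in ring]
    i = bisect.bisect_left(tokens, token_for(partition_key)) % len(ring)
    return ring[i][1]

ring = [(250, "node1"), (500, "node2"), (750, "node3"), (999, "node4")]
print(owning_node("user:42", ring))
```

Because the hash is deterministic, every node in the cluster computes the same owner for a given key, which is how any node can route a request without a master.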

The following diagram represents the write path of a Cassandra database.

When a client issues a write operation, Cassandra first writes the data into the commit log (for durability) and then writes an in-memory representation of the same data into the memtable. Once the data is written to the commit log and the memtable, Cassandra sends an acknowledgement back to the client confirming the write operation (even though the data has not yet been placed on disk permanently).

As more writes come in, Cassandra keeps appending the records (in sequential order) to the commit log and then to the memtable. However, when Cassandra writes data into the memtable, it writes the data in sorted order of the table's partition key.

Now, since the memtable is in memory and can't keep growing indefinitely, Cassandra periodically flushes the records from the memtable (memory) to disk, which results in the creation of an SSTable. Each flush of the memtable results in a new SSTable on disk. Moreover, the records stored in the SSTable are sorted by the partition key, as they were in the memtable, which facilitates sequential access of data from the SSTable. Cassandra is specifically designed to perform sequential writes and reads, as random reads and writes are prone to performance penalties.
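The write path described above can be sketched as a toy model in Python (a simplified illustration under assumed names, not Cassandra's actual implementation): the commit log is appended first, the memtable is kept sorted by key, and a flush produces an immutable, sorted SSTable.

```python
import bisect

class Node:
    """Toy model of one node's write path: commit log first, then memtable,
    with flushes producing immutable, sorted SSTables."""

    def __init__(self, flush_threshold=3):
        self.commit_log = []               # append-only durability log
        self.memtable = []                 # kept sorted by partition key
        self.sstables = []                 # each flush adds one immutable SSTable
        self.flush_threshold = flush_threshold

    def write(self, key, value):
        self.commit_log.append((key, value))        # step 1: durability
        bisect.insort(self.memtable, (key, value))  # step 2: sorted in-memory copy
        # the acknowledgement would be sent back to the client at this point
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # the memtable is already sorted, so the SSTable is written sequentially
        self.sstables.append(list(self.memtable))
        self.memtable.clear()
        self.commit_log.clear()  # the log is no longer needed for these records

node = Node()
for k, v in [("c", 3), ("a", 1), ("b", 2)]:
    node.write(k, v)
print(node.sstables)  # [[('a', 1), ('b', 2), ('c', 3)]]
```

Note how the keys come out sorted even though they arrived out of order: the sorting happens cheaply in memory, so the disk write stays sequential.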

Once the data is flushed from the memtable to an SSTable, we no longer need the commit log or the memtable for durability of that data. At this point the memtable is cleared, the commit log (file) segment related to those records is removed, and a new commit log is created for incoming records.

In the event of a system failure where data was not yet flushed to disk, Cassandra will reload the unflushed data from the commit log into the memtable during system startup and then persist it to disk during the next flush operation.
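As a toy illustration of this recovery step (hypothetical entries and names, not Cassandra's actual replay logic), the surviving commit-log records are simply re-applied to an empty memtable on startup:

```python
def replay(commit_log):
    """On startup, re-apply every unflushed commit-log record to a fresh
    memtable; the next flush then persists them to an SSTable as usual."""
    memtable = {}
    for key, value in commit_log:
        memtable[key] = value
    return memtable

# hypothetical commit-log entries that were not flushed before a crash
unflushed = [("user:7", "alice"), ("user:9", "bob")]
memtable = replay(unflushed)
print(memtable)  # {'user:7': 'alice', 'user:9': 'bob'}
```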

Cassandra flushes the memtable to disk (SSTable) when any of the following events occur.

  • When the memtable contents exceed a configurable threshold, controlled by the memtable_total_space_in_mb parameter
  • When the commit log is full, controlled by the commitlog_total_space_in_mb parameter
  • When the nodetool flush command is issued

As we have discussed, each flush of the memtable results in a new SSTable. This might not be an issue while the number of SSTables is small. However, over a period of time we will end up with a huge number of SSTables, and this can slow down read performance, as we might have to walk through multiple SSTables to read a record. Cassandra periodically performs a background operation known as compaction to merge SSTables in order to optimize read operations. Compaction is a critical and sophisticated process; I will discuss compaction in more detail in an upcoming article.
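The essential idea of compaction can be sketched as follows (a simplified model with made-up keys, values and timestamps, not Cassandra's actual compaction strategies): merge several SSTables into one, keeping only the newest version of each key.

```python
def compact(sstables):
    """Merge several SSTables (lists of (key, value, timestamp) tuples) into
    one, keeping only the newest version of each partition key."""
    merged = {}
    for sstable in sstables:
        for key, value, ts in sstable:
            if key not in merged or ts > merged[key][1]:
                merged[key] = (value, ts)
    # emit a new SSTable, again sorted by partition key
    return sorted((k, v, ts) for k, (v, ts) in merged.items())

old = [("a", "v1", 100), ("b", "v1", 100)]
new = [("a", "v2", 200), ("c", "v1", 150)]
print(compact([old, new]))
# [('a', 'v2', 200), ('b', 'v1', 100), ('c', 'v1', 150)]
```

Real compaction exploits the fact that SSTables are already sorted and streams through them merge-sort style instead of building a dictionary, but the timestamp tie-breaking shown here is the core of it.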

As we can see from the architecture, Cassandra has a remarkably simple and fast write path, which makes it well suited for write-heavy workloads.

The Read Path

The read path in a Cassandra database is a little more complicated (actually way more complicated) than the write path, as a variety of components are involved in reading data from a Cassandra database. The following diagram represents the read path in a Cassandra database.

In addition to the components that we have discussed in the 'Write Path' section, we must also be aware of the following components to understand the read path of a Cassandra database.

  1. Row Cache
  2. Bloom Filters
  3. (Partition) Key Cache
  4. Partition Indexes
  5. Partition Summaries
  6. Compression offsets

Let's briefly discuss the above-mentioned components before talking more about the read path in Cassandra.

  • Row Cache: This is an in-memory cache which stores recently read rows (records). It is an optional component in a Cassandra database.
  • Bloom Filters: A bloom filter indicates whether a partition key may exist in its corresponding SSTable; it can definitively rule a key out, but can only say that a key may be present.
  • (Partition) Key Cache: The key cache maps recently read partition keys to their specific SSTable offsets.
  • Partition Indexes: These are sorted partition keys mapped to their SSTable offsets. Partition indexes are created as part of SSTable creation and reside on disk.
  • Partition Summaries: This is an off-heap, in-memory sampling of the partition indexes, used to speed up access to the index on disk.
  • Compression offsets: Compression offsets keep the offset mapping information for compressed blocks. By default, all tables in Cassandra are compressed, and when Cassandra needs to read data, it looks into the in-memory compression offset maps and unpacks the data chunks.
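The "may exist / definitely doesn't exist" behaviour of a bloom filter can be sketched with a toy Python implementation (nothing like Cassandra's actual filter, which is tuned for size and speed, but the same guarantees hold):

```python
import hashlib

class BloomFilter:
    """Minimal bloom filter: says a key is definitely absent, or *may* be
    present. False positives are possible; false negatives are not."""

    def __init__(self, size=64, hashes=3):
        self.size, self.hashes, self.bits = size, hashes, 0

    def _positions(self, key):
        # derive several bit positions from independent hashes of the key
        for i in range(self.hashes):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:4], "big") % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def may_contain(self, key):
        # if any of the key's bits is unset, the key was never added
        return all(self.bits & (1 << pos) for pos in self._positions(key))

bf = BloomFilter()
bf.add("user:42")
print(bf.may_contain("user:42"))  # True, guaranteed for any added key
```

This is exactly why the bloom filter lets Cassandra skip SSTables: a negative answer is always trustworthy, so only the SSTables that answer "maybe" need to be consulted further.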

While reading records from Cassandra, if the record is present in the row cache, the read is served from the row cache without needing to look into any other location, as the row cache stores recently read records. This is the simplest and fastest read path available in Cassandra.

If we miss the row cache, Cassandra consults the bloom filters, which can tell in which SSTables the requested record definitely doesn't exist. In this way Cassandra can avoid reading unnecessary SSTables. Once Cassandra has consulted the bloom filters to identify the SSTables to be scanned for the requested record, it reaches out to the (partition) key cache, which holds the mapping of recently read partition keys to their corresponding SSTables. If we find the partition key (of the requested record) in the key cache, we can directly fetch the record from the SSTable, as the key cache points to the specific record offset in the SSTable.

However, if we miss the key cache, i.e. if the partition key is not in the key cache, Cassandra looks into the partition summary, which is a sampling of the partition index. The partition summary helps to jump to a specific offset in the partition index. Once we are in the partition index, we have the offset of the partition key in the SSTable and can directly fetch the record from that offset of the SSTable.
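This two-level lookup (in-memory summary first, then the on-disk index) can be sketched as follows; the sampling interval, key format and byte offsets are all made up for illustration and have nothing to do with Cassandra's real file formats:

```python
import bisect

def build_summary(partition_index, interval=128):
    """Sample every interval-th entry of the (sorted) on-disk partition index
    into a small in-memory summary of (key, position-in-index) pairs."""
    return [(k, i) for i, (k, _) in enumerate(partition_index) if i % interval == 0]

def lookup(key, summary, partition_index):
    # 1. binary-search the in-memory summary to find where to start reading
    keys = [k for k, _ in summary]
    start = summary[max(bisect.bisect_right(keys, key) - 1, 0)][1]
    # 2. scan a short stretch of the on-disk index for the exact key
    for k, offset in partition_index[start:]:
        if k == key:
            return offset  # byte offset of the partition within the SSTable
        if k > key:
            break
    return None

# hypothetical index: 1000 keys, each partition 512 bytes apart on disk
index = [(f"key{n:04d}", n * 512) for n in range(1000)]
summary = build_summary(index)
print(lookup("key0300", summary, index))  # 153600
```

The summary keeps only a fraction of the index in memory, yet limits the on-disk scan to at most one sampling interval, which is the whole point of the structure.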

One point to mention here is that no matter how Cassandra fetches a record from SSTables, it always needs to consult the compression offsets to be able to read the data from compressed blocks.

Now, as we discussed earlier, in Cassandra there can be multiple versions of the same record, as Cassandra doesn't perform in-place modification. Cassandra attaches a timestamp to each version of the record (specifically, to each column/field) and uses this timestamp to merge records from different SSTables and the memtable to present the current version of the complete record.
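This timestamp-based merge can be sketched as follows (hypothetical column values and timestamps; a simplified model that ignores tombstones and deletions):

```python
def merge_versions(sources):
    """Merge copies of the same record from the memtable and several SSTables,
    keeping the value with the newest timestamp for each column. Each source
    maps column -> (value, timestamp)."""
    latest = {}
    for record in sources:
        for column, (value, ts) in record.items():
            if column not in latest or ts > latest[column][1]:
                latest[column] = (value, ts)
    return {col: val for col, (val, ts) in latest.items()}

# hypothetical versions of one record spread across two SSTables and a memtable
sstable_v1 = {"name": ("alice", 100), "email": ("a@old.com", 100)}
sstable_v2 = {"email": ("a@new.com", 200)}
memtable_v = {"city": ("Paris", 300)}
print(merge_versions([sstable_v1, sstable_v2, memtable_v]))
# {'name': 'alice', 'email': 'a@new.com', 'city': 'Paris'}
```

Notice that the merge happens per column, not per row: the final record combines columns written at three different times.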

As mentioned earlier, the row cache is an optional component; if it is configured, it is updated along the path while returning the record to the client.

Conclusion

In this article, we have discussed in detail the internal mechanisms involved in writing and reading data to and from a Cassandra database. Cassandra has a remarkably simple write path, which makes it a prime contender for write-heavy workloads. We have also discussed how the different components work together to serve a read request from a Cassandra database. Cassandra performs a lot of internal optimizations to provide efficient read performance.