Introduction

Over the last few years, NoSQL databases have emerged as a cutting-edge alternative to traditional relational database systems. Prior to the NoSQL disruption, organizations relied on traditional database systems for all kinds of workloads. A primary drawback of relational databases is that they do not perform well at scale, and the only option to scale those systems is to add more resources like RAM, CPU, disks, etc. to the existing machine (which is referred to as vertical scaling). However, even with vertical scaling, relational databases still do not perform exceptionally well, primarily due to design limitations around third normal form and table joins.

The introduction of NoSQL databases changed this perception about database systems. Most NoSQL databases are based on a distributed architecture, where the system is comprised of a number of low-cost commodity servers. NoSQL databases are known for their scaling ability, whereby we can add more low-cost commodity servers (referred to as horizontal scaling) to achieve scalable performance. However, NoSQL databases have their own limitations, such as lack of ACID compliance and limited transaction support, and hence they cannot be thought of as a straightforward replacement for traditional relational database systems.

There are a number of popular NoSQL databases ruling the database world today. In this article, I will talk about one such NoSQL database called Cassandra, which is being widely adopted by businesses. We will explore the framework of Cassandra and some of the key terminologies used in Cassandra.

Before we begin

Before we begin the discussion about Cassandra, we need to understand the foundation behind the design of a distributed database system. That foundation is the CAP theorem (illustrated in the following diagram), which states that it is impossible for a distributed computing system to simultaneously guarantee Consistency, Availability and Partition Tolerance. Therefore, each distributed system must make a trade-off and can fully guarantee only two of these three properties.

As per the CAP theorem, a distributed system can guarantee Consistency and Availability (CA) while trading off Partition Tolerance, guarantee Consistency and Partition Tolerance (CP) while trading off Availability, or guarantee Availability and Partition Tolerance (AP) while trading off Consistency.

Cassandra is a highly available and partition-tolerant distributed database system with tunable (eventual) consistency. In Cassandra, we can tune the consistency on a per-query basis (which would be a topic for another discussion).
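
To make this concrete, here is a minimal sketch of per-query consistency using the DataStax Python driver (cassandra-driver). The contact point, keyspace and table names (127.0.0.1, demo, users) are hypothetical placeholders rather than anything defined in this article.

    # Minimal sketch: tuning consistency per query with the DataStax Python driver.
    # The keyspace/table below (demo, users) are hypothetical placeholders.
    from cassandra import ConsistencyLevel
    from cassandra.cluster import Cluster
    from cassandra.query import SimpleStatement

    cluster = Cluster(["127.0.0.1"])       # any reachable node(s) of the cluster
    session = cluster.connect("demo")

    # A read that only needs one replica to respond (faster, weaker consistency).
    fast_read = SimpleStatement(
        "SELECT email FROM users WHERE user_id = %s",
        consistency_level=ConsistencyLevel.ONE,
    )
    print(session.execute(fast_read, ("alice",)).one())

    # A write that requires a majority of replicas to acknowledge (stronger consistency).
    safe_write = SimpleStatement(
        "INSERT INTO users (user_id, email) VALUES (%s, %s)",
        consistency_level=ConsistencyLevel.QUORUM,
    )
    session.execute(safe_write, ("alice", "alice@example.com"))

    cluster.shutdown()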

What is Cassandra

Cassandra is an open source distributed database system, built with the understanding that system and hardware failures do occur. Cassandra is designed so that there is no single point of failure. It addresses the problem of system/hardware failure by implementing a peer-to-peer distributed database across homogeneous nodes (servers/machines), where data is distributed across all the nodes in the Cassandra cluster as shown in the following diagram.

In Cassandra, there is no master/slave concept. Each node (server) in a Cassandra cluster functions equally. A client read or write request can be routed to any node (which acts as the coordinator for that particular request) in the Cassandra cluster, irrespective of whether that node owns the requested/written data. Each node in the cluster exchanges information (using a peer-to-peer protocol) across the cluster every second, which makes it possible for the nodes to know which node owns which data and the status of the other nodes in the cluster. If a node fails, any other available node will serve the client's request. Cassandra guarantees availability and partition tolerance by replicating data across the nodes in the cluster. We can control the number of replicas (copies of data) to be maintained in the cluster through the "Replication Factor" (more details in an upcoming section). There is no primary or secondary replica; each replica is equal, and a client request can be served from any available replica.
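
As an illustration of the coordinator concept, the sketch below (DataStax Python driver, with hypothetical node addresses) connects through a few contact points; the driver discovers the rest of the ring, and whichever live node receives a request coordinates it, whether or not it owns the data.

    # Minimal sketch: the client never has to know which node owns the data.
    # The three addresses below are hypothetical contact points in one cluster.
    from cassandra.cluster import Cluster

    cluster = Cluster(["10.0.0.1", "10.0.0.2", "10.0.0.3"])
    session = cluster.connect()

    # This request can be coordinated by any live node the driver knows about,
    # not necessarily a replica that owns the row being read.
    row = session.execute("SELECT release_version FROM system.local").one()
    print(row.release_version)

    cluster.shutdown()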

Data Distribution

Since Cassandra is a distributed database, it needs to distribute data across all the nodes (servers) in a cluster to achieve performance at scale. Cassandra also needs to replicate the same data across multiple nodes in the cluster to guarantee availability and fault (partition) tolerance.

In Cassandra, data is distributed across the cluster using a mechanism called "consistent hashing", which is based upon a token ring. In general, a token ring is a range of values (in Cassandra it ranges from -2^(63) to 2^(63)-1) which is divided into token ranges, and these token ranges are assigned to the nodes in the cluster. When we write/insert data, the partitioning column (in Cassandra each row/record has a partitioning column) is hashed using a hashing function, and the hashed value is matched against the token ranges of the nodes in the cluster to determine which node the data should be placed on. This mechanism of hashing against a token range helps distribute the data evenly across the cluster.

In a Cassandra cluster, the nodes need to be distributed throughout the entire possible token range, from -2^(63) to 2^(63)-1. Each node in the cluster must be assigned a token. This token determines the position of the node in the token ring and its range of data. For instance, if we have a token ring ranging from 0 to 200 and 5 nodes in the cluster, then the tokens assigned to the nodes would be 0, 40, 80, 120 and 160 respectively. The fundamental idea behind token assignment is to balance the data distribution across the nodes in the cluster.
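
The sketch below is a toy illustration (plain Python, not Cassandra's Murmur3 implementation) of the same idea on the small 0 to 200 ring: tokens are spaced evenly across five nodes, a partition key is hashed onto the ring, and the owning node is the one whose token closes the range the hash falls into.

    # Toy sketch of manual token assignment and consistent hashing on a small
    # 0-200 ring with 5 nodes. Cassandra itself uses Murmur3 over
    # -2**63 .. 2**63 - 1; the md5-based hash here is only illustrative.
    import hashlib
    from bisect import bisect_left

    RING_MIN, RING_MAX = 0, 200
    NODES = ["node1", "node2", "node3", "node4", "node5"]

    ring_size = RING_MAX - RING_MIN
    tokens = [RING_MIN + i * ring_size // len(NODES) for i in range(len(NODES))]
    token_to_node = dict(zip(tokens, NODES))
    print(tokens)  # [0, 40, 80, 120, 160]

    def owner(partition_key: str) -> str:
        """Hash the partition key onto the ring and return the owning node."""
        digest = int(hashlib.md5(partition_key.encode()).hexdigest(), 16)
        key_token = RING_MIN + digest % ring_size
        # A node with token T owns (previous token, T], so walk clockwise to
        # the first node token >= the key's token, wrapping around the ring.
        idx = bisect_left(tokens, key_token)
        return token_to_node[tokens[idx % len(tokens)]]

    for key in ("alice", "bob", "carol"):
        print(key, "->", owner(key))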

In earlier versions of Cassandra (prior to 1.2), we had to manually assign a token to each node while configuring a Cassandra cluster or adding/removing nodes in/from the cluster. In those versions, we generated the tokens based on the number of nodes in the cluster and assigned a single token range per node. This type of token assignment has limitations with respect to cluster maintenance, such as adding or removing nodes and balancing data across nodes. The following diagram illustrates the manual method of token range generation for an 8-node cluster using the Cassandra token ring (values ranging from -2^(63) to 2^(63)-1).

For a better understanding, the following diagram represents the token range distribution for a 5-node cluster with tokens ranging from 0 to 200.

In version 1.2, Cassandra introduced virtual nodes (vnodes), which eliminate the need for manual token assignment and management. With vnodes, the token ring (-2^(63) to 2^(63)-1) gets automatically divided into a number of small token ranges, and each node owns multiple sets of these small token ranges rather than just one token range. In a vnode configuration, we just need to specify the number of tokens to be generated for a node (while configuring the cluster or while adding a node), and Cassandra will automatically calculate or recalculate (when adding or removing a node) and distribute the token ranges across the cluster, as well as rebalance the data based on the new token range distribution. In a vnode configuration, each node has a large number of small token ranges rather than a single, very large token range (which was the case with earlier versions).

The following diagram illustrates the large number of small token ranges (A to Q) generated by the vnode method using the Cassandra token ring (values ranging from -2^(63) to 2^(63)-1).

In a vnode setup, each node will by default have 256 small token ranges. For instance, if we have a 10-node Cassandra cluster, there will be 2,560 (256 × 10) small token ranges generated from the Cassandra token ring (-2^(63) to 2^(63)-1), as opposed to 10 huge token ranges in earlier versions.
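
The toy sketch below (plain Python, with a deliberately small number of tokens per node so the output stays readable) shows the vnode idea: each node picks many tokens instead of one, so its ownership ends up scattered across many small ranges on the ring.

    # Toy sketch of vnode-style token assignment: each node generates many
    # random tokens instead of a single one, so ownership is interleaved
    # across many small ranges. num_tokens is 8 here for readability; the
    # default described above is 256.
    import random

    RING_MIN, RING_MAX = -2**63, 2**63 - 1
    NODES = ["node1", "node2", "node3"]
    NUM_TOKENS = 8  # tokens per node

    random.seed(42)
    assignments = {
        node: [random.randint(RING_MIN, RING_MAX) for _ in range(NUM_TOKENS)]
        for node in NODES
    }

    # Flatten into one ring: each token marks the end of a small range owned
    # by the node that generated it.
    ring = sorted((token, node) for node, toks in assignments.items() for token in toks)
    for token, node in ring:
        print(f"range ending at {token:>22} -> {node}")
    print("total small ranges on the ring:", len(ring))  # 3 nodes * 8 tokens = 24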

Key components of Cassandra

We covered the high-level framework of the Cassandra distributed database in the previous section. Let's talk about some of the key structures/components of a Cassandra cluster, listed below.

  • Node: A node is the basic infrastructure component of a Cassandra cluster. Each node stores data and must run the same version of the Cassandra binaries.
  • Data center: A data center in Cassandra is a set of related Cassandra nodes. A data center can be either physical or virtual. Data centers are ideally configured for different sets of workloads.
  • Cluster: A cluster contains one or more data centers. It can span physical locations.
  • Rack: A rack is a small subset of nodes in a data center or cluster which shares common failure resources, such as a power supply. A rack can be considered as an availability zone.
  • Gossip: Gossip is the peer-to-peer protocol used by Cassandra for inter-node communication. Cassandra uses this protocol to discover and share location and state information about the other nodes in the cluster. Cassandra also persists gossip information locally on each node so it can be used immediately in the event of a node restart.
  • Table (column family): A table (column family) is a collection of ordered columns fetched by row. A row consists of columns and is identified by a primary key, the first part of which is the partition key.
  • Keyspace: A keyspace is a logical component in a Cassandra database. It is the namespace for a group of related tables (column families).

  • cassandra.yaml: This is the main configuration file for a Cassandra node. It contains all the initialization settings for that node.

  • Partitioner: A partitioner determines how to distribute data across the cluster nodes and on which node to place the first copy of a piece of data. A partitioner is essentially a hash function that computes the token of a partition key. Each row of data in Cassandra is identified by a partition key and is distributed across the cluster by computing the token of that partition key. Murmur3Partitioner is the default partitioner for a Cassandra cluster. We define the partitioner in the cassandra.yaml configuration file.

  • Replication Factor: The replication factor (RF) determines the number of replicas of data to be maintained across the cluster. A replication factor of 2 means Cassandra will maintain 2 copies of each row in the cluster, on different nodes. In that case, if a node holding the data is unavailable, client requests can still be served by the other replica. We can set a different replication factor for each data center in a cluster. The replication factor is defined on a per-keyspace basis.

  • Replica Placement Strategy: Cassandra stores replicas (copies of data) on multiple nodes to ensure availability and fault (partition) tolerance. A placement strategy determines on which nodes the replicas are placed. NetworkTopologyStrategy is the recommended replica placement strategy for a production cluster, as it places replicas based on the network topology and makes it easier to expand the cluster later if required. Again, we define the replica placement strategy on a per-keyspace basis (see the sketch after this list, which ties the keyspace, replication factor and placement strategy together).

  • Snitch: A snitch groups machines into data centers and racks. In other words, the snitch lays out the Cassandra cluster topology that the replication strategy uses to place the replicas. We must configure a snitch when we create a cluster. The default is the SimpleSnitch, which does not recognize data center or rack information and hence is not recommended for production. GossipingPropertyFileSnitch is recommended for a production cluster. It defines a node's data center and rack and uses the gossip protocol to propagate this information to the other nodes in the cluster. We use the cassandra.yaml configuration file to set the snitch type and a separate file (cassandra-topology.properties or cassandra-rackdc.properties) to configure the snitch.
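
To tie several of these components together, here is a minimal sketch using the DataStax Python driver: it creates a keyspace whose replication settings combine the placement strategy and the per-data-center replication factor, creates a table inside it, and uses the CQL token() function to look at the value the partitioner computes for a partition key. The contact point, data center name (dc1) and keyspace/table names are hypothetical and would need to match your own cluster.

    # Minimal sketch: keyspace (placement strategy + replication factor per
    # data center), a table with a partition key, and the partitioner's token.
    # The data center name 'dc1' must match what your snitch reports.
    from cassandra.cluster import Cluster

    cluster = Cluster(["127.0.0.1"])
    session = cluster.connect()

    # Replica placement strategy and replication factor are set per keyspace.
    session.execute("""
        CREATE KEYSPACE IF NOT EXISTS demo
        WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 3}
    """)

    # A table (column family); user_id is the partition key.
    session.execute("""
        CREATE TABLE IF NOT EXISTS demo.users (
            user_id text PRIMARY KEY,
            email   text
        )
    """)

    session.execute("INSERT INTO demo.users (user_id, email) VALUES (%s, %s)",
                    ("alice", "alice@example.com"))

    # token() exposes the hash the partitioner (Murmur3Partitioner by default)
    # computes for the partition key, i.e. the row's position on the ring.
    row = session.execute(
        "SELECT token(user_id) FROM demo.users WHERE user_id = %s", ("alice",)
    ).one()
    print(row)

    cluster.shutdown()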

Conclusion

In this article, we explored the Cassandra distributed database at a high level. We discussed the foundation behind the design of a Cassandra distributed database and familiarized ourselves with a number of key terminologies used with it. In the next article, we will take a deep dive into Cassandra's internal database architecture and try to understand how it functions internally.