Written by Juan Carlos Olamendy Turruellas

 

Introduction

 

In my last two articles I’ve talked about the architecture of Apache Cassandra (http://goo.gl/Xx0y6Z) as well as how we can provision a cluster (http://goo.gl/hWVBbz). In this article, I want to talk about the Cassandra query language (CQL).

 

CQL Basics

 

CQL is a declarative language similar to SQL. CQL is today the main API for interacting with Cassandra.

 

The key concepts are:

 

  • Keyspace. This is similar to a schema in Oracle database or a database in MySQL. In this way, it’s a container of database objects (tables and indexes). The most important aspect of a keyspace is for defining the replication factor and strategy for its own data    
     
  • Column family/table. This is similar to a table in Oracle database. It’s remarkable to say that in Cassandra, unlike relational databases, we don’t have to model all the required fields by the application upfront as well as each row is not required to have the same fields

  • Row key. This is similar to a primary key in Oracle database. It enables uniquely identifying a row. A row key is made of two parts (composite key using multiple columns): partition key and cluster key.
         
    *  Partition key: Determines the node where the data is stored
    *  Cluster key: Determine the sort order of the dataset
         
    In this way, the row key helps to distribute the data in the cluster, so when designing the data model, we must think carefully in order to distribute the data evenly.

 

When thinking about a Cassandra table, we have to think of a nested sorted map data structure (see listing 01) instead of a list of tuples such as in relational databases.

 

Map<RowKey, SortedMap<ColumnKey, ColumnValue>>

 

Listing 01

 


 

 

Figure 01

 

It’s remarkable to say that there are limitations in Cassandra query versus SQL:

  • CQL predicates (WHERE clauses) can only contain columns specified in the primary key
  • There is no join between CQL tables (JOIN clauses). In this way, we need to design our data model in de-normalized way
  • There is no sub-queries in CQL
  • There is no GROUP BY in CQL
  • ORDER BY clause can only applied to cluster keys

Demo part

 

Now, let’s start writing CQL statements in order to learn with practical example. First step is to launch the Cassandra shell as shown in the listing 02.

 

$CASSANDRA_HOME/bin/cqlsh

 

Listing 02

 

Let’s create a keyspace to hold the tables that we’re going to create. It’s common to have a keyspace for application. The replication factor defines that when a row is inserted into a table, then it’s also stored on three different nodes in the cluster.

 

cqlsh> CREATE KEYSPACE sampledb WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor': 3 };

 

Listing 03

 

In order to create tables inside the sampled keyspace, we need to navigate with the USE command (see listing 04). All further CQL statements will be executed in this context.

 

cqlsh> USE sampledb;

 

Listing 04

 

Now let’s create a table and insert some rows in it as shown in the listing 05. In this case, field01 is the partition key while field02 is the cluster key.

 

cqlsh:sampledb> CREATE TABLE example (

field01 int,

field02 int,

field03 text,

PRIMARY KEY((field01), field02)

);

 

cqlsh:sampledb> INSERT INTO example (field01, field02, field03) VALUES (1,2,'a');

cqlsh:sampledb> INSERT INTO example (field01, field02, field03) VALUES (2,3,'b');

Listing 05

 

Now let’s analyze how the data is stored. If we list the data directory (by default $CASSANDRA_HOME/ /data/sampledb/example-*), then we don’t see any data file. We can flush manually the data from memtable to disk in order to create an SSTable using the command nodetool as shown in the listing 06.

 

$ nodetool flush sampledb

 

Listing 06

 

Now we can realize that an SSTable is made of a several files:

 

  • Data.db: Contains the data related to the SSTable. All other SSTable related files are generated from this one
  • CompressionInfo.db: Holds information about the uncompressed data length
  • Filter.db: Contains the bloom filter
  • Index.db: Contains an index of the row keys with pointers to their position in the data file
  • Summary.db: Contains an index summary
  • Statistics.db: Contains statistical metadata about the data
  • TOC.txt: Contains a list of files outputted for each SSTable

 

Remember the data is stored using nested sorted map data structure as shown in the figure 01.

 

In order to visualize the internal storage format, we’re going to use sstabledump command to convert a binary SSTable into JSON format as shown in the listing 07.

 

$ sstabledump *-Data.db

 

Listing 07

 

The output is shown in the listing 08.

 

[

{

"partition" : {

"key" : [ "1" ],

"position" : 0

},

"rows" : [

{ "type" : "row",

"position" : 18,

"clustering" : [ "2" ],

"liveness_info" : { "tstamp" : "2016-08-13T16:20:01.400587Z" },

"cells" : [

{ "name" : "field03", "value" : "a" }

]

}

]

},

{

"partition" : {

"key" : [ "2" ],

"position" : 31

},

"rows" : [

{ "type" : "row",

"position" : 49,

"clustering" : [ "3" ],

"liveness_info" : { "tstamp" : "2016-08-13T16:23:36.112425Z" },

"cells" : [

{ "name" : "field03", "value" : "b" }

]

}

]

}

]

 

Listing 08

 

You can see that internally, the data is spread by partitions. In this case, the partition key spreads by 1 and 2. Inside each partition, we have a list of rows ordered/clustered by the cluster key. In this case, the field02 is the cluster key and create the first ordered map. And finally, we have the inner map for associating the field name to its underlying value. In our case, the only field storing real data is the field03. You can see the timestamp for each row, and just remember in the previous articles, this timestamp is used to compute the most recent data for each row.

 

You can also update and delete rows in a Cassandra table as shown in the listing 09.

 

cqlsh:sampledb> UPDATE example SET field03='d' WHERE field01=1;

 

cqlsh:sampledb> DELETE FROM example WHERE field01=2;

 

Listing 09

 

Conclusion

 

In this article, I've illustrated using real world examples how to work with Cassandra using its query language. Now you can apply these scripts in order to query your own Cassandra cluster.