This document provides an overview of Cassandra, an open source distributed database. It discusses Cassandra's query language (CQL), which is similar to SQL but only supports Cassandra operations. It also covers Cassandra's tabular data model with rows, columns, and strong schemas. The document reviews tools for working with Cassandra and best practices for data modeling and application methodology, emphasizing denormalization and idempotency over transactions. It notes limitations of batches, secondary indices, and lightweight transactions in Cassandra.
1 of 20
Download to read offline
More Related Content
Cassandra: An Alien Technology That's not so Alien
2. Who am I?
Brian Hess
Sr. Product Manger for Analytics
Math and CS background
Distributed Systems
Algorithms
Government data mining research
Data warehousing (Netezza)
SQL and data mining and UDF fun
Joined Datastax 1.5 years ago
New to NoSQL and Cassandra
息 2015. All Rights Reserved. 2
5. Agenda
The Good
Query Language
Tooling
Conceptual Data
Model
The Different
Application
Methodology
Connections/Drivers
Data Modeling
The Dangerous
Batches
Secondary Indices
Lightweight
Transactions
息 2015. All Rights Reserved. 5
6. Cassandra Query Language
Very much like SQL
SELECT, INSERT, DELETE
WHERE clauses
CREATE TABLE, GRANT, etc, etc
SQL syntax for Cassandra operations
Only supports Cassandra operations
Not trying to cover the full SQL space
Missing several things:
JOIN, GROUP BY, windowed aggregates, subqueries, WITH, etc
息 2015. All Rights Reserved. 6
7. Cassandra Query Language
Pop Quiz: CQL or SQL:
1. SELECT a, b, c FROM myData;
2. SELECT a, b, c FROM myData ORDER BY a LIMIT 3;
3. SELECT cust_id, txn_id FROM txn WHERE cust_id=5;
4. SELECT sensor, MAX(temp) FROM txn WHERE sensor=3;
5. SELECT MAX(temp) FROM txn
6. SELECT a.id, a.info, b.moreinfo FROM a JOIN b ON
(a.id=b.id)
息 2015. All Rights Reserved. 7
8. Cassandra Query Language
Pop Quiz: CQL or SQL:
1. SELECT a, b, c FROM myData;
2. SELECT a, b, c FROM myData ORDER BY a LIMIT 3;
3. SELECT cust_id, txn_id FROM txn WHERE cust_id=5;
4. SELECT sensor, MAX(temp) FROM txn WHERE sensor=3;
5. SELECT MAX(temp) FROM txn
6. SELECT a.id, a.info, b.moreinfo FROM a JOIN b ON
(a.id=b.id)
息 2015. All Rights Reserved. 8
9. Tabular Data Model
Rows and Columns
Like SQL, R data.frame, etc
Strong schema
Each column has a data type
Custom data types including Map, Set, List, and UDTs
Can mimic old style thrift tables with Map
Most Cassandra tables really have a schema
Most data really has a schema
息 2015. All Rights Reserved. 9
10. Tooling
Cqlsh
Command-line CQL interpreter
Mainly used for management operations
CREATE KEYSPACE, CREATE TABLE, etc
GRANT, etc
Singleton INSERTs
Some light load/unload operations via COPY
息 2015. All Rights Reserved. 10
12. Data Modeling
Start with the query
How will you access this data
Then create the schema optimized for those queries
Store it multiple times if you must Query Tables
Uniqueness what is an overwrite versus an insert?
Partition keys for fast lookups
Clustering Columns for ranges, etc
Mix and match for different query patterns Users by ID, Users by Email, etc
No joins
So, no star schema
Denormalize!
息 2015. All Rights Reserved. 12
13. Application Methodology
Instead of rolling back, we retry
In SQL you try a transaction, and if error, then rollback
Leverages transaction isolation and rollback
In Cassandra you try, and if error, try again
Leverages idempotent data model and high availability
Well, did you want to write that to the database or not?
Instead of Transactions and Rollback, we have Idempotency and Retry
息 2015. All Rights Reserved. 13
14. Drivers / Connections
Multilanguage drivers
Java, Python, C/C++, PHP, Ruby, etc
Leverage the whole cluster
Connect to all the nodes connect to one and discovering the others
Load balancing Round Robin, etc
Smart routing of queries Token Aware Routing
Not JDBC / ODBC
Cassandra-specific, but similar
Cluster, Session, Statement, ResultSet, etc
Different data types, etc
Lower-level configurations number of connections, paging size, etc
息 2015. All Rights Reserved. 14
15. Batches
These are not what you think they are
How they work
Send all the statements to the coordinator
If logged, then a batchlog is written for durability and replicates it
Executes each statement involves other nodes (possibly)
They are not done as an isolated set of INSERTs
Well, they are if they all update the same partition (nuance)
Things will be seen as they are executed
They do not speed things up
The coordinator has to do all the work latency increase and timeouts
You still need to talk to multiple nodes for multiple INSERTs
Logged batches require a lot of work by the coordinator
Maintain a batchlog for durability and replicates it
But, they do serve a purpose
Inter-table consistency (logged batches)
Some bulk load performance (unlogged batches)
息 2015. All Rights Reserved. 15
16. Secondary Indices
These are not what you think they are
Secondary Indices need to consult every Cassandra node
Cassandra is optimized for single node queries
Okay for single-partition queries, but not really worth it
Maintaining secondary indices is also an overhead
Basically, they rarely make things faster in cases that matter
Consider Materialized Views in Cassandra 3.0
息 2015. All Rights Reserved. 16
17. Lightweight Transactions
These are not what you think they are
They are not SQL transactions
They do not support roll-back, etc
They do allow for serialization and isolation
Useful for operations like creating accounts
INSERT INTO users(username, email) VALUES
('cassandra_fan', 'abc@def.com') IF NOT EXISTS
They are costly
Paxos algorithm multiple passes/rounds
But they do serve a purpose
Use sparingly
息 2015. All Rights Reserved. 17