�ݺ�ߣ

Cassandra: An Alien Technology That’s not so Alien

Who am I?
• Brian Hess
– Sr. Product Manger for Analytics
• Math and CS background
– Distributed Systems
– Algorithms
• Government data mining research
• Data warehousing (Netezza)
– SQL and data mining and UDF fun
• Joined Datastax 1.5 years ago
– New to NoSQL and Cassandra
© 2015. All Rights Reserved. 2

Agenda
The Good
• Query Language
• Tooling
• Conceptual Data
Model
The Different
• Application
Methodology
• Connections/Drivers
• Data Modeling
The Dangerous
• Batches
• Secondary Indices
• Lightweight
Transactions

Cassandra Query Language
• Very much like SQL
– SELECT, INSERT, DELETE
– WHERE clauses
– CREATE TABLE, GRANT, etc, etc
• SQL syntax for Cassandra operations
– Only supports Cassandra operations
– Not trying to cover the full SQL space
• Missing several things:
– JOIN, GROUP BY, windowed aggregates, subqueries, WITH, etc

• Pop Quiz: CQL or SQL:
1. SELECT a, b, c FROM myData;
2. SELECT a, b, c FROM myData ORDER BY a LIMIT 3;
3. SELECT cust_id, txn_id FROM txn WHERE cust_id=5;
4. SELECT sensor, MAX(temp) FROM txn WHERE sensor=3;
5. SELECT MAX(temp) FROM txn
6. SELECT a.id, a.info, b.moreinfo FROM a JOIN b ON
(a.id=b.id)

Tabular Data Model
• Rows and Columns
– Like SQL, R data.frame, etc
• Strong schema
– Each column has a data type
– Custom data types including Map, Set, List, and UDTs
– Can mimic “old style thrift tables” with Map
• Most Cassandra tables really have a schema
– Most data really has a schema

Tooling
• Cqlsh
– Command-line CQL interpreter
– Mainly used for management operations
• CREATE KEYSPACE, CREATE TABLE, etc
• GRANT, etc
– Singleton INSERTs
– Some light load/unload operations via COPY

Tooling
• DevCenter
– DataStax tool
– “Toad for Cassandra”

Data Modeling
• Start with the query
– How will you access this data
– Then create the schema optimized for those queries
– Store it multiple times if you must – “Query Tables”
• Uniqueness – what is an overwrite versus an insert?
– Partition keys – for fast lookups
– Clustering Columns – for ranges, etc
– Mix and match for different query patterns – Users by ID, Users by Email, etc
• No joins
– So, no star schema
– Denormalize!

Application Methodology
• Instead of rolling back, we retry
• In SQL you try a transaction, and if error, then rollback
– Leverages transaction isolation and rollback
• In Cassandra you try, and if error, try again
– Leverages idempotent data model and high availability
• “Well, did you want to write that to the database or not?”
– “Instead of Transactions and Rollback, we have Idempotency and Retry”

Drivers / Connections
• Multilanguage drivers
– Java, Python, C/C++, PHP, Ruby, etc
• Leverage the whole cluster
– Connect to all the nodes – connect to one and discovering the others
– Load balancing – Round Robin, etc
– Smart routing of queries – “Token Aware Routing”
• Not JDBC / ODBC
– Cassandra-specific, but similar
– Cluster, Session, Statement, ResultSet, etc
– Different data types, etc
– Lower-level configurations – number of connections, paging size, etc

Batches
• These are not what you think they are
• How they work…
– Send all the statements to the coordinator
– If “logged”, then a batchlog is written for durability – and replicates it
– Executes each statement – involves other nodes (possibly)
• They are not done as an isolated set of INSERTs
– Well, they are if they all update the same partition (nuance)
– Things will be seen as they are executed
• They do not speed things up
– The coordinator has to do all the work – latency increase and timeouts
– You still need to talk to multiple nodes for multiple INSERTs
– Logged batches require a lot of work by the coordinator
• Maintain a batchlog for durability – and replicates it
• But, they do serve a purpose
– Inter-table consistency (logged batches)
– Some bulk load performance (unlogged batches)

Secondary Indices
• Secondary Indices need to consult every Cassandra node
– Cassandra is optimized for “single node queries”
– Okay for single-partition queries, but not really worth it
• Maintaining secondary indices is also an overhead
• Basically, they rarely make things faster in cases that matter
• Consider Materialized Views in Cassandra 3.0

Lightweight Transactions
• They are not SQL transactions
– They do not support roll-back, etc
• They do allow for serialization and isolation
– Useful for operations like creating accounts
– INSERT INTO users(username, email) VALUES
('cassandra_fan', 'abc@def.com') IF NOT EXISTS
• They are costly
– Paxos algorithm – multiple passes/rounds
• But they do serve a purpose
– Use sparingly

Cassandra?

Cassandra!

�ݺ�ߣ

Cassandra: An Alien Technology That's not so Alien

More Related Content

Cassandra: An Alien Technology That's not so Alien