際際滷

際際滷Share a Scribd company logo
Cassandra: An Alien Technology Thats not so Alien
Who am I?
 Brian Hess
 Sr. Product Manger for Analytics
 Math and CS background
 Distributed Systems
 Algorithms
 Government data mining research
 Data warehousing (Netezza)
 SQL and data mining and UDF fun
 Joined Datastax 1.5 years ago
 New to NoSQL and Cassandra
息 2015. All Rights Reserved. 2
息 2015. All Rights Reserved. 3
息 2015. All Rights Reserved. 4
Agenda
The Good
 Query Language
 Tooling
 Conceptual Data
Model
The Different
 Application
Methodology
 Connections/Drivers
 Data Modeling
The Dangerous
 Batches
 Secondary Indices
 Lightweight
Transactions
息 2015. All Rights Reserved. 5
Cassandra Query Language
 Very much like SQL
 SELECT, INSERT, DELETE
 WHERE clauses
 CREATE TABLE, GRANT, etc, etc
 SQL syntax for Cassandra operations
 Only supports Cassandra operations
 Not trying to cover the full SQL space
 Missing several things:
 JOIN, GROUP BY, windowed aggregates, subqueries, WITH, etc
息 2015. All Rights Reserved. 6
Cassandra Query Language
 Pop Quiz: CQL or SQL:
1. SELECT a, b, c FROM myData;
2. SELECT a, b, c FROM myData ORDER BY a LIMIT 3;
3. SELECT cust_id, txn_id FROM txn WHERE cust_id=5;
4. SELECT sensor, MAX(temp) FROM txn WHERE sensor=3;
5. SELECT MAX(temp) FROM txn
6. SELECT a.id, a.info, b.moreinfo FROM a JOIN b ON
(a.id=b.id)
息 2015. All Rights Reserved. 7
Cassandra Query Language
 Pop Quiz: CQL or SQL:
1. SELECT a, b, c FROM myData;
2. SELECT a, b, c FROM myData ORDER BY a LIMIT 3;
3. SELECT cust_id, txn_id FROM txn WHERE cust_id=5;
4. SELECT sensor, MAX(temp) FROM txn WHERE sensor=3;
5. SELECT MAX(temp) FROM txn
6. SELECT a.id, a.info, b.moreinfo FROM a JOIN b ON
(a.id=b.id)
息 2015. All Rights Reserved. 8
Tabular Data Model
 Rows and Columns
 Like SQL, R data.frame, etc
 Strong schema
 Each column has a data type
 Custom data types including Map, Set, List, and UDTs
 Can mimic old style thrift tables with Map
 Most Cassandra tables really have a schema
 Most data really has a schema
息 2015. All Rights Reserved. 9
Tooling
 Cqlsh
 Command-line CQL interpreter
 Mainly used for management operations
 CREATE KEYSPACE, CREATE TABLE, etc
 GRANT, etc
 Singleton INSERTs
 Some light load/unload operations via COPY
息 2015. All Rights Reserved. 10
Tooling
 DevCenter
 DataStax tool
 Toad for Cassandra
息 2015. All Rights Reserved. 11
Data Modeling
 Start with the query
 How will you access this data
 Then create the schema optimized for those queries
 Store it multiple times if you must  Query Tables
 Uniqueness  what is an overwrite versus an insert?
 Partition keys  for fast lookups
 Clustering Columns  for ranges, etc
 Mix and match for different query patterns  Users by ID, Users by Email, etc
 No joins
 So, no star schema
 Denormalize!
息 2015. All Rights Reserved. 12
Application Methodology
 Instead of rolling back, we retry
 In SQL you try a transaction, and if error, then rollback
 Leverages transaction isolation and rollback
 In Cassandra you try, and if error, try again
 Leverages idempotent data model and high availability
 Well, did you want to write that to the database or not?
 Instead of Transactions and Rollback, we have Idempotency and Retry
息 2015. All Rights Reserved. 13
Drivers / Connections
 Multilanguage drivers
 Java, Python, C/C++, PHP, Ruby, etc
 Leverage the whole cluster
 Connect to all the nodes  connect to one and discovering the others
 Load balancing  Round Robin, etc
 Smart routing of queries  Token Aware Routing
 Not JDBC / ODBC
 Cassandra-specific, but similar
 Cluster, Session, Statement, ResultSet, etc
 Different data types, etc
 Lower-level configurations  number of connections, paging size, etc
息 2015. All Rights Reserved. 14
Batches
 These are not what you think they are
 How they work
 Send all the statements to the coordinator
 If logged, then a batchlog is written for durability  and replicates it
 Executes each statement  involves other nodes (possibly)
 They are not done as an isolated set of INSERTs
 Well, they are if they all update the same partition (nuance)
 Things will be seen as they are executed
 They do not speed things up
 The coordinator has to do all the work  latency increase and timeouts
 You still need to talk to multiple nodes for multiple INSERTs
 Logged batches require a lot of work by the coordinator
 Maintain a batchlog for durability  and replicates it
 But, they do serve a purpose
 Inter-table consistency (logged batches)
 Some bulk load performance (unlogged batches)
息 2015. All Rights Reserved. 15
Secondary Indices
 These are not what you think they are
 Secondary Indices need to consult every Cassandra node
 Cassandra is optimized for single node queries
 Okay for single-partition queries, but not really worth it
 Maintaining secondary indices is also an overhead
 Basically, they rarely make things faster in cases that matter
 Consider Materialized Views in Cassandra 3.0
息 2015. All Rights Reserved. 16
Lightweight Transactions
 These are not what you think they are
 They are not SQL transactions
 They do not support roll-back, etc
 They do allow for serialization and isolation
 Useful for operations like creating accounts
 INSERT INTO users(username, email) VALUES
('cassandra_fan', 'abc@def.com') IF NOT EXISTS
 They are costly
 Paxos algorithm  multiple passes/rounds
 But they do serve a purpose
 Use sparingly
息 2015. All Rights Reserved. 17
Thank you
Cassandra?
息 2015. All Rights Reserved. 19
Cassandra!
息 2015. All Rights Reserved. 20

More Related Content

Cassandra: An Alien Technology That's not so Alien

  • 1. Cassandra: An Alien Technology Thats not so Alien
  • 2. Who am I? Brian Hess Sr. Product Manger for Analytics Math and CS background Distributed Systems Algorithms Government data mining research Data warehousing (Netezza) SQL and data mining and UDF fun Joined Datastax 1.5 years ago New to NoSQL and Cassandra 息 2015. All Rights Reserved. 2
  • 3. 息 2015. All Rights Reserved. 3
  • 4. 息 2015. All Rights Reserved. 4
  • 5. Agenda The Good Query Language Tooling Conceptual Data Model The Different Application Methodology Connections/Drivers Data Modeling The Dangerous Batches Secondary Indices Lightweight Transactions 息 2015. All Rights Reserved. 5
  • 6. Cassandra Query Language Very much like SQL SELECT, INSERT, DELETE WHERE clauses CREATE TABLE, GRANT, etc, etc SQL syntax for Cassandra operations Only supports Cassandra operations Not trying to cover the full SQL space Missing several things: JOIN, GROUP BY, windowed aggregates, subqueries, WITH, etc 息 2015. All Rights Reserved. 6
  • 7. Cassandra Query Language Pop Quiz: CQL or SQL: 1. SELECT a, b, c FROM myData; 2. SELECT a, b, c FROM myData ORDER BY a LIMIT 3; 3. SELECT cust_id, txn_id FROM txn WHERE cust_id=5; 4. SELECT sensor, MAX(temp) FROM txn WHERE sensor=3; 5. SELECT MAX(temp) FROM txn 6. SELECT a.id, a.info, b.moreinfo FROM a JOIN b ON (a.id=b.id) 息 2015. All Rights Reserved. 7
  • 8. Cassandra Query Language Pop Quiz: CQL or SQL: 1. SELECT a, b, c FROM myData; 2. SELECT a, b, c FROM myData ORDER BY a LIMIT 3; 3. SELECT cust_id, txn_id FROM txn WHERE cust_id=5; 4. SELECT sensor, MAX(temp) FROM txn WHERE sensor=3; 5. SELECT MAX(temp) FROM txn 6. SELECT a.id, a.info, b.moreinfo FROM a JOIN b ON (a.id=b.id) 息 2015. All Rights Reserved. 8
  • 9. Tabular Data Model Rows and Columns Like SQL, R data.frame, etc Strong schema Each column has a data type Custom data types including Map, Set, List, and UDTs Can mimic old style thrift tables with Map Most Cassandra tables really have a schema Most data really has a schema 息 2015. All Rights Reserved. 9
  • 10. Tooling Cqlsh Command-line CQL interpreter Mainly used for management operations CREATE KEYSPACE, CREATE TABLE, etc GRANT, etc Singleton INSERTs Some light load/unload operations via COPY 息 2015. All Rights Reserved. 10
  • 11. Tooling DevCenter DataStax tool Toad for Cassandra 息 2015. All Rights Reserved. 11
  • 12. Data Modeling Start with the query How will you access this data Then create the schema optimized for those queries Store it multiple times if you must Query Tables Uniqueness what is an overwrite versus an insert? Partition keys for fast lookups Clustering Columns for ranges, etc Mix and match for different query patterns Users by ID, Users by Email, etc No joins So, no star schema Denormalize! 息 2015. All Rights Reserved. 12
  • 13. Application Methodology Instead of rolling back, we retry In SQL you try a transaction, and if error, then rollback Leverages transaction isolation and rollback In Cassandra you try, and if error, try again Leverages idempotent data model and high availability Well, did you want to write that to the database or not? Instead of Transactions and Rollback, we have Idempotency and Retry 息 2015. All Rights Reserved. 13
  • 14. Drivers / Connections Multilanguage drivers Java, Python, C/C++, PHP, Ruby, etc Leverage the whole cluster Connect to all the nodes connect to one and discovering the others Load balancing Round Robin, etc Smart routing of queries Token Aware Routing Not JDBC / ODBC Cassandra-specific, but similar Cluster, Session, Statement, ResultSet, etc Different data types, etc Lower-level configurations number of connections, paging size, etc 息 2015. All Rights Reserved. 14
  • 15. Batches These are not what you think they are How they work Send all the statements to the coordinator If logged, then a batchlog is written for durability and replicates it Executes each statement involves other nodes (possibly) They are not done as an isolated set of INSERTs Well, they are if they all update the same partition (nuance) Things will be seen as they are executed They do not speed things up The coordinator has to do all the work latency increase and timeouts You still need to talk to multiple nodes for multiple INSERTs Logged batches require a lot of work by the coordinator Maintain a batchlog for durability and replicates it But, they do serve a purpose Inter-table consistency (logged batches) Some bulk load performance (unlogged batches) 息 2015. All Rights Reserved. 15
  • 16. Secondary Indices These are not what you think they are Secondary Indices need to consult every Cassandra node Cassandra is optimized for single node queries Okay for single-partition queries, but not really worth it Maintaining secondary indices is also an overhead Basically, they rarely make things faster in cases that matter Consider Materialized Views in Cassandra 3.0 息 2015. All Rights Reserved. 16
  • 17. Lightweight Transactions These are not what you think they are They are not SQL transactions They do not support roll-back, etc They do allow for serialization and isolation Useful for operations like creating accounts INSERT INTO users(username, email) VALUES ('cassandra_fan', 'abc@def.com') IF NOT EXISTS They are costly Paxos algorithm multiple passes/rounds But they do serve a purpose Use sparingly 息 2015. All Rights Reserved. 17
  • 19. Cassandra? 息 2015. All Rights Reserved. 19
  • 20. Cassandra! 息 2015. All Rights Reserved. 20