This document discusses the rise of big data and NoSQL databases. It gives an overview of what big data is and of its key characteristics: volume, velocity, variety and veracity. It also discusses some sources and applications of big data. On the database side, it covers the limitations of relational databases for big data use cases and introduces the main categories of NoSQL databases, such as key-value stores, column families and document databases. Cassandra serves as the example for explaining column family databases in more detail.
1. Big Data: NoSQL & the DBA
– Aswani Vonteddu
2. The evolution of data stores
• Data modeling
• Data from the Developer’s standpoint
• Data from the DBA’s standpoint
• Impedance mismatch and the rise of ORM
3. Hierarchical object graph model
Source: A co-Relational Model of Data for Large Shared Data Banks by Erik Meijer and Gavin Bierman
4. Normalized for tables in RDBMS
Source: A co-Relational Model of Data for Large Shared Data Banks by Erik Meijer and Gavin Bierman
5. Data – Summary
• In order to use an RDBMS,
– Designer has to model data into tables
– Developer must normalize/de-normalize
– DBA has to speed up queries
6. Impedance mismatch and the rise of ORMs (like Hibernate)
[Table(name="Products")]
class Product
{
    [Column(PrimaryKey=true)]int ID;
    [Column]string Title;
    [Column]string Author;
    [Column]int Year;
    [Column]int Pages;
    private EntitySet<Rating> _Ratings;
    [
        Association( Storage="_Ratings",
            ThisKey="ID",
            OtherKey="ProductID",
            DeleteRule="ONDELETECASCADE"
        )
    ]
    ICollection<Rating> Ratings{ ... }
    private EntitySet<Keyword> _Keywords;
    […]
    ICollection<Keyword> Keywords{ ... }
}

[Table(name="Keywords")]
class Keyword
{
    [Column(PrimaryKey=true)]int ID;
    [Column]string Keyword;
    [Column(IsForeignKey=true)]int ProductID;
}

[Table(name="Ratings")]
class Rating
{
    [Column(PrimaryKey=true)]int ID;
    [Column]string Rating;
    [Column(IsForeignKey=true)]int ProductID;
}
Source: A co-Relational Model of Data for Large Shared Data Banks by Erik Meijer and Gavin Bierman
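As a rough illustration of the gap between the object graph and the normalized tables, the Python sketch below flattens one nested product into Products, Keywords and Ratings rows. The column names follow the class definitions on this slide; the `normalize` helper and all sample values are made up for illustration and are not part of the original deck.

```python
from itertools import count


def normalize(product, product_id, keyword_ids, rating_ids):
    """Split one nested product (hypothetical sample) into three flat row sets."""
    product_row = {
        "ID": product_id,
        "Title": product["Title"],
        "Author": product["Author"],
        "Year": product["Year"],
        "Pages": product["Pages"],
    }
    # Child collections become separate rows that point back via a foreign key.
    keyword_rows = [
        {"ID": next(keyword_ids), "Keyword": kw, "ProductID": product_id}
        for kw in product["Keywords"]
    ]
    rating_rows = [
        {"ID": next(rating_ids), "Rating": r, "ProductID": product_id}
        for r in product["Ratings"]
    ]
    return product_row, keyword_rows, rating_rows


# Hypothetical sample data, not taken from the slides.
book = {
    "Title": "Example Book",
    "Author": "A. Writer",
    "Year": 2012,
    "Pages": 100,
    "Keywords": ["data", "nosql"],
    "Ratings": ["****"],
}

product_row, keyword_rows, rating_rows = normalize(book, 1, count(1), count(1))
```

Reading the product back as a single object requires joining the three row sets on `ProductID`, which is the work an ORM hides from the developer.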
7.
• So what is Big Data?
• Sources
• Applications
• Technologies
8. What is Big Data?
• It is not a technology in itself.
• It is information about everything that is happening around us, everywhere and every minute.
• Almost all of us have already contributed to Big Data, with or without our knowledge, and we will continue to do so.
• Unstructured
10. Sources
• Clickstream
• Tweets
• Facebook: pictures and comments
• Sensors
– A Boeing 737 generates 240 TB of data during a single cross-country flight.
13. Setting up a Big Data platform
• A Big Data platform must be equipped
with technologies for the following stages
of data processing:
• Acquisition
• Organization
• Analysis
14. Technologies
• Acquisition
– NoSQL databases (DynamoDB, Cassandra)
• Very high speed writes
• Organization & Analysis
– Map Reduce (Apache Hadoop)
• Code moves to the data, not the other way around
• The Map function and the Reduce function together perform the desired analysis
15. NoSQL and why now?
• RDBMSs must ensure ACID properties
• The CAP theorem says that no distributed system can guarantee all three of Consistency, Availability and Partition tolerance at the same time
• NoSQL databases are distributed, and are better options than an RDBMS for applications that can tolerate the lack of one of those properties.
16. Relational Databases
• Random disk access
• Data model is fully structured and predefined
• Shared Everything architecture – Single point of failure
22. Cassandra
[Diagram: write at ConsistencyLevel.ONE on a ring of four nodes, N1 to N4. N1 is the coordinator, N2 the responsible node, N4 the replica node. Labelled steps: 1. ConsistencyLevel.ONE; 2. Write request (to both N2 and N4); 3. Success.]
23. Cassandra
[Diagram: same ring, write at ConsistencyLevel.ONE. N1 is the coordinator, N2 the responsible node, N4 the replica node. Labelled steps: 1. ConsistencyLevel.ONE; 2. Write request (to both N2 and N4); 3. Success; 4. Success.]
24. Cassandra
[Diagram: same ring, write at ConsistencyLevel.TWO. N1 is the coordinator, N2 the responsible node, N4 the replica node. Labelled steps: 1. ConsistencyLevel.TWO; 2. Write request (to both N2 and N4); 3 or 4. Success (one from each node).]
25. Cassandra
[Diagram: same ring, write at ConsistencyLevel.TWO. N1 is the coordinator, N2 the responsible node, N4 the replica node. Labelled steps: 1. ConsistencyLevel.TWO; 2. Write request (to both N2 and N4); 3 or 4. Success (one from each node); 5. Success.]
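The write diagrams above can be approximated with a small Python sketch: a coordinator sends the write to both the responsible node and the replica node, and the consistency level decides how many acknowledgements count as success. This is a simplified, synchronous model under stated assumptions, not Cassandra's implementation; `Node`, `coordinate_write` and `REQUIRED_ACKS` are hypothetical names, and a real coordinator can answer as soon as enough acknowledgements arrive rather than waiting for every node.

```python
from dataclasses import dataclass, field

# Acknowledgements required per consistency level (only the two levels
# shown on the slides are modelled here).
REQUIRED_ACKS = {"ONE": 1, "TWO": 2}


@dataclass
class Node:
    name: str
    up: bool = True
    data: dict = field(default_factory=dict)

    def write(self, key, value):
        """Store the value and acknowledge, or fail if the node is down."""
        if not self.up:
            return False
        self.data[key] = value
        return True


def coordinate_write(replicas, key, value, consistency_level):
    """Send the write to every replica; succeed if enough of them acknowledge."""
    acks = sum(1 for node in replicas if node.write(key, value))
    return acks >= REQUIRED_ACKS[consistency_level]


n2 = Node("N2")            # responsible node
n4 = Node("N4", up=False)  # replica node, simulated as unavailable

ok_one = coordinate_write([n2, n4], "user:1", "alice", "ONE")
ok_two = coordinate_write([n2, n4], "user:1", "alice", "TWO")
```

With one of the two nodes down, the write at level ONE still succeeds while the write at level TWO does not, which is the availability versus consistency trade-off the consistency level exposes.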
26. Cassandra
• Write operation:
– Commit log
– Memtable – In-memory storage structure (a key-indexed map, kept sorted by key)
– SSTable on disk
– Compaction
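The four stages of the write path can be sketched as a toy store in Python. This is a minimal sketch under several assumptions that are not in the slides: `TinyStore` is a hypothetical class, the memtable is flushed after a fixed number of entries, a plain dict stands in for the memtable and is sorted only at flush time, and timestamps, tombstones and commit log truncation are left out.

```python
class TinyStore:
    """Toy commit log + memtable + SSTable store (illustrative only)."""

    def __init__(self, memtable_limit=2):
        self.commit_log = []   # append-only record of every write
        self.memtable = {}     # in-memory buffer of recent writes
        self.sstables = []     # immutable sorted runs, oldest first
        self.memtable_limit = memtable_limit

    def write(self, key, value):
        self.commit_log.append((key, value))  # 1. log first, for durability
        self.memtable[key] = value            # 2. then update the memtable
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        """3. Write the memtable out as a sorted, immutable SSTable."""
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}

    def compact(self):
        """4. Merge all SSTables into one, keeping the newest value per key."""
        merged = {}
        for table in self.sstables:  # oldest first, so newer values overwrite
            merged.update(table)
        self.sstables = [sorted(merged.items())]


store = TinyStore(memtable_limit=2)
for key, value in [("a", 1), ("b", 2), ("a", 3), ("c", 4)]:
    store.write(key, value)
store.compact()
```

Writes never modify an existing SSTable; stale values such as the first `a` are only dropped when compaction merges the tables.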
27. Cassandra
• Read operation:
– Coordinator node forwards the request
• To the responsible node
• And to replica nodes, based on the consistency level requested
– Each node
• Looks up the key in the Memtable and all existing SSTables
• Takes the value with the latest timestamp
– Bloom filters help speed up this operation
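The per-node part of the read can be sketched in Python as follows. It is an illustrative model, not Cassandra's code: `BloomFilter`, `SSTable` and `read` are hypothetical names, the filter size and hash count are arbitrary, and each stored value is assumed to be a `(timestamp, value)` pair.

```python
import hashlib


class BloomFilter:
    """Small Bloom filter: may report false positives, never false negatives."""

    def __init__(self, size=64, hashes=3):
        self.size = size
        self.hashes = hashes
        self.bits = [False] * size

    def _positions(self, key):
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = True

    def might_contain(self, key):
        return all(self.bits[pos] for pos in self._positions(key))


class SSTable:
    """Immutable table of key -> (timestamp, value) with its own Bloom filter."""

    def __init__(self, rows):
        self.rows = dict(rows)
        self.bloom = BloomFilter()
        for key in self.rows:
            self.bloom.add(key)


def read(key, memtable, sstables):
    """Collect candidates from the memtable and SSTables; newest timestamp wins."""
    candidates = []
    if key in memtable:
        candidates.append(memtable[key])
    for table in sstables:
        # The filter lets the node skip tables that cannot hold the key.
        if table.bloom.might_contain(key) and key in table.rows:
            candidates.append(table.rows[key])
    if not candidates:
        return None
    return max(candidates, key=lambda pair: pair[0])[1]


memtable = {"k1": (30, "newest")}
sstables = [
    SSTable({"k1": (10, "old"), "k2": (5, "only")}),
    SSTable({"k1": (20, "newer")}),
]
value = read("k1", memtable, sstables)
```

In a real system the SSTable lookup is a disk read, so a negative answer from the in-memory filter saves I/O; here the second check `key in table.rows` stands in for that disk access and also absorbs false positives.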
28. Cassandra
Indexes:
• Primary index (on the key) is supported by default by the Cassandra engine
• Secondary indexes are to be built as a new column family with the column of interest as the key
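A hand-built secondary index of the kind described above might look like this in Python, with plain dictionaries standing in for column families. The `users` rows, the `city` column and the helper names are hypothetical examples, not part of the original deck.

```python
from collections import defaultdict

# Primary column family: row key -> columns. Sample rows are hypothetical.
users = {
    "u1": {"name": "Ann", "city": "Austin"},
    "u2": {"name": "Bob", "city": "Boston"},
    "u3": {"name": "Cy", "city": "Austin"},
}


def build_index(column_family, column):
    """Build an index column family keyed by the values of one column."""
    index = defaultdict(set)
    for row_key, columns in column_family.items():
        if column in columns:
            index[columns[column]].add(row_key)
    return index


users_by_city = build_index(users, "city")


def find_by_city(city):
    """Two lookups: the index column family first, then the primary one."""
    return [users[row_key] for row_key in sorted(users_by_city.get(city, ()))]


austin_names = [user["name"] for user in find_by_city("Austin")]
```

With this approach the application has to update the index column family on every write to the primary one, otherwise the two drift apart.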
29. Document DBs
• Similar to Key-Value stores, but Values
are often documents (JSON, ION, …)
• Documents are versioned
• Example: DynamoDB
30. Map Reduce
• Introduced by Google
• List processing system
• Scales to clusters with thousands of nodes
• And to data volumes of petabytes or exabytes
• Code is taken to the data, not the other way around
• Data must be disjoint
• Maps the functions to the nodes where the data resides
• And reduces the results from all nodes to build the final result
• Example: Hadoop
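The map and reduce steps can be illustrated with a word count in plain Python. This is a single-process sketch of the idea only: it does not use Hadoop's APIs, the two partitions simulate disjoint data on two nodes, and the function names and sample text are made up for illustration.

```python
from collections import defaultdict
from functools import reduce


def map_phase(partition):
    """Runs where the partition lives: emit (word, 1) for each word."""
    return [(word.lower(), 1) for line in partition for word in line.split()]


def shuffle(mapped_outputs):
    """Group intermediate values by key across all nodes."""
    groups = defaultdict(list)
    for output in mapped_outputs:
        for key, value in output:
            groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Combine the values for each key into the final result."""
    return {key: reduce(lambda a, b: a + b, values) for key, values in groups.items()}


# Disjoint partitions, one per simulated node. Sample text is hypothetical.
partitions = [
    ["big data is big"],
    ["data about data"],
]

counts = reduce_phase(shuffle(map_phase(p) for p in partitions))
```

Because each partition is mapped independently, the map calls could run in parallel on the nodes that hold the data; only the much smaller intermediate pairs need to travel across the network.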
32. Big Data talent
• Deep analytical
– Mathematicians, operations research analysts, statisticians, etc.
• Big data savvy
– Business and functional
managers, budget, credit and financial
analysts
• Supporting Technology
– DBAs, System & Network administrators, and
Programmers
33. The DBA’s role here?
• Tremendous opportunity for DBAs
• Like in the early 90s, when businesses migrated from mainframes to Oracle/SQL Server/DB2
• Where?
– Data modeling: with vast amounts of data, re-designing DHTs is many times harder than re-designing an RDBMS, since data migration is painful
Editor's Notes
#12: Industries: Healthcare, Telecommunications, Retail, Manufacturing, Public sector