
Big Data: NoSQL & the DBA

– Aswani Vonteddu

The evolution of data stores

• Data modeling
• Data from the Developer’s standpoint
• Data from the DBA’s standpoint
• Impedance mismatch and the rise of ORM

Hierarchical object graph model

Source: A co-Relational Model of Data for Large Shared Data Banks by Erik Meijer and Gavin Bierman

Normalized for tables in RDBMS

Data – Summary
• In order to use an RDBMS:

– The designer must model data as tables

– The developer must normalize/denormalize

– The DBA has to speed up queries

Impedance mismatch and the rise of ORMs (like Hibernate)

[Table(name="Products")]
class Product
{
    [Column(PrimaryKey=true)] int ID;
    [Column] string Title;
    [Column] string Author;
    [Column] int Year;
    [Column] int Pages;

    private EntitySet<Rating> _Ratings;
    [Association(Storage="_Ratings",
                 ThisKey="ID",
                 OtherKey="ProductID",
                 DeleteRule="ONDELETECASCADE")]
    ICollection<Rating> Ratings{ ... }

    private EntitySet<Keyword> _Keywords;
    […]
    ICollection<Keyword> Keywords{ ... }
}

[Table(name="Keywords")]
class Keyword
{
    [Column(PrimaryKey=true)] int ID;
    [Column] string Keyword;
    [Column(IsForeignKey=true)] int ProductID;
}

[Table(name="Ratings")]
class Rating
{
    [Column(PrimaryKey=true)] int ID;
    [Column] string Rating;
    [Column(IsForeignKey=true)] int ProductID;
}
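
The mismatch that the mapping attributes paper over can be sketched without any ORM: one nested product object has to be split into three flat tables linked by foreign keys. This is a minimal Python sketch; the `normalize` helper and the sample product are hypothetical, chosen only to mirror the Products, Keywords and Ratings tables.

```python
def normalize(product, product_id):
    """Split one nested product object into flat rows for three tables."""
    product_row = {
        "ID": product_id,
        "Title": product["title"],
        "Author": product["author"],
        "Year": product["year"],
        "Pages": product["pages"],
    }
    # Child rows carry the parent's key as a foreign key (ProductID).
    keyword_rows = [
        {"ID": i, "Keyword": kw, "ProductID": product_id}
        for i, kw in enumerate(product["keywords"], start=1)
    ]
    rating_rows = [
        {"ID": i, "Rating": r, "ProductID": product_id}
        for i, r in enumerate(product["ratings"], start=1)
    ]
    return product_row, keyword_rows, rating_rows


# Hypothetical sample data, for illustration only.
book = {
    "title": "Example Title",
    "author": "Example Author",
    "year": 2000,
    "pages": 100,
    "keywords": ["databases", "nosql"],
    "ratings": ["****", "***"],
}
product_row, keyword_rows, rating_rows = normalize(book, product_id=1)
```

Reading the object back means joining the three tables on ProductID, which is the work an ORM automates.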

o So what is Big Data?

o Sources

o Applications

o Technologies

What is Big Data?
• It is not a technology in itself.

• It is information about everything happening around us, everywhere and every minute

• Almost all of us have already contributed to Big Data, with or without our knowledge, and we will continue to do so

• Unstructured

The four characteristics
• Volume

• Velocity

• Variety

• Veracity

Sources

• Clickstream

• Tweets

• Facebook: pictures and comments

• Sensors
– By one widely quoted estimate, a Boeing 737 generates 240 TB of data during a single cross-country flight

Applications
• Classification/Ontologies

• Crowdsourcing - CAPTCHA

• Natural language processing (NLP) – Google Translate

• Visualization – Facebook map

Setting up a Big Data platform

• A Big Data platform must be equipped
with technologies for the following stages
of data processing:

• Acquisition
• Organization
• Analysis

Technologies

• Acquisition
– NoSQL databases (DynamoDB, Cassandra)
• Very high-speed writes

• Organization & Analysis
– MapReduce (Apache Hadoop)
• Code moves to the data, not the other way around
• A Map function and a Reduce function together perform the desired analysis

NoSQL and why now?
• RDBMSs must ensure ACID properties

• The CAP theorem says that no distributed system can guarantee all three of Consistency, Availability and Partition tolerance at the same time

• NoSQL databases are distributed, and are better options than an RDBMS for applications that can tolerate giving up one of those properties

Relational Databases
• Random disk access

• Data model is fully structured and predefined

• Shared-everything architecture: single point of failure

NoSQL categories
• Key-Value stores

• Graph DB

• Column families

• Document

Simple Key-Value stores
• Distributed Hash Tables

• Eventual consistency

• Replication and Data partitioning

• Example
Amazon Dynamo
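
The partitioning and replication bullets above can be sketched as a consistent-hash ring: each key hashes to a position on the ring, the next node clockwise is responsible for it, and the nodes after that hold the replicas. This is a simplified illustration, not Dynamo's actual implementation; the `HashRing` class, the `copies` parameter and the use of MD5 for placement are assumptions made for the sketch.

```python
import bisect
import hashlib


def ring_position(value):
    """Map a string to a position on the hash ring (a large integer)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    def __init__(self, nodes, copies=2):
        self.copies = copies  # total copies of each key, replicas included
        # Sort nodes by their position on the ring.
        self.ring = sorted((ring_position(n), n) for n in nodes)
        self.positions = [pos for pos, _ in self.ring]

    def nodes_for(self, key):
        """Return the responsible node followed by its replica nodes."""
        start = bisect.bisect_right(self.positions, ring_position(key))
        count = min(self.copies, len(self.ring))
        # Walk clockwise from the key's position, wrapping around the ring.
        return [self.ring[(start + i) % len(self.ring)][1] for i in range(count)]


ring = HashRing(["N1", "N2", "N3", "N4"], copies=2)
owners = ring.nodes_for("user:42")  # responsible node first, then one replica
```

Because placement depends only on hashes, adding or removing a node moves only the keys adjacent to it on the ring rather than reshuffling everything.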

Column families
• Distributed Key-Value stores

• Supports nested columns

• Example
Cassandra

Apache Cassandra

• Indexed by a key
• Supports columns and super-columns
• Stores structured or unstructured data

Cassandra
[Figure: a ring of four nodes, N1 to N4]

Cassandra
[Figure: write at ConsistencyLevel.ONE on the four-node ring. N1 is the coordinator, N2 the responsible node, N4 a replica node. Steps shown: 1. ConsistencyLevel.ONE; 2. Write request to N2 and to N4; 3. Success]

Cassandra
[Figure: write at ConsistencyLevel.ONE, continued. N1 is the coordinator. Steps shown: 1. ConsistencyLevel.ONE; 2. Write request to N2 and to N4; 3. Success; 4. Success]

Cassandra
[Figure: write at ConsistencyLevel.TWO on the four-node ring. N1 is the coordinator. Steps shown: 1. ConsistencyLevel.TWO; 2. Write request to N2 and to N4; 3 or 4. Success (shown twice)]

Cassandra
[Figure: write at ConsistencyLevel.TWO, continued. N1 is the coordinator. Steps shown: 1. ConsistencyLevel.TWO; 2. Write request to N2 and to N4; 3 or 4. Success (shown twice); 5. Success]
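
The write diagrams can be approximated in a few lines: the coordinator forwards the write to the responsible node and its replica, and reports success only when the number of acknowledgements reaches the requested consistency level. This is a hedged sketch, not Cassandra's code; the `Node` class and `coordinate_write` function are hypothetical, and the calls are synchronous, whereas a real coordinator answers the client as soon as enough acknowledgements have arrived.

```python
class Node:
    def __init__(self, name, up=True):
        self.name = name
        self.up = up
        self.data = {}

    def write(self, key, value):
        """Store the value and acknowledge; a down node never acknowledges."""
        if not self.up:
            return False
        self.data[key] = value
        return True


def coordinate_write(replicas, key, value, required_acks):
    """Forward the write to every replica; succeed if enough acknowledge."""
    acks = sum(1 for node in replicas if node.write(key, value))
    return acks >= required_acks


ONE, TWO = 1, 2  # stand-ins for ConsistencyLevel.ONE and ConsistencyLevel.TWO

n2 = Node("N2")            # responsible node
n4 = Node("N4", up=False)  # replica node, unreachable in this run

ok_one = coordinate_write([n2, n4], "k", "v", ONE)  # one ack is enough
ok_two = coordinate_write([n2, n4], "k", "v", TWO)  # needs both acks
```

With N4 unreachable, the write at level ONE succeeds and the write at level TWO fails, which is the availability-versus-consistency trade the consistency level lets each request choose.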

Cassandra
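• Write operation:
– The write is appended to the commit log
– It is then applied to the Memtable, an in-memory storage structure (kind of a hash table)
– A full Memtable is flushed to an SSTable on disk
– Compaction merges SSTables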
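
The write path above can be sketched as a toy store. This is an illustration under simplifying assumptions, not Cassandra's implementation: the `TinyStore` class and its `memtable_limit` parameter are hypothetical, everything lives in memory, and the commit log is never truncated.

```python
class TinyStore:
    def __init__(self, memtable_limit=2):
        self.commit_log = []   # append-only record of every write
        self.memtable = {}     # in-memory, holds the newest values
        self.sstables = []     # immutable sorted runs, oldest first
        self.memtable_limit = memtable_limit

    def write(self, key, value, timestamp):
        self.commit_log.append((key, value, timestamp))  # durability first
        self.memtable[key] = (value, timestamp)
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        """Freeze the memtable into a sorted, immutable SSTable."""
        self.sstables.append(tuple(sorted(self.memtable.items())))
        self.memtable = {}

    def compact(self):
        """Merge all SSTables, keeping the newest timestamp per key."""
        merged = {}
        for table in self.sstables:
            for key, (value, ts) in table:
                if key not in merged or ts > merged[key][1]:
                    merged[key] = (value, ts)
        self.sstables = [tuple(sorted(merged.items()))]


store = TinyStore(memtable_limit=2)
store.write("a", "1", timestamp=1)
store.write("b", "2", timestamp=2)   # second write triggers a flush
store.write("a", "3", timestamp=3)
store.write("c", "4", timestamp=4)   # triggers another flush
store.compact()                      # two SSTables become one
```

Writes never update data in place on disk, which is what makes them fast; compaction pays that cost later by discarding superseded versions.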

Cassandra
• Read operation:
– The coordinator node forwards the request
• to the responsible node
• and to replica nodes, depending on the consistency level requested
– Each node
• looks up the key in the Memtable and all existing SSTables
• takes the value with the latest timestamp
– Bloom filters help speed up this operation
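
The per-node read can be sketched as follows: gather every version of the key from the Memtable and the SSTables, skip SSTables whose Bloom filter rules the key out, and return the version with the latest timestamp. The `BloomFilter` and `SSTable` classes, the filter size and the SHA-256 based hashing are assumptions made for the sketch, not Cassandra's actual structures.

```python
import hashlib


class BloomFilter:
    """Answers "definitely absent" or "possibly present" for a key."""

    def __init__(self, size=64, hashes=3):
        self.size = size
        self.hashes = hashes
        self.bits = [False] * size

    def _positions(self, key):
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = True

    def might_contain(self, key):
        return all(self.bits[pos] for pos in self._positions(key))


class SSTable:
    def __init__(self, rows):
        self.rows = dict(rows)  # key -> (value, timestamp)
        self.bloom = BloomFilter()
        for key in self.rows:
            self.bloom.add(key)


def read(key, memtable, sstables):
    """Collect every version of the key and return the newest value."""
    versions = []
    if key in memtable:
        versions.append(memtable[key])
    for table in sstables:
        # Skip tables whose Bloom filter rules the key out.
        if table.bloom.might_contain(key) and key in table.rows:
            versions.append(table.rows[key])
    if not versions:
        return None
    return max(versions, key=lambda v: v[1])[0]


memtable = {"a": ("new", 3)}
sstables = [SSTable({"a": ("old", 1), "b": ("x", 2)})]
value_a = read("a", memtable, sstables)  # newest version wins
value_b = read("b", memtable, sstables)
value_z = read("z", memtable, sstables)  # key is stored nowhere
```

A Bloom filter can report a key as possibly present when it is not, but never the reverse, so skipping a table on a negative answer is always safe.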

Cassandra

Indexes:

• Primary index (on the key) is supported by default by the Cassandra engine
• Secondary indexes are built as a new column family with the column of interest as the key
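
A hand-built secondary index of this kind can be sketched with two dictionaries standing in for column families: the primary one keyed by row key, and an index keyed by the column of interest. The names `users`, `users_by_city`, `put_user` and `find_by_city` are hypothetical, and plain Python dictionaries replace any real Cassandra client.

```python
from collections import defaultdict

# Primary column family: row key -> columns.
users = {}
# Index column family: value of the indexed column -> set of row keys.
users_by_city = defaultdict(set)


def put_user(row_key, columns):
    """Write the row and keep the hand-built index in step with it."""
    old = users.get(row_key)
    if old is not None:
        # Remove the stale index entry before adding the new one.
        users_by_city[old["city"]].discard(row_key)
    users[row_key] = columns
    users_by_city[columns["city"]].add(row_key)


def find_by_city(city):
    """Look up row keys through the index, then fetch the rows."""
    return [users[k] for k in sorted(users_by_city.get(city, ()))]


put_user("u1", {"name": "Ann", "city": "Oslo"})
put_user("u2", {"name": "Bob", "city": "Oslo"})
put_user("u1", {"name": "Ann", "city": "Rome"})  # u1 moves; index is updated
in_oslo = find_by_city("Oslo")
```

The application, not the database, is responsible for keeping the index consistent with the primary data on every write.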

Document DBs
• Similar to Key-Value stores, but Values
are often documents (JSON, ION, …)

• Documents are versioned

• Example
DynamoDB

Map Reduce
• Introduced by Google
• List processing system
• Scales to clusters with thousands of nodes
• and to petabytes or exabytes of data
• Code is taken to the data, not the other way around
• Data must be disjoint
• Map runs the function on the nodes where the data resides
• Reduce combines the results from all nodes to build the final result
• Example: Hadoop
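
The Map and Reduce steps can be sketched on a single machine with the classic word count. This is plain Python, not the Hadoop API; the function names are hypothetical, and each string in `chunks` stands for a disjoint block of data that would live on a different node.

```python
from collections import defaultdict


def map_words(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Group emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(word, counts):
    """Reduce: combine all values emitted for one word."""
    return word, sum(counts)


# Each chunk stands for a disjoint block of data held by a different node.
chunks = ["big data is big", "data moves code"]
mapped = [pair for chunk in chunks for pair in map_words(chunk)]
result = dict(reduce_counts(w, c) for w, c in shuffle(mapped).items())
```

Because each chunk is mapped independently, the map calls can run in parallel where the data sits, and only the small intermediate pairs travel over the network.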

Techniques & algorithms
• Vector Clocks
• Hinted handoff
• Read repair
• Anti-entropy repair
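
Of the techniques listed, vector clocks are the easiest to show in a few lines: each version of a value carries one counter per node, and comparing two clocks tells whether one version supersedes the other or whether they were written concurrently. This is a minimal sketch with hypothetical function names, not any particular database's implementation.

```python
def increment(clock, node):
    """Return a copy of the clock with this node's counter advanced."""
    updated = dict(clock)
    updated[node] = updated.get(node, 0) + 1
    return updated


def descends(a, b):
    """True if clock a has seen everything clock b has seen."""
    return all(a.get(node, 0) >= count for node, count in b.items())


def compare(a, b):
    if descends(a, b) and descends(b, a):
        return "equal"
    if descends(a, b):
        return "a is newer"
    if descends(b, a):
        return "b is newer"
    return "concurrent"  # conflicting versions; the application must reconcile


v1 = increment({}, "N1")  # {"N1": 1}
v2 = increment(v1, "N2")  # {"N1": 1, "N2": 1}, built on v1
v3 = increment(v1, "N3")  # {"N1": 1, "N3": 1}, also built on v1
```

Here `compare(v2, v1)` reports that v2 is newer, while `compare(v2, v3)` reports a conflict, because neither version has seen the other's update.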

Big Data talent
• Deep analytical
– Mathematicians, operations research analysts, statisticians, ...
• Big data savvy
– Business and functional managers; budget, credit and financial analysts
• Supporting technology
– DBAs, system and network administrators, and programmers

The DBA’s role here?
• Tremendous opportunity for DBAs

• As in the early 1990s, when businesses migrated from mainframes to Oracle/SQL Server/DB2

• Where?
– Data modeling:
With vast amounts of data, redesigning a DHT is many times harder than redesigning an RDBMS, since data migration is painful

