Big Data: NoSQL & the DBA




– Aswani Vonteddu
The evolution of data stores

•   Data modeling
•   Data from the Developer’s standpoint
•   Data from the DBA’s standpoint
•   Impedance mismatch and the rise of ORM




Hierarchical object graph model




Source: A co-Relational Model of Data for Large Shared Data Banks by Erik Meijer and Gavin Bierman
Normalized for tables in RDBMS




Source: A co-Relational Model of Data for Large Shared Data Banks by Erik Meijer and Gavin Bierman
Data – Summary
• In order to use an RDBMS,

  – Designer has to model data into tables

  – Developer must normalize/de-normalize

  – DBA has to speed up queries

Impedance mismatch and the rise of ORMs (like Hibernate)

[Table(name="Products")]
class Product
{
    [Column(PrimaryKey=true)]int ID;
    [Column]string Title;
    [Column]string Author;
    [Column]int Year;
    [Column]int Pages;
    private EntitySet<Rating> _Ratings;
    [
        Association( Storage="_Ratings",
                     ThisKey="ID",
                     OtherKey="ProductID",
                     DeleteRule="ONDELETECASCADE"
                   )
    ]
    ICollection<Rating> Ratings{ ... }

    private EntitySet<Keyword> _Keywords;
    […]
    ICollection<Keyword> Keywords{ ... }
}

[Table(name="Keywords")]
class Keyword
{
    [Column(PrimaryKey=true)]int ID;
    [Column]string Keyword;
    [Column(IsForeignKey=true)]int ProductID;
}

[Table(name="Ratings")]
class Rating
{
    [Column(PrimaryKey=true)]int ID;
    [Column]string Rating;
    [Column(IsForeignKey=true)]int ProductID;
}

Source: A co-Relational Model of Data for Large Shared Data Banks by Erik Meijer and Gavin Bierman
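The mismatch between a nested object graph and normalized tables can be sketched in a few lines of Python. This is only an illustration: the sample values and the helper names normalize and denormalize are assumptions made for this example, and the field names simply mirror the mapping on this slide.

```python
# Hypothetical sample object: one product with nested ratings and keywords.
sample = {
    "ID": 1,
    "Title": "Sample Title",
    "Author": "Sample Author",
    "Year": 2000,
    "Pages": 100,
    "Ratings": ["4 stars", "5 stars"],
    "Keywords": ["sample", "example"],
}


def normalize(product):
    """Flatten one nested product into rows for three tables linked by ProductID."""
    product_row = {k: v for k, v in product.items() if k not in ("Ratings", "Keywords")}
    rating_rows = [
        {"ID": i, "Rating": rating, "ProductID": product["ID"]}
        for i, rating in enumerate(product["Ratings"], start=1)
    ]
    keyword_rows = [
        {"ID": i, "Keyword": keyword, "ProductID": product["ID"]}
        for i, keyword in enumerate(product["Keywords"], start=1)
    ]
    return product_row, rating_rows, keyword_rows


def denormalize(product_row, rating_rows, keyword_rows):
    """Rebuild the nested object by joining the child rows on ProductID."""
    product = dict(product_row)
    pid = product_row["ID"]
    product["Ratings"] = [r["Rating"] for r in rating_rows if r["ProductID"] == pid]
    product["Keywords"] = [k["Keyword"] for k in keyword_rows if k["ProductID"] == pid]
    return product
```

An ORM automates this back-and-forth translation; the hand-written version shows how much bookkeeping (surrogate IDs, foreign keys, joins) the object model never needed.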
o So what is Big Data?

o Sources

o Applications

o Technologies


What is Big Data?
• It is not a technology in itself.

• It is information about everything that is
  happening around us, everywhere and every
  minute.

• Almost all of us have already contributed to Big Data,
  with or without our knowledge, and we will
  continue to do so.

• Unstructured

The four characteristics
• Volume

• Velocity

• Variety

• Veracity

Sources

• Clickstream

• Tweets

• Facebook: pictures and comments

• Sensors
  A Boeing 737 generates 240 TB of data
  during a single cross country flight.

Applications
• Classification/Ontologies

• Crowdsourcing - CAPTCHA

• Natural language processing (NLP) –
  Google Translate

• Visualization – Facebook map

Setting up a Big Data platform

• A Big Data platform must be equipped
  with technologies for the following stages
  of data processing:

• Acquisition
• Organization
• Analysis


Technologies

• Acquisition
  – NoSQL databases (DynamoDB, Cassandra)
    • Very high-speed writes


• Organization & Analysis
  – MapReduce (Apache Hadoop)
    • Code moves to the data, not the other way around
    • Map function and Reduce function together
      perform the desired analysis

NoSQL and why now?
• RDBMSs must ensure ACID properties

• The CAP theorem says that no distributed
  system can guarantee all three of Consistency,
  Availability and Partition tolerance at the
  same time

• NoSQL databases are distributed, and are
  better options than an RDBMS for applications
  that can tolerate giving up one of those
  properties.
Relational Databases
• Random disk access

• Data model is totally structured, and
  predefined

• Shared Everything architecture – Single
  point of failure


NoSQL categories
• Key-Value stores

• Graph DB

• Column families

• Document




Simple Key-Value stores
• Distributed Hash Tables

• Eventual consistency

• Replication and Data partitioning

• Example
  Amazon Dynamo
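
Partitioning and replication over a distributed hash table can be sketched with a small consistent-hashing ring. This is a simplified illustration, not Dynamo's actual implementation: the class name HashRing, the preference_list method, the node names and the choice of SHA-256 are all assumptions made for this example.

```python
import bisect
import hashlib


class HashRing:
    """Toy consistent-hashing ring: each key maps to a node plus its successors."""

    def __init__(self, nodes, replicas=2):
        self.replicas = replicas
        # Place every node on the ring at the position given by its hash.
        self.ring = sorted((self._hash(node), node) for node in nodes)
        self.positions = [position for position, _ in self.ring]

    @staticmethod
    def _hash(key):
        return int(hashlib.sha256(key.encode("utf-8")).hexdigest(), 16)

    def preference_list(self, key):
        """Return the node responsible for the key, followed by its replica nodes."""
        # The first node clockwise from the key's position is responsible for it.
        start = bisect.bisect_right(self.positions, self._hash(key)) % len(self.ring)
        count = min(self.replicas, len(self.ring))
        return [self.ring[(start + i) % len(self.ring)][1] for i in range(count)]
```

For example, HashRing(["N1", "N2", "N3", "N4"]).preference_list("user:42") returns two distinct nodes: the responsible node and the next node on the ring, which holds the replica.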

Column families
• Distributed Key-Value stores

• Supports nested columns

• Example
  Cassandra



Apache Cassandra

• Rows are indexed by a key
• Supports columns and super-columns
• Allows structured and unstructured data

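One way to picture the column and super-column model is as nested maps: row key, then super-column, then column, then a timestamped value. The sketch below only illustrates that shape; the helper names new_column_family, insert and get, and the sample rows, are assumptions for this example and not Cassandra's API.

```python
from collections import defaultdict


def new_column_family():
    """row key -> super-column name -> column name -> (value, timestamp)."""
    return defaultdict(lambda: defaultdict(dict))


def insert(cf, row_key, super_column, column, value, timestamp):
    """Store a column value; the write with the newest timestamp wins."""
    current = cf[row_key][super_column].get(column)
    if current is None or timestamp >= current[1]:
        cf[row_key][super_column][column] = (value, timestamp)


def get(cf, row_key, super_column, column):
    """Return the stored value, or None when the row or column does not exist."""
    entry = cf.get(row_key, {}).get(super_column, {}).get(column)
    return None if entry is None else entry[0]
```

Rows do not have to share the same columns, which is what lets the same column family hold structured and unstructured data side by side.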
Cassandra
[Figure: ring of four nodes, N1, N2, N3 and N4]

Cassandra
[Figure: write at ConsistencyLevel.ONE on the four-node ring. N1 is the coordinator, N2 the responsible node, N4 a replica node.]
1. Write arrives at the coordinator (N1) with ConsistencyLevel.ONE
2. Coordinator sends the write request to the responsible node (N2) and the replica node (N4)
3. N2 reports success to the coordinator
4. Coordinator reports success to the caller

Cassandra
[Figure: write at ConsistencyLevel.TWO on the four-node ring. N1 is the coordinator, N2 the responsible node, N4 a replica node.]
1. Write arrives at the coordinator (N1) with ConsistencyLevel.TWO
2. Coordinator sends the write request to the responsible node (N2) and the replica node (N4)
3 or 4. N2 and N4 each report success to the coordinator, in either order
5. Coordinator reports success to the caller

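The effect of the consistency level on a write can be sketched as a small simulation. It is a simplification under stated assumptions: the Node class, the coordinate_write function and the ONE/TWO constants are hypothetical names for this example, and the sketch waits for every replica before counting acknowledgements, whereas a real coordinator can answer as soon as enough replicas have responded.

```python
ONE = 1  # hypothetical constants standing in for the consistency levels
TWO = 2


class Node:
    """A replica that accepts writes only while it is up."""

    def __init__(self, name, up=True):
        self.name = name
        self.up = up
        self.data = {}

    def write(self, key, value):
        if not self.up:
            return False  # no acknowledgement from a node that is down
        self.data[key] = value
        return True


def coordinate_write(replicas, key, value, consistency_level):
    """Send the write to every replica; succeed if enough of them acknowledge."""
    acks = sum(1 for node in replicas if node.write(key, value))
    return acks >= consistency_level
```

With N2 up and N4 down, a write at ONE succeeds while the same write at TWO fails, which is the trade-off between availability and consistency that the level controls.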
Cassandra
• Write operation:
  – Append to the commit log
  – Update the Memtable – in-memory storage structure,
    sorted by key
  – Flush the Memtable to an SSTable on disk
  – Compaction merges SSTables

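The write path can be sketched as a toy log-structured store. This is an illustration under assumptions, not Cassandra's storage engine: the class name TinyStore and the memtable_limit parameter are made up for this example, plain dicts stand in for the Memtable and for on-disk SSTables, and the commit log is just a list.

```python
class TinyStore:
    """Toy write path: commit log, memtable, flush to SSTables, compaction."""

    def __init__(self, memtable_limit=2):
        self.commit_log = []   # append-only record of writes not yet flushed
        self.memtable = {}     # key -> (value, timestamp), held in memory
        self.sstables = []     # immutable tables, oldest first
        self.memtable_limit = memtable_limit

    def write(self, key, value, timestamp):
        self.commit_log.append((key, value, timestamp))
        current = self.memtable.get(key)
        if current is None or timestamp >= current[1]:
            self.memtable[key] = (value, timestamp)
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        """Write the memtable out as a key-sorted, immutable SSTable."""
        self.sstables.append(dict(sorted(self.memtable.items())))
        self.memtable = {}
        self.commit_log = []   # flushed writes no longer need the log

    def read(self, key):
        """Check the memtable and every SSTable; the newest timestamp wins."""
        tables = [self.memtable, *self.sstables]
        candidates = [table[key] for table in tables if key in table]
        if not candidates:
            return None
        return max(candidates, key=lambda entry: entry[1])[0]

    def compact(self):
        """Merge all SSTables into one, keeping only the newest version of each key."""
        merged = {}
        for table in self.sstables:
            for key, (value, timestamp) in table.items():
                if key not in merged or timestamp > merged[key][1]:
                    merged[key] = (value, timestamp)
        self.sstables = [dict(sorted(merged.items()))] if merged else []
```

Writes never modify an existing SSTable, which is why they are fast; the cost is deferred to reads, which may consult several tables, and to compaction, which cleans up old versions.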
Cassandra
• Read operation:
  – Coordinator node forwards the request
    • to the responsible node
    • and to replica nodes, based on the consistency level
      requested
  – Each node
    • Looks up the key in the Memtable and all existing SSTables
    • Takes the value with the latest timestamp
  – Bloom filters speed this up by letting a node skip
    SSTables that cannot contain the key

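A Bloom filter answers "might this key be present?" with no false negatives and occasional false positives, so a negative answer lets a read skip an SSTable entirely. The sketch below is a minimal illustration: the class name BloomFilter, the default sizes and the use of SHA-256 to derive bit positions are assumptions for this example, not how Cassandra implements its filters.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter backed by a Python int used as a bit array."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, key):
        # Derive several bit positions per key by salting the hash input.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, key):
        for position in self._positions(key):
            self.bits |= 1 << position

    def might_contain(self, key):
        """False means definitely absent; True means possibly present."""
        return all((self.bits >> position) & 1 for position in self._positions(key))
```

In the read path, one such filter per SSTable would be consulted first, and only tables whose filter returns True need to be searched.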
Cassandra




Indexes:

• Primary index (on the key) is
  supported by default by the
  Cassandra engine
• Secondary indexes have to be
  built as a new column family
  with the column of interest
  as the key
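The hand-built secondary index described above can be sketched with two maps: the primary column family keyed by row key, and an index column family keyed by the column of interest. The names users, users_by_city, put_user and find_by_city, and the choice of a city column, are hypothetical and chosen only for this example.

```python
users = {}          # primary column family: row key -> columns
users_by_city = {}  # index column family: city -> set of row keys


def put_user(user_id, columns):
    """Write a row and keep the index column family in step with it."""
    old = users.get(user_id)
    if old is not None and "city" in old:
        users_by_city[old["city"]].discard(user_id)  # drop the stale index entry
    users[user_id] = columns
    if "city" in columns:
        users_by_city.setdefault(columns["city"], set()).add(user_id)


def find_by_city(city):
    """Look up row keys in the index, then fetch the rows by primary key."""
    return [users[user_id] for user_id in sorted(users_by_city.get(city, ()))]
```

The application has to maintain both structures on every write, which is the main cost of this approach compared with an index maintained by the database engine.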
Document DBs
• Similar to Key-Value stores, but Values
  are often documents (JSON, ION, …)

• Documents are versioned

• Example
  DynamoDB


MapReduce
• Introduced by Google
• List processing system
• Scales to clusters with thousands of nodes
  and petabytes or exabytes of data
• Code is taken to the data, not the other way around
• Input data must be split into disjoint partitions
• Maps the functions to the nodes where the data
  resides
• Reduces the results from all nodes to build
  the final result
• Example: Hadoop

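The map and reduce steps can be sketched as a word count running in a single process. This is only an illustration of the programming model, not Hadoop's API: the function names map_phase, shuffle, reduce_phase and run_job are assumptions for this example, and each inner list stands in for the data held on one node.

```python
from collections import defaultdict


def map_phase(partition):
    """Runs where the data lives: emit (word, 1) for every word in the partition."""
    for line in partition:
        for word in line.lower().split():
            yield word, 1


def shuffle(mapped_pairs):
    """Group all emitted values by key so each key reaches a single reducer."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine every value emitted for one key into the final result."""
    return key, sum(values)


def run_job(partitions):
    """Map each disjoint partition independently, then reduce the grouped output."""
    mapped = [pair for partition in partitions for pair in map_phase(partition)]
    return dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
```

Because each partition is mapped without reference to any other, the map calls could run on separate nodes in parallel; only the grouped intermediate pairs have to cross the network.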
Techniques & algorithms
•   Vector clocks
•   Hinted handoff
•   Read repair
•   Anti-entropy repair

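Of the techniques listed, vector clocks are the easiest to show in a few lines: each version of a value carries a per-node counter, and comparing two clocks tells whether one version descends from the other or whether the two were written concurrently. The function names increment, merge and compare and the node names are assumptions for this sketch.

```python
def increment(clock, node):
    """Return a copy of the clock with the node's counter advanced by one."""
    updated = dict(clock)
    updated[node] = updated.get(node, 0) + 1
    return updated


def merge(a, b):
    """Element-wise maximum: the clock of a version that has seen both inputs."""
    return {node: max(a.get(node, 0), b.get(node, 0)) for node in a.keys() | b.keys()}


def compare(a, b):
    """Classify clock a relative to b: equal, before, after, or concurrent."""
    nodes = a.keys() | b.keys()
    a_le_b = all(a.get(node, 0) <= b.get(node, 0) for node in nodes)
    b_le_a = all(b.get(node, 0) <= a.get(node, 0) for node in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    return "concurrent"
```

A "concurrent" result is the interesting case: neither version supersedes the other, so the conflict has to be resolved, for example by the application or by a last-write-wins rule.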
Big Data talent
• Deep analytical
  – Mathematicians, operations research
    analysts, statisticians, etc.
• Big data savvy
  – Business and functional
    managers; budget, credit and financial
    analysts
• Supporting technology
  – DBAs, system & network administrators, and
    programmers
The DBA’s role here?
• Tremendous opportunity for DBAs

• As in the early 1990s, when businesses
  migrated from mainframes to Oracle/SQL
  Server/DB2

• Where?
  – Data modeling:
    With vast amounts of data, re-designing DHTs is
    many times harder than re-designing an RDBMS,
    since data migration is painful



Editor's Notes

  • #12: Industries: Healthcare, Telecommunications, Retail, Manufacturing, Public sector