How to get started in Big Data for Master's Students
Mohamed Nadjib Mami
mami@cs.uni-bonn.de
24 March 2018
1. Big Data is a way of thinking not a Domain
- It is a Situation
- It is a Way of thinking
- It is an Adaptation
- It is not a Domain
- It is not a Specialty
- It is not only Big in size
Limitation of traditional systems
- Size of computational data
- Speed of flowing data
- Formats of data
- Quality/trustworthiness of data
- Importance of data
Dimensions
- Volume
- Velocity
- Variety
- Veracity
- Value
2
2. Big Data is Data Management in the back
Source: DAMA-DMBOK2 Framework 2014
 It is all about interacting with data
 Collect
 Store
 Maintain & control
 Retrieve
 Analyse
3
2. Big Data is Data Management in the back
 Take a Data Management class, focusing on:
 Relational algebra and database, ACID properties
 SQL query language (focus on join and aggregation queries)
 NOSQL, CAP theorem, BASE properties
 Batch vs. stream vs. interactive processing
 Lambda vs. Kappa architectures
 Data Lake vs. Data Warehouse concepts
4
2. Big Data is Data Management in the back
 Relational model
 The basics of basics ... the past, present (& future?)
 Data modeled in form of relations
 Algebra: project, select, join, aggregate, union, intersect...
 Data stored in an RDBMS as tables, tuples, attributes...
 ACID properties => guarantee DB integrity
 Atomicity: apply all ops or nothing
 Consistency: changes respect constraints
 Isolation: parallel changes do not interfere
 Durability: no committed change is lost
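A minimal sketch of atomicity, using Python's built-in sqlite3 module. The account table, the names and the amounts are hypothetical and only serve the illustration. The CHECK constraint stands in for "changes respect constraints": when the second update violates it, the first update is rolled back as well.

```python
import sqlite3

# Hypothetical two-account ledger; table and column names are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account (name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO account VALUES (?, ?)",
                 [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst: both updates apply, or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE name = ?",
                (amount, dst))
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE name = ?",
                (amount, src))
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint failed; the credit to dst was undone too.
        return False


ok = transfer(conn, "alice", "bob", 30)       # fits within alice's balance
failed = transfer(conn, "alice", "bob", 500)  # would make alice negative
balances = dict(conn.execute("SELECT name, balance FROM account"))
print(ok, failed, balances)
```

After the failed transfer, bob's balance is unchanged even though his update ran first inside the transaction.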
5
2. Big Data is Data Management in the back
 SQL: Structured Query Language
 Declarative Query Language for Structured data (tables)
 Aka. relational query language
 Implements the relational algebra functions
 (You should) Focus on JOIN and AGGREGATION
 JOIN is the basis of querying
 AGGREGATE is the basis of data analytics
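To make the JOIN and AGGREGATION advice concrete, here is a small sketch that runs one query combining both, again through Python's built-in sqlite3 module. The customer and purchase tables and their contents are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema and rows, for illustration only.
conn.executescript("""
CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE purchase (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
INSERT INTO customer VALUES (1, 'Ada'), (2, 'Linus');
INSERT INTO purchase VALUES (1, 1, 20.0), (2, 1, 5.0), (3, 2, 12.5);
""")

# JOIN links each purchase to its customer;
# GROUP BY with COUNT/SUM aggregates the joined rows per customer.
query = """
SELECT c.name, COUNT(p.id) AS n_purchases, SUM(p.amount) AS total
FROM customer AS c
JOIN purchase AS p ON p.customer_id = c.id
GROUP BY c.name
ORDER BY c.name
"""
rows = conn.execute(query).fetchall()
for name, n_purchases, total in rows:
    print(name, n_purchases, total)
```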
6
2. Big Data is Data Management in the back
 NOSQL (aka. non-relational) = Not Only SQL
 New application needs => new DB management systems
 Scalable and scale-out solutions (distributed)
 Representations other than relational/SQL
 Flexible schema
 Not only SQL?
 Similar syntaxes to SQL are used
 CQL (Cassandra Query Language)
7
2. Big Data is Data Management in the back
 NOSQL (aka. non-relational) = Not Only SQL
 Quick lookups (hash, dictionary)
 Query semi-structured data
 Query flexible-schema tables
 Query highly interconnected data
 A mix of the above (multi-model)
 SQL & NOSQL = friends not foes (complementary)
8
[Figure: the four NOSQL families: Key-value, Document, Columnar, Graph]
2. Big Data is Data Management in the back
 NOSQL (aka. non-relational) = Not Only SQL
 Key-value (Simplest NOSQL model)
 Encode all data in form of (key : value) pairs
 Large distributed dictionaries/hash tables
 Access: HTTP requests, API, etc.
 Examples:
 Riak, Redis, Voldemort, Dynamo
9
[Figure: example (key : value) pairs: (105 : abd), (106 : azb), (107 : tvu), (108 : lol)]
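The key-value model can be sketched as a dictionary spread over several nodes, where a hash of the key decides which node holds the pair. TinyKVStore and its methods are hypothetical names for this illustration, not the API of Riak, Redis, Voldemort or Dynamo. The pairs are the ones from the slide's example.

```python
import hashlib


class TinyKVStore:
    """Toy key-value store: each 'node' is a plain dict (assumption: no network)."""

    def __init__(self, n_nodes=3):
        self.nodes = [dict() for _ in range(n_nodes)]

    def _node_for(self, key):
        # Hash the key to pick the node responsible for it.
        digest = hashlib.sha256(str(key).encode("utf-8")).hexdigest()
        return self.nodes[int(digest, 16) % len(self.nodes)]

    def put(self, key, value):
        self._node_for(key)[key] = value

    def get(self, key, default=None):
        return self._node_for(key).get(key, default)


store = TinyKVStore()
for key, value in [(105, "abd"), (106, "azb"), (107, "tvu"), (108, "lol")]:
    store.put(key, value)

print(store.get(106))
print(store.get(999))  # unknown key: falls back to the default
```

Lookups touch a single node, which is why this model suits quick lookups but not queries across keys.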
2. Big Data is Data Management in the back
 NOSQL (aka. non-relational) = Not Only SQL
 Document-oriented
 Encode data in form of semi-structured documents
 Commonly in a JSON-like format
 Access: HTTP requests, API, etc.
 Examples:
 MongoDB, CouchDB, Couchbase
10
{
"FirstName": "AAA",
"LastName": "BBB",
"Hobbies":
["painting", "swimming"]
}
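A short sketch of what querying semi-structured documents looks like, using only Python's json module. The extra documents and the find helper are hypothetical; real document stores such as MongoDB or CouchDB expose their own query interfaces. Note that the documents do not share the same fields, which is the flexible-schema point.

```python
import json

# Documents with differing fields (flexible schema); contents are made up.
raw_docs = [
    '{"FirstName": "AAA", "LastName": "BBB", "Hobbies": ["painting", "swimming"]}',
    '{"FirstName": "CCC", "Hobbies": ["chess"]}',
    '{"FirstName": "DDD", "LastName": "EEE"}',
]
docs = [json.loads(raw) for raw in raw_docs]


def find(docs, field, value):
    """Return documents whose `field` equals `value` or, for lists, contains it."""
    hits = []
    for doc in docs:
        stored = doc.get(field)  # a missing field simply yields None
        if stored == value or (isinstance(stored, list) and value in stored):
            hits.append(doc)
    return hits


swimmers = find(docs, "Hobbies", "swimming")
print([doc["FirstName"] for doc in swimmers])
```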
2. Big Data is Data Management in the back
 NOSQL (aka. non-relational) = Not Only SQL
 Columnar
 Store data in columns (vs. rows in RDBMS)
 Optimized for analytical (OLAP) queries
 Based on column families
 Like RDBMS tables but without a fixed schema
 Examples:
 Cassandra, HBase, Accumulo, Bigtable
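The difference between storing rows and storing columns can be sketched with plain Python structures. The sample records are made up, and this only illustrates the layout idea, not how Cassandra, HBase, Accumulo or Bigtable organise data on disk.

```python
# Row-oriented layout: one record per entity, all attributes together.
rows = [
    {"id": 1, "city": "Bonn", "temp": 11.0},
    {"id": 2, "city": "Berlin", "temp": 9.5},
    {"id": 3, "city": "Bonn", "temp": 12.5},
]

# Column-oriented layout: one list per attribute.
columns = {name: [row[name] for row in rows] for name in rows[0]}

# An analytical query (average temperature) has to walk every full record
# in the row layout, but reads a single list in the column layout.
avg_from_rows = sum(row["temp"] for row in rows) / len(rows)
avg_from_columns = sum(columns["temp"]) / len(columns["temp"])
print(avg_from_rows, avg_from_columns)
```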
11
2. Big Data is Data Management in the back
 NOSQL (aka. non-relational) = Not Only SQL
 Graph-oriented
 Model data in form of graphs (edges and vertices)
 Optimal for storing highly interconnected, graph-shaped data
 Query data by traversal
 Examples:
 Neo4j, InfiniteGraph, Neptune
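"Query data by traversal" can be sketched as a breadth-first walk over an adjacency list. The small social graph and the within_hops helper are hypothetical; graph databases such as Neo4j provide their own query languages for this.

```python
from collections import deque

# Hypothetical "follows" graph: vertex -> list of neighbouring vertices.
graph = {
    "alice": ["bob", "carol"],
    "bob": ["dave"],
    "carol": ["dave"],
    "dave": ["erin"],
    "erin": [],
}


def within_hops(graph, start, max_hops):
    """Breadth-first traversal: vertices reachable in at most max_hops edges."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        vertex = queue.popleft()
        if distance[vertex] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbour in graph.get(vertex, []):
            if neighbour not in distance:
                distance[neighbour] = distance[vertex] + 1
                queue.append(neighbour)
    return {v: d for v, d in distance.items() if v != start}


result = within_hops(graph, "alice", 2)
print(result)
```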
12
2. Big Data is Data Management in the back
 NOSQL and distributed systems (network, shared-data)
 CAP theorem for designing distributed systems
 Consistency: every read returns the latest written value
 Availability: every request gets a result, even a stale one
 Partition tolerance: the system keeps working when messages between nodes are lost or delayed
 In the presence of a partition (P), choose between C and A (tradeoff)
 C: query errors or times out because the requested data is not available
 A: query returns out-of-date results
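The C versus A choice under a partition can be sketched with a toy register replicated on two nodes. TwoReplicaRegister and its "CP"/"AP" modes are hypothetical and heavily simplified (no real network, no clocks); they only mirror the two outcomes listed above.

```python
class TwoReplicaRegister:
    """Toy register replicated on nodes a and b (not a real system)."""

    def __init__(self, mode):
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.a = None
        self.b = None
        self.partitioned = False  # True when a and b cannot talk

    def write_to_a(self, value):
        if self.partitioned and self.mode == "CP":
            raise TimeoutError("replica b unreachable: write refused")
        self.a = value
        if not self.partitioned:
            self.b = value  # replication only works while the link is up

    def read_from_b(self):
        if self.partitioned and self.mode == "CP":
            raise TimeoutError("cannot confirm latest value: read refused")
        return self.b


def run(mode):
    register = TwoReplicaRegister(mode)
    register.write_to_a("v1")
    register.partitioned = True  # the network splits
    try:
        register.write_to_a("v2")
        return register.read_from_b()
    except TimeoutError:
        return "error"


ap_result = run("AP")  # stays available, but b still serves the old value
cp_result = run("CP")  # stays consistent by refusing the operation
print(ap_result, cp_result)
```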
13
2. Big Data is Data Management in the back
 NOSQL and distributed systems (network, shared-data)
 CAP theorem for designing distributed systems
 CAP is too simplistic, but good for learning the basics
 PACELC extends CAP
 P(A|C)E(L|C) = if Partition, choose Availability or Consistency; Else, choose Latency or Consistency
14
[Diagram: Partition? then: Availability vs. Consistency; else: Latency vs. Consistency]
2. Big Data is Data Management in the back
 NOSQL and distributed systems (network, shared-data)
 BASE of NOSQL (contrasting ACID of RDBMS)
 Suggested by the same person as the CAP theorem (Eric Brewer)
 Basically available: guarantees CAP Availability
 Soft state: system state may change over time, even without new input
 Eventual consistency: the system will become consistent over time
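Soft state and eventual consistency can be sketched with two nodes that reconcile periodically. The Node class, the integer timestamps and the last-write-wins rule are assumptions chosen for brevity; real systems use more careful conflict resolution.

```python
class Node:
    """Toy replica holding key -> (timestamp, value)."""

    def __init__(self):
        self.data = {}

    def write(self, key, value, ts):
        current = self.data.get(key)
        # Last-write-wins: keep the entry with the highest timestamp.
        if current is None or ts > current[0]:
            self.data[key] = (ts, value)

    def read(self, key):
        entry = self.data.get(key)
        return None if entry is None else entry[1]


def anti_entropy(a, b):
    """Exchange state in both directions so the replicas converge."""
    for key, (ts, value) in list(a.data.items()):
        b.write(key, value, ts)
    for key, (ts, value) in list(b.data.items()):
        a.write(key, value, ts)


n1, n2 = Node(), Node()
n1.write("x", "old", ts=1)
anti_entropy(n1, n2)
n2.write("x", "new", ts=2)   # update lands on n2 only
stale = n1.read("x")         # n1 still answers (available) with the old value
anti_entropy(n1, n2)         # background sync
converged = (n1.read("x"), n2.read("x"))
print(stale, converged)
```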
15
2. Big Data is Data Management in the back
 Batch vs. stream vs. interactive processing
 Batch: actions applied periodically to data accumulated in bulk
 Example: Extract-Transform-Load (ETL)
 Stream (real-time): computation applied to data as soon as it arrives
 Example: analysing weather sensor data
 Interactive/iterative:
 Example: Machine Learning algorithms
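The batch versus stream distinction can be sketched on the weather sensor example: a batch job waits for all readings and computes once, while a stream job updates its answer on every arrival. The readings and the RunningMean class are made up for the illustration.

```python
readings = [12.0, 14.0, 13.0, 15.0]  # hypothetical temperature readings


def batch_mean(values):
    """Batch: needs the complete data set before it can answer."""
    return sum(values) / len(values)


class RunningMean:
    """Stream: keeps constant-size state and updates it per event."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def update(self, x):
        self.count += 1
        self.mean += (x - self.mean) / self.count  # incremental mean
        return self.mean


stream = RunningMean()
stream_means = [stream.update(x) for x in readings]
print(stream_means)
print(batch_mean(readings))
```

Both approaches agree once all readings have been seen; the stream version simply has an answer available at every step.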
16
2. Big Data is Data Management in the back
 Lambda vs. Kappa architectures
 Lambda architecture
 Three layers:
 Batch
 Speed
 Serving
 Fault-tolerant
 Scalable
17
Source: MapR - Lambda Architecture
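How the three Lambda layers cooperate can be sketched with page-view counts. The event lists and function names are hypothetical: the batch layer recomputes a view from the full log, the speed layer covers events that arrived since the last batch run, and the serving layer merges both at query time.

```python
from collections import Counter

# Hypothetical events: (page, views).
master_log = [("page_a", 1), ("page_b", 1), ("page_a", 1)]  # seen by last batch run
recent_events = [("page_a", 1), ("page_c", 1)]              # arrived afterwards


def batch_layer(events):
    """Recompute the batch view from the complete log."""
    view = Counter()
    for page, views in events:
        view[page] += views
    return view


def speed_layer(view, event):
    """Incrementally update the real-time view with one new event."""
    page, views = event
    view[page] += views


def serving_layer(batch_view, realtime_view, page):
    """Answer a query by merging both views."""
    return batch_view[page] + realtime_view[page]


batch_view = batch_layer(master_log)
realtime_view = Counter()
for event in recent_events:
    speed_layer(realtime_view, event)

print(serving_layer(batch_view, realtime_view, "page_a"))
```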
2. Big Data is Data Management in the back
 Lambda vs. Kappa architectures
 Kappa architecture
 Batch layer omitted => batch is treated as a special case of streaming
18
Source: O'Reilly: Applying the Kappa architecture in the telco industry
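The idea that batch is a special case of stream can be sketched with a single processing function: live processing consumes events as they come, and "batch" reprocessing is just a replay of the whole log through the same code. The log contents and the process function are hypothetical.

```python
def process(events, state=None):
    """The one and only job: fold events into per-key totals."""
    state = {} if state is None else state
    for key, amount in events:
        state[key] = state.get(key, 0) + amount
    return state


log = [("page_a", 1), ("page_b", 1), ("page_a", 1)]  # append-only event log

# Live path: events are consumed incrementally as they arrive.
live_state = process(log[:2])
live_state = process(log[2:], live_state)

# Reprocessing path: replay the full log through the same function.
replayed_state = process(log)

print(live_state == replayed_state)
```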
2. Big Data is Data Management in the back
 Data Warehouse can be implemented on top of Data Lake
19
Data Lake                                    | Data Warehouse
Repository of raw data in its original form  | A well-structured data repository
Append-only, read-only                       | Read and write
Schema-on-read (no predefined schema)        | Schema-on-write (well-predefined schema)
ELT (Extract, Load, Transform)               | ETL (Extract, Transform, Load)
Open to any access tools, incl. DWH tools    | BI and OLAP tools and standards
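Schema-on-read versus schema-on-write can be sketched as follows. The SCHEMA, the records and the helper functions are hypothetical: the warehouse validates every record before storing it, while the lake stores raw lines untouched and applies a schema only when the data is read.

```python
import json

SCHEMA = {"id": int, "temp": float}  # assumed target schema

warehouse = []  # holds validated records only
lake = []       # holds raw lines exactly as received


def warehouse_write(record):
    """Schema-on-write: reject anything that does not match the schema."""
    for field, expected_type in SCHEMA.items():
        if not isinstance(record.get(field), expected_type):
            raise ValueError(f"bad or missing field: {field}")
    warehouse.append(record)


def lake_write(raw_line):
    """Append-only storage of raw data; no checks at write time."""
    lake.append(raw_line)


def lake_read():
    """Schema-on-read: parse and coerce at query time, skip unusable lines."""
    records = []
    for line in lake:
        try:
            raw = json.loads(line)
            records.append({"id": int(raw["id"]), "temp": float(raw["temp"])})
        except (ValueError, KeyError, TypeError):
            continue
    return records


lake_write('{"id": "1", "temp": "11.5"}')
lake_write("not json at all")
lake_write('{"id": 2, "temp": 9}')

warehouse_write({"id": 1, "temp": 11.5})
try:
    warehouse_write({"id": "1", "temp": "11.5"})  # wrong types
    rejected = False
except ValueError:
    rejected = True

print(lake_read(), warehouse, rejected)
```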
3. Think big, think distributed
 Adaptation: now we deal with cluster-wide large scale data
 New essential factors come into play
 Movement (aka shuffling) of large data
 Reading and writing of large data
 MUST-know: fault-tolerance, replication, high-availability, distributed file systems, in addition to the previous concepts
 Advice: learn them from Hadoop (HDFS) and Apache Spark
20
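Replication and fault tolerance can be sketched by placing file blocks on several nodes and checking what stays readable when nodes fail. The round-robin placement below is an assumption made for simplicity; it is not the placement policy HDFS actually uses.

```python
def place_blocks(blocks, nodes, replication=3):
    """Assign each block to `replication` consecutive nodes (round-robin)."""
    placement = {}
    for i, block in enumerate(blocks):
        placement[block] = [nodes[(i + r) % len(nodes)]
                            for r in range(replication)]
    return placement


def readable(placement, failed):
    """A block is readable while at least one of its replicas is alive."""
    return {block: any(node not in failed for node in replicas)
            for block, replicas in placement.items()}


nodes = ["n1", "n2", "n3", "n4"]  # hypothetical cluster
placement = place_blocks(["b0", "b1", "b2", "b3"], nodes, replication=2)

one_failure = readable(placement, failed={"n2"})
two_failures = readable(placement, failed={"n2", "n3"})
print(placement)
print(one_failure)
print(two_failures)
```

With two replicas per block, any single node failure is tolerated, while losing both holders of a block makes that block unavailable.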
4. Adopt an Optimizer way of thinking
 History: my code works!
 Now: my code works fast
 Slow code ~= code that does not work
 How fast does my app get the job done? (performance)
 How much output does my app generate per unit of time? (throughput)
 Tuning and optimization are your new concerns e.g.
 Reduce shuffled data (moved)
 Reduce data written to/read from disk
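"Reduce shuffled data" can be sketched with a word count: pre-aggregating inside each partition before data moves shrinks what crosses the network, while the final result stays the same. The partitions and function names are hypothetical; the same idea is known as a combiner in MapReduce.

```python
from collections import Counter

# Hypothetical data, already split across two workers.
partitions = [
    ["spark", "hdfs", "spark"],
    ["spark", "spark", "hdfs"],
]


def shuffle_naive(partitions):
    """Every single (word, 1) pair is sent over the network."""
    return [(word, 1) for part in partitions for word in part]


def shuffle_combined(partitions):
    """Pre-aggregate locally: one (word, count) pair per word per partition."""
    return [(word, count)
            for part in partitions
            for word, count in Counter(part).items()]


def reduce_counts(pairs):
    """Final aggregation on the receiving side."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return dict(totals)


naive = shuffle_naive(partitions)
combined = shuffle_combined(partitions)
print(len(naive), len(combined))
print(reduce_counts(naive) == reduce_counts(combined))
```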
21
General advice and comments
 Don't move to big data settings if you don't have to
 Don't hesitate to start if you feel like it: it's a lot of fun! :)
 For people who intend to do research related to big data:
 "I have an idea, I just need to implement it" becomes
 "I just have an idea, I need to implement it"
 Two phases instead of one:
 1. Make it work on a single machine
 2. Make it work on your cluster, then optimize
 But it's still a lot of fun!
 Can all that fade away? Yes, as anything can, but it is unlikely any time soon
22
Wrap-up
1. Big Data is a Way of thinking not a Domain
2. Big Data is Data Management in the back
3. Think big, think distributed
4. Adopt an Optimizer way of thinking
23
questions
