For interested students, and to some extent for the generally curious, here are experience-based guidelines on getting started with Big Data.
How to get started in Big Data for master's students
1. How to get started in Big Data for Master's Students
Mohamed Nadjib Mami
mami@cs.uni-bonn.de
24 March 2018
2. 1. Big Data is a way of thinking not a Domain
- It is a Situation
- It is a Way of thinking
- It is an Adaptation
- It is not a Domain
- It is not a Specialty
- It is not only Big in size
Limitations of traditional systems
- Size of computational data
- Speed of flowing data
- Formats of data
- Quality/trustworthiness of data
- Importance of data
Dimensions
- Volume
- Velocity
- Variety
- Veracity
- Value
3. 2. Big Data is Data Management in the back
Source: DAMA-DMBOK2 Framework 2014
It is all about interacting with data
Collect
Store
Maintain & control
Retrieve
Analyse
4. 2. Big Data is Data Management in the back
Take a Data Management class; most importantly, learn:
Relational algebra and databases, ACID properties
SQL query language (focus on join and aggregation queries)
NOSQL, CAP theorem, BASE properties
Batch vs. stream vs. interactive processing
Lambda vs. Kappa architectures
Data Lake vs. Data Warehouse concepts
5. 2. Big Data is Data Management in the back
Relational model
The basics of basics ... the past, present (& future?)
Data modeled in the form of relations
Algebra: project, select, join, aggregate, union, intersect...
Data stored in an RDBMS as tables, tuples, attributes...
ACID properties => guarantee DB integrity
Atomicity: apply all operations or none
Consistency: changes respect the constraints
Isolation: parallel changes do not interfere
Durability: no committed change is lost
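To make atomicity and consistency concrete, here is a minimal sketch using Python's built-in sqlite3 module. The account table, the names and the CHECK constraint are hypothetical, chosen only for illustration.

```python
import sqlite3

# Hypothetical two-account ledger, kept in memory for the demo.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account (name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO account VALUES (?, ?)",
                 [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move money atomically: both updates are applied, or neither is."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute("UPDATE account SET balance = balance + ? WHERE name = ?",
                         (amount, dst))
            conn.execute("UPDATE account SET balance = balance - ? WHERE name = ?",
                         (amount, src))
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint was violated; the credit to dst is undone too.
        return False


ok = transfer(conn, "alice", "bob", 30)       # valid transfer
failed = transfer(conn, "alice", "bob", 500)  # would make alice negative
balances = dict(conn.execute("SELECT name, balance FROM account"))
```

The second transfer credits bob first and only then fails on alice, so the rollback is what keeps bob's balance from changing.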
6. 2. Big Data is Data Management in the back
SQL: Structured Query Language
Declarative Query Language for Structured data (tables)
Aka relational query language
Implements the relational algebra operators
(You should) Focus on JOIN and AGGREGATION
JOIN is the basis of querying
AGGREGATE is the basis of data analytics
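A small sketch of a JOIN combined with an aggregation, again with sqlite3. The customer and purchase tables and their contents are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE purchase (customer_id INTEGER, amount INTEGER);
    INSERT INTO customer VALUES (1, 'Ada'), (2, 'Linus');
    INSERT INTO purchase VALUES (1, 20), (1, 5), (2, 7);
""")

# JOIN links each purchase to its customer;
# GROUP BY with COUNT/SUM aggregates the joined rows per customer.
rows = conn.execute("""
    SELECT c.name, COUNT(*) AS n_purchases, SUM(p.amount) AS total
    FROM customer AS c
    JOIN purchase AS p ON p.customer_id = c.id
    GROUP BY c.name
    ORDER BY total DESC
""").fetchall()
```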
7. 2. Big Data is Data Management in the back
NOSQL (aka. non-relational) = Not Only SQL
New application needs => new DB management systems
Scalable and scale-out solutions (distributed)
Representations other than relational/SQL
Flexible schema
Not only SQL?
Similar syntaxes to SQL are used
CQL (Cassandra Query Language)
8. 2. Big Data is Data Management in the back
NOSQL (aka. non-relational) = Not Only SQL
Quick lookups (hash, dictionary)
Query semi-structured data
Query flexible-schema tables
Query highly interconnected data
A mix of the above (multi-model)
SQL & NOSQL = friends not foes (complementary)
NOSQL models: Key-value, Document, Columnar, Graph
9. 2. Big Data is Data Management in the back
NOSQL (aka. non-relational) = Not Only SQL
Key-value (Simplest NOSQL model)
Encode all data in the form of (key : value) pairs
Large distributed dictionaries/hash tables
Access: HTTP requests, API, etc.
Examples:
Riak, Redis, Voldemort, Dynamo
Example pairs (key : value):
105 : abd
106 : azb
107 : tvu
108 : lol
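The "distributed dictionary" idea can be sketched in a few lines. This is a toy, in-memory illustration and not the API of Riak, Redis or any other store; TinyKVStore and its node layout are hypothetical.

```python
import zlib


class TinyKVStore:
    """Toy key-value store spread over several in-memory 'nodes'."""

    def __init__(self, n_nodes=3):
        self.nodes = [{} for _ in range(n_nodes)]

    def _node_for(self, key):
        # A stable hash sends the same key to the same node every time.
        index = zlib.crc32(str(key).encode()) % len(self.nodes)
        return self.nodes[index]

    def put(self, key, value):
        self._node_for(key)[key] = value

    def get(self, key, default=None):
        return self._node_for(key).get(key, default)


store = TinyKVStore()
for key, value in [(105, "abd"), (106, "azb"), (107, "tvu"), (108, "lol")]:
    store.put(key, value)
```

A lookup only touches the one node that owns the key, which is why this model suits quick lookups.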
10. 2. Big Data is Data Management in the back
NOSQL (aka. non-relational) = Not Only SQL
Document-oriented
Encode data in form of semi-structured documents
Commonly in JSON-like formats
Access: HTTP requests, API, etc.
Examples:
MongoDB, CouchDB, Couchbase
{
  "FirstName": "AAA",
  "LastName": "BBB",
  "Hobbies": ["painting", "swimming"]
}
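A rough sketch of querying such documents, using only Python's json module rather than a real document store. The extra documents and the find_by_hobby helper are hypothetical; note that the documents do not share the same fields.

```python
import json

# Three documents with different shapes (flexible schema).
raw_docs = [
    '{"FirstName": "AAA", "LastName": "BBB", "Hobbies": ["painting", "swimming"]}',
    '{"FirstName": "CCC", "Hobbies": ["chess"]}',
    '{"FirstName": "DDD", "LastName": "EEE"}',
]
collection = [json.loads(doc) for doc in raw_docs]


def find_by_hobby(docs, hobby):
    """Return the documents whose optional Hobbies array contains hobby."""
    return [doc for doc in docs if hobby in doc.get("Hobbies", [])]


swimmers = find_by_hobby(collection, "swimming")
```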
11. 2. Big Data is Data Management in the back
NOSQL (aka. non-relational) = Not Only SQL
Columnar
Store data in columns (vs. rows in RDBMS)
Optimized for analytical queries (OLAP)
Based on column families
Like RDBMS tables, but without a fixed schema
Examples:
Cassandra, HBase, Accumulo, Bigtable
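As a loose illustration of column storage versus row storage (plain Python lists, not how Cassandra or HBase lay out data on disk), the sample readings below are invented:

```python
# Row layout: one record per entity, as in a classic RDBMS table.
rows = [
    {"id": 1, "city": "Bonn", "temp": 12},
    {"id": 2, "city": "Oran", "temp": 24},
    {"id": 3, "city": "Bonn", "temp": 15},
]

# Column layout: one list per attribute, aligned by position.
columns = {name: [row[name] for row in rows] for name in rows[0]}

# An aggregate over one attribute reads a single column
# instead of touching every field of every row.
avg_temp = sum(columns["temp"]) / len(columns["temp"])
```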
12. 2. Big Data is Data Management in the back
NOSQL (aka. non-relational) = Not Only SQL
Graph-oriented
Model data in form of graphs (edges and vertices)
Optimal for storing highly interconnected, graph-shaped data
Query data by traversal
Examples:
Neo4j, InfiniteGraph, Amazon Neptune
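"Query by traversal" can be sketched with a breadth-first search over an adjacency list. The small "follows" graph and the within_hops helper are hypothetical; real graph databases expose their own query languages.

```python
from collections import deque

# Hypothetical "follows" graph stored as an adjacency list.
graph = {
    "alice": ["bob", "carol"],
    "bob": ["dave"],
    "carol": ["dave"],
    "dave": ["erin"],
    "erin": [],
}


def within_hops(graph, start, max_hops):
    """Breadth-first traversal: map each reachable vertex to its distance."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distance[node] == max_hops:
            continue  # do not expand past the hop limit
        for neighbour in graph.get(node, []):
            if neighbour not in distance:
                distance[neighbour] = distance[node] + 1
                queue.append(neighbour)
    distance.pop(start)
    return distance


reachable = within_hops(graph, "alice", 2)
```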
13. 2. Big Data is Data Management in the back
NOSQL and distributed systems (network, shared-data)
CAP theorem for designing distributed systems
Consistency: every read returns the latest result
Availability: every request gets a result, even a stale one
Partition tolerance: the system tolerates lost messages between nodes
In the presence of P, choose between C and A (tradeoff)
Choosing C: the query errors or times out because the requested data is not available
Choosing A: the query returns out-of-date results
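The C-versus-A choice during a partition can be simulated with two in-memory replicas. This is a deliberately simplified model; TwoNodeStore, its "CP"/"AP" modes and the scenario function are hypothetical names.

```python
class TwoNodeStore:
    """Toy primary/secondary pair that behaves as CP or AP under a partition."""

    def __init__(self, mode):
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.primary_value = None
        self.secondary_value = None
        self.partitioned = False

    def write(self, value):
        # Writes land on the primary; replication needs a working link.
        self.primary_value = value
        if not self.partitioned:
            self.secondary_value = value

    def read_from_secondary(self):
        if self.partitioned and self.mode == "CP":
            # Consistency first: refuse rather than risk a stale answer.
            raise TimeoutError("cannot confirm the latest value")
        # Availability first: answer with whatever this replica has.
        return self.secondary_value


def scenario(mode):
    store = TwoNodeStore(mode)
    store.write("v1")
    store.partitioned = True   # the link between the nodes breaks
    store.write("v2")          # only the primary sees this write
    try:
        return store.read_from_secondary()
    except TimeoutError:
        return "error"


ap_result = scenario("AP")
cp_result = scenario("CP")
```

In AP mode the read returns the stale "v1"; in CP mode the same read fails instead.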
14. 2. Big Data is Data Management in the back
NOSQL and distributed systems (network, shared-data)
CAP theorem for designing distributed systems
Too simplistic, but good for learning the basics
PACELC extends CAP
P(A|C)E(L|C) = if Partition, choose A or C; Else, choose L (latency) or C
[Figure: PACELC decision diagram. Partition? Then: Availability or Consistency. Else: Latency or Consistency.]
15. 2. Big Data is Data Management in the back
NOSQL and distributed systems (network, shared-data)
BASE of NOSQL (contrasting ACID of RDBMS)
Suggested by the same person as the CAP theorem (Eric Brewer)
Basically available: guarantees availability in the CAP sense
Soft state: the system state may change over time
Eventual consistency: the system will become consistent over time
16. 2. Big Data is Data Management in the back
Batch vs. stream vs. interactive processing
Batch: actions applied periodically to data collected in bulk
Example: Extract-Transform-Load (ETL)
Stream (real-time): computation applied to data as soon as it arrives
Example: analysing weather sensor data
Interactive/iterative:
Example: Machine Learning algorithms
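The batch/stream difference in miniature: the same average computed once over all the data, then incrementally as readings arrive. The readings are made-up sensor values.

```python
def batch_average(readings):
    """Batch: all data is already collected; compute once over the whole set."""
    return sum(readings) / len(readings)


def stream_averages(readings):
    """Stream: update the running result as each reading arrives."""
    count, total = 0, 0.0
    for reading in readings:
        count += 1
        total += reading
        yield total / count  # a fresh answer after every event


readings = [10.0, 20.0, 30.0]
batch_result = batch_average(readings)
stream_results = list(stream_averages(readings))
```

The stream version keeps only a count and a total, so it never needs the full history in memory.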
17. 2. Big Data is Data Management in the back
Lambda vs. Kappa architectures
Lambda architecture
Three layers:
Batch
Speed
Serving
Fault-tolerant
Scalable
Source: MapR - Lambda Architecture
18. 2. Big Data is Data Management in the back
Lambda vs. Kappa architectures
Kappa architecture
Batch layer omitted => batch is a special case of stream
Source: O'Reilly: Applying the Kappa architecture in the telco industry
19. 2. Big Data is Data Management in the back
A Data Warehouse can be implemented on top of a Data Lake
Data Lake | Data Warehouse
Repository of raw data in its original form | A well-structured data repository
Append-only, read-only | Read and write
Schema-on-read (no predefined schema) | Schema-on-write (well-predefined schema)
ELT (Extract, Load, Transform) | ETL (Extract, Transform, Load)
Open to any access tools, incl. DWH tools | BI and OLAP tools and standards
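Schema-on-write versus schema-on-read, sketched with plain Python lists standing in for the two repositories. SCHEMA and the three helper functions are hypothetical.

```python
import json

# Hypothetical warehouse schema: field name -> required Python type.
SCHEMA = {"sensor": str, "temp": float}


def warehouse_insert(table, record):
    """Schema-on-write: validate before storing, reject anything else."""
    matches = set(record) == set(SCHEMA) and all(
        isinstance(record[field], kind) for field, kind in SCHEMA.items()
    )
    if not matches:
        raise ValueError(f"record does not match schema: {record}")
    table.append(record)


def lake_append(lake, raw_line):
    """Data lake: store the raw line as-is, with no checks."""
    lake.append(raw_line)


def lake_read(lake):
    """Schema-on-read: interpret raw lines at query time, skip misfits."""
    for line in lake:
        try:
            doc = json.loads(line)
            yield {"sensor": str(doc["sensor"]), "temp": float(doc["temp"])}
        except (ValueError, KeyError, TypeError):
            continue


lake = []
for line in ['{"sensor": "s1", "temp": "21.5"}', "garbage", '{"sensor": "s2"}']:
    lake_append(lake, line)
readable = list(lake_read(lake))

table = []
warehouse_insert(table, {"sensor": "s1", "temp": 21.5})
try:
    warehouse_insert(table, {"sensor": "s2", "temp": "hot"})
    rejected = False
except ValueError:
    rejected = True
```

The lake accepts all three lines and pays the cleaning cost on every read; the warehouse pays it once, at write time.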
20. 3. Think big, think distributed
Adaptation: now we deal with cluster-wide, large-scale data
New essential factors come into play
Movement (aka shuffling) of large data
Reading and writing of large data
MUST-know: fault tolerance, replication, high availability,
distributed file systems ... in addition to the previous concepts
Advice: learn them from Hadoop (HDFS) and Apache Spark
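What shuffling means can be sketched without a cluster: records produced on different workers are routed by key to the partition that will process them. This is a single-process simulation, not Spark or Hadoop code; the function names and sample pairs are hypothetical.

```python
import zlib
from collections import defaultdict


def partition_of(key, n_partitions):
    """Stable hash partitioning: equal keys always meet in one partition."""
    return zlib.crc32(key.encode()) % n_partitions


def shuffle(map_outputs, n_partitions):
    """Route every (key, value) pair to its partition; count moved records."""
    partitions = [defaultdict(list) for _ in range(n_partitions)]
    moved = 0
    for worker_pairs in map_outputs:
        for key, value in worker_pairs:
            partitions[partition_of(key, n_partitions)][key].append(value)
            moved += 1  # on a real cluster this is network traffic
    return partitions, moved


# Output of two hypothetical map workers.
map_outputs = [[("a", 1), ("b", 1), ("a", 1)], [("b", 1), ("c", 1)]]
partitions, moved = shuffle(map_outputs, n_partitions=2)
counts = {key: sum(values) for part in partitions for key, values in part.items()}
```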
21. 4. Adopt an Optimizer way of thinking
History: my code works!
Now: my code works fast
Slow-running code ~= non-working code
How fast does my app get the job done? (performance)
How much output does my app generate per unit of time? (throughput)
Tuning and optimization are your new concerns, e.g.
Reduce shuffled (moved) data
Reduce data written to/read from disk
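One common way to reduce shuffled data is to pre-aggregate on each worker before anything is sent (a "combiner" in MapReduce terms). Below is a single-process word-count sketch; the inputs and helper names are invented for illustration.

```python
from collections import Counter


def map_words(lines):
    """Emit one (word, 1) pair per word, as a naive mapper would."""
    return [(word, 1) for line in lines for word in line.split()]


def combine(pairs):
    """Pre-aggregate on the worker: one (word, count) pair per distinct word."""
    return list(Counter(word for word, _ in pairs).items())


def reduce_all(outputs):
    """Final aggregation over whatever the workers sent."""
    result = Counter()
    for pairs in outputs:
        for word, n in pairs:
            result[word] += n
    return result


# Input lines held by two hypothetical workers.
worker_inputs = [["big data big data big"], ["data is big"]]
raw = [map_words(lines) for lines in worker_inputs]
combined = [combine(pairs) for pairs in raw]

shuffled_raw = sum(len(pairs) for pairs in raw)            # records sent without combining
shuffled_combined = sum(len(pairs) for pairs in combined)  # records sent with combining
```

Both paths give the same final counts, but the combined path moves fewer records, and the gap grows with how often keys repeat on a worker.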
22. General advice and comments
Don't move to big data settings if you don't have to
Don't hesitate to start if you feel like it's a lot of fun! :)
For people who intend to do research related to big data:
"I have an idea, I just need to implement it" becomes
"I just have an idea, I need to implement it"
Two phases instead of one:
1. Make it work on your single machine
2. Make it work on your cluster >> and optimize
But it's still a lot of fun!
Can all that fade away? Yes, as anything can, but it is unlikely any time soon
23. Wrap-up
1. Big Data is a Way of thinking not a Domain
2. Big Data is Data Management in the back
3. Think big, think distributed
4. Adopt an Optimizer way of thinking
Questions?