
How to get started in Big Data for Master’s Students
Mohamed Nadjib Mami
mami@cs.uni-bonn.de
24 March 2018

1. Big Data is a “way of thinking” not a “Domain”
- It is a Situation
- It is a Way of thinking
- It is an Adaptation
- It is not a Domain
- It is not a Specialty
- It is not only Big in size
Limitations of traditional systems
- Size of computational data
- Speed of flowing data
- Formats of data
- Quality/trustworthiness of data
- Importance of data
Dimensions
- Volume
- Velocity
- Variety
- Veracity
- Value

2. Big Data is Data Management in the back
Source: DAMA-DMBOK2 Framework 2014
● It is all about interacting with data
○ Collect
○ Store
○ Maintain & control
○ Retrieve
○ Analyse

● Take a Data Management class; most importantly:
○ Relational algebra and database, ACID properties
○ SQL query language (focus on join and aggregation queries)
○ NOSQL, CAP theorem, BASE properties
○ Batch vs. stream vs. interactive processing
○ Lambda vs. Kappa architectures
○ Data Lake vs. Data Warehouse concepts

● Relational model
○ The basics of basics ... the past, present (& future?)
○ Data modeled in form of relations
■ Algebra: project, select, join, aggregate, union, intersect...
○ Data stored in an RDBMS as tables, tuples, attributes...
● ACID Properties => guarantees DB integrity
○ Atomicity … apply all ops or nothing
○ Consistency … changes respect constraints
○ Isolation … parallel changes do not interfere
○ Durability … no committed change is lost
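
Atomicity and consistency can be tried out on a single machine. The following is a minimal sketch using Python's built-in sqlite3 module; the account table, the names and the amounts are made up for illustration. A transfer that would violate the CHECK constraint is rolled back as a whole, so the credit that was already applied is undone as well.

```python
import sqlite3

# Hypothetical table and data, used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account (name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO account VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst: both updates apply, or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint failed; the credit to dst was rolled back too.
        return False


ok = transfer(conn, "alice", "bob", 30)       # fits within alice's balance
failed = transfer(conn, "alice", "bob", 500)  # would make alice negative
balances = dict(conn.execute("SELECT name, balance FROM account"))
print(ok, failed, balances)
```

Isolation and durability are not exercised here; an in-memory database is by definition not durable.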

● SQL: Structured Query Language
○ Declarative Query Language for Structured data (tables)
○ Aka. relational query language
■ Implements the relational algebra functions
○ (You should) Focus on JOIN and AGGREGATION
■ JOIN is the basis of querying
■ AGGREGATE is the basis of data analytics
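
To make the JOIN and AGGREGATION advice concrete, here is a small sketch that runs one query combining both, again with Python's sqlite3 module. The student and grade tables and their contents are invented for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema and rows, for illustration only.
conn.executescript("""
    CREATE TABLE student (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE grade (student_id INTEGER, course TEXT, points REAL);
    INSERT INTO student VALUES (1, 'Ada'), (2, 'Linus');
    INSERT INTO grade VALUES (1, 'DB', 1.0), (1, 'ML', 2.0), (2, 'DB', 3.0);
""")

# JOIN links each grade to its student; GROUP BY + COUNT/AVG aggregate per student.
rows = conn.execute("""
    SELECT s.name, COUNT(*) AS courses, AVG(g.points) AS avg_points
    FROM student AS s
    JOIN grade AS g ON g.student_id = s.id
    GROUP BY s.name
    ORDER BY s.name
""").fetchall()

for name, courses, avg_points in rows:
    print(name, courses, avg_points)
```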

● NOSQL (aka. non-relational) = Not Only SQL
○ New application needs => new DB management systems
■ Scalable and scale-out solutions (distributed)
■ Representations other than relational/SQL
■ Flexible schema
○ Not only SQL?
■ Similar syntaxes to SQL are used
● CQL (Cassandra Query Language)

○ Quick lookups (hash, dictionary)
○ Query semi-structured data
○ Query flexible-schema tables
○ Query highly interconnected data
○ A mix of the above (multi-model)
● SQL & NOSQL = friends not foes (complementary)
[Figure: the four NOSQL families: Key-value, Document, Columnar, Graph]

○ Key-value (Simplest NOSQL model)
■ Encode all data in form of (key : value) pairs
■ Long distributed dictionaries/hash
■ Access: HTTP requests, API, etc.
■ Examples:
● Riak, Redis, Voldemort, Dynamo
■ Sample pairs: (105 : abd), (106 : azb), (107 : tvu), (108 : lol)
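
The "long distributed dictionary" idea can be sketched in a few lines. This is a toy, not how Riak, Redis, Voldemort or Dynamo are implemented: the class name, the use of plain dicts as "nodes" and the modulo placement rule are all assumptions made for illustration. It stores the sample pairs from the slide.

```python
class ToyKeyValueStore:
    """A toy key-value store spread over several 'nodes' (plain dicts)."""

    def __init__(self, num_nodes=3):
        self.nodes = [{} for _ in range(num_nodes)]

    def _node_for(self, key):
        # Hash the key to pick the node that owns it.
        return self.nodes[hash(key) % len(self.nodes)]

    def put(self, key, value):
        self._node_for(key)[key] = value

    def get(self, key, default=None):
        return self._node_for(key).get(key, default)


store = ToyKeyValueStore()
for key, value in [(105, "abd"), (106, "azb"), (107, "tvu"), (108, "lol")]:
    store.put(key, value)

print(store.get(106))
print([len(node) for node in store.nodes])  # how many pairs each node holds
```

Real systems usually replace the modulo rule with something like consistent hashing, so that adding a node does not relocate most keys.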

○ Document-oriented
■ Encode data in form of semi-structured “documents”
● Commonly in JSON-like
■ Access: HTTP requests, API, etc.
■ Examples:
● MongoDB, CouchDB, Couchbase
{
"FirstName": "AAA",
"LastName": "BBB",
"Hobbies": ["painting", "swimming"]
}
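
As a rough illustration of querying semi-structured documents, the sketch below parses the document above with Python's json module and filters on a field that not every document has. The second document and the find_by_hobby helper are hypothetical; a real document store would expose its own query API.

```python
import json

raw_documents = [
    '{"FirstName": "AAA", "LastName": "BBB", "Hobbies": ["painting", "swimming"]}',
    # Hypothetical second document with a different set of fields.
    '{"FirstName": "CCC", "Email": "ccc@example.org"}',
]
people = [json.loads(doc) for doc in raw_documents]


def find_by_hobby(documents, hobby):
    """Return first names of people listing `hobby`; a missing field is not an error."""
    return [d["FirstName"] for d in documents if hobby in d.get("Hobbies", [])]


swimmers = find_by_hobby(people, "swimming")
print(swimmers)
```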

○ Columnar
■ Store data in columns (vs. rows in RDBMS)
● Optimized for analytical queries OLAP
■ Based on column families
● Like RDBMS tables but with unfixed schema
■ Examples:
● Cassandra, HBase, Accumulo, Bigtable
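
The row-versus-column distinction can be illustrated without any database. The sketch below only shows the storage layout idea and says nothing about how Cassandra, HBase, Accumulo or Bigtable organise data internally; the table contents are invented.

```python
# Row-oriented layout: one record per entry, all attributes together.
rows = [
    {"id": 1, "city": "Bonn", "temp": 12.0},
    {"id": 2, "city": "Oran", "temp": 14.0},
    {"id": 3, "city": "Bonn", "temp": 10.0},
]

# Column-oriented layout: one list per attribute.
columns = {name: [row[name] for row in rows] for name in rows[0]}

# An analytical query (average temperature) touches a single column,
# instead of reading every attribute of every row.
avg_temp = sum(columns["temp"]) / len(columns["temp"])
print(columns["temp"], avg_temp)
```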

○ Graph-oriented
■ Model data in form of graphs (edges and vertices)
■ Optimal for storing highly interconnected, graph-shaped data
● Query data by traversal
■ Examples:
● Neo4j, InfiniteGraph, Neptune
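
"Query data by traversal" can be sketched as a breadth-first walk over an adjacency list. The follows graph and the within_hops function are hypothetical; graph databases offer dedicated query languages for this, which are not shown here.

```python
from collections import deque

# Hypothetical "who follows whom" graph as an adjacency list.
follows = {
    "ann": ["bob", "cat"],
    "bob": ["dan"],
    "cat": ["dan"],
    "dan": ["eve"],
    "eve": [],
}


def within_hops(graph, start, max_hops):
    """Breadth-first traversal: vertices reachable from start in at most max_hops edges."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        vertex = queue.popleft()
        if distance[vertex] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbour in graph.get(vertex, []):
            if neighbour not in distance:
                distance[neighbour] = distance[vertex] + 1
                queue.append(neighbour)
    del distance[start]
    return distance


reachable = within_hops(follows, "ann", 2)
print(reachable)
```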

● NOSQL and distributed systems (network, shared-data)
○ CAP theorem for designing distributed systems
■ Consistency: every read returns the latest written result
■ Availability: every request gets a result, even a stale one
■ Partition tolerance: the system keeps working when messages between nodes are lost or delayed
○ In the presence of P, choose between C and A (tradeoff)
■ C: query errors or times out, as the requested data is not available
■ A: query returns out-of-date results
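
The C-versus-A choice during a partition can be shown with a deliberately simplified model. Everything here (two replicas, the ToyCluster class, the "CP"/"AP" mode flag) is an assumption made for illustration; real systems involve quorums, timeouts and much more.

```python
class Replica:
    def __init__(self):
        self.value = None


class ToyCluster:
    """Two replicas; `partitioned` means they cannot talk to each other."""

    def __init__(self, mode):
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.primary, self.secondary = Replica(), Replica()
        self.partitioned = False

    def write(self, value):
        # Writes land on the primary; replication needs a working network.
        self.primary.value = value
        if not self.partitioned:
            self.secondary.value = value

    def read_from_secondary(self):
        if self.partitioned and self.mode == "CP":
            # Consistency first: refuse to answer rather than risk a stale value.
            raise TimeoutError("cannot confirm the latest value during a partition")
        # Availability first: answer with whatever the replica has.
        return self.secondary.value


def demo(mode):
    cluster = ToyCluster(mode)
    cluster.write("v1")         # replicated to both nodes
    cluster.partitioned = True  # the network splits
    cluster.write("v2")         # reaches the primary only
    try:
        return cluster.read_from_secondary()
    except TimeoutError:
        return "error"


ap_result = demo("AP")
cp_result = demo("CP")
print(ap_result, cp_result)
```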

○ CAP theorem for designing distributed systems
■ too simplistic | good to learn the basics
○ PACELC extends CAP
■ P(A|C)E(L|C) = if P, choose A or C; Else, choose L or C
[Figure: PACELC decision tree. Partition? Then: Availability or Consistency. Else: Latency or Consistency]

○ BASE of NOSQL (contrasting ACID of RDBMS)
○ Suggested by Eric Brewer, who also proposed the CAP theorem
○ Basically available: guarantees CAP Availability
○ Soft state: system state may change over time, even without new input
○ Eventual consistency: system will become consistent over time
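
Soft state and eventual consistency can be sketched with two replicas that disagree for a while and then converge after exchanging their contents. The last-write-wins rule, the integer timestamps and the merge function are assumptions chosen for brevity; they are one of several possible reconciliation strategies.

```python
def merge(local, remote):
    """Last-write-wins: per key, keep the (timestamp, value) pair with the larger timestamp."""
    merged = dict(local)
    for key, stamped in remote.items():
        if key not in merged or stamped[0] > merged[key][0]:
            merged[key] = stamped
    return merged


# Each replica maps key -> (timestamp, value); the data is hypothetical.
replica_1 = {"cart": (1, ["book"])}
replica_2 = {"cart": (2, ["book", "pen"]), "theme": (1, "dark")}

diverged = replica_1 != replica_2  # soft state: replicas disagree for now

# Replicas exchange their contents in the background and merge.
replica_1 = merge(replica_1, replica_2)
replica_2 = merge(replica_2, replica_1)

converged = replica_1 == replica_2
print(diverged, converged, replica_1)
```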

● Batch vs. stream vs. interactive processing
○ Batch: actions applied to bulked data periodically
■ Example: Extract-Transform-Load (ETL)
○ Stream (real-time): computation applied to data as soon as it arrives
■ Example: analyse weather sensor data
○ Interactive/iterative:
■ Example: Machine Learning algorithms
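
The difference between batch and stream processing shows up even in a simple average. In this sketch (sensor readings invented), the batch version needs the whole dataset at once, while the streaming version keeps a small state and produces an up-to-date answer after every reading.

```python
def batch_mean(readings):
    """Batch: wait until all data is collected, then compute once."""
    return sum(readings) / len(readings)


class RunningMean:
    """Stream: update the result incrementally as each reading arrives."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def update(self, x):
        self.count += 1
        self.mean += (x - self.mean) / self.count
        return self.mean


readings = [10.0, 12.0, 14.0, 16.0]  # hypothetical temperature readings

stream = RunningMean()
partial_means = [stream.update(x) for x in readings]

print(batch_mean(readings))
print(partial_means)
```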

● Lambda vs. Kappa architectures
○ Lambda architecture
■ Three layers:
● Batch
● Speed
● Serving
■ Fault-tolerant
■ Scalable
Source: MapR - Lambda Architecture
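
The interplay of the three layers can be sketched with page-view counting. This is a toy under stated assumptions: the function and class names are made up, the "master dataset" is a Python list, and real deployments would use distributed engines for each layer.

```python
from collections import Counter


def batch_layer(master_dataset):
    """Recompute the batch view from all historical events (slow but complete)."""
    return Counter(master_dataset)


class SpeedLayer:
    """Incrementally count events that arrived after the last batch run."""

    def __init__(self):
        self.realtime_view = Counter()

    def on_event(self, page):
        self.realtime_view[page] += 1


def serving_layer(batch_view, realtime_view, page):
    """Answer a query by merging the batch view and the real-time view."""
    return batch_view[page] + realtime_view[page]


history = ["home", "docs", "home"]  # events already covered by the batch run
batch_view = batch_layer(history)

speed = SpeedLayer()
for page in ["home", "blog"]:       # events that arrived since then
    speed.on_event(page)

home_views = serving_layer(batch_view, speed.realtime_view, "home")
blog_views = serving_layer(batch_view, speed.realtime_view, "blog")
print(home_views, blog_views)
```

When the next batch run absorbs the recent events, the real-time view for that period can be discarded.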

● Lambda vs. Kappa architectures
○ Kappa architecture
■ Batch layer omitted => batch is treated as a special case of stream
Source: O’Reilly: Applying the Kappa architecture in the telco industry

● Data Warehouse can be implemented on top of Data Lake
Data Lake | Data Warehouse
Repository of raw data in its original form | A well-structured data repository
Append-only, read-only | Read and write
Schema-on-read (no predefined schema) | Schema-on-write (well-predefined schema)
ELT (Extract, Load, Transform) | ETL (Extract, Transform, Load)
Open to any access tools incl. DWH tools | BI and OLAP tools and standards
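
The schema-on-read versus schema-on-write row can be illustrated as follows. In this sketch the "lake" is a list of raw strings and the "warehouse" is an in-memory SQLite table; both, along with the sensor data and the read_with_schema helper, are assumptions made for the example.

```python
import json
import sqlite3

# Data lake: raw lines are stored as they come, valid or not.
lake = [
    '{"sensor": "s1", "temp": "21.5"}',
    '{"sensor": "s2"}',
    "not json at all",
]


def read_with_schema(raw_lines):
    """Schema-on-read: structure is imposed (and bad records skipped) at query time."""
    for line in raw_lines:
        try:
            record = json.loads(line)
            yield record["sensor"], float(record["temp"])
        except (ValueError, KeyError, TypeError):
            continue


usable = list(read_with_schema(lake))

# Data warehouse: the schema is enforced when data is written.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE reading (sensor TEXT NOT NULL, temp REAL NOT NULL)")
try:
    warehouse.execute("INSERT INTO reading VALUES (?, ?)", ("s2", None))
    rejected = False
except sqlite3.IntegrityError:
    rejected = True  # the incomplete record never enters the warehouse

print(usable, rejected)
```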

3. Think big, think distributed
● Adaptation: now we deal with cluster-wide, large-scale data
● New essential factors come into play
○ Movement (aka shuffling) of large data
○ Reading and writing of large data
● MUST-know: fault-tolerance, replication, high-availability,
distributed file system ... in addition to previous concepts
○ Advice: learn them from Hadoop (HDFS), Apache Spark

4. Adopt an “Optimizer” way of thinking
● History: my code works!
● Now: my code works fast
⇒ slow code ~= non-working code
○ How fast does my app get the job done? (performance)
○ How much output does my app generate? (throughput)
● Tuning and optimization are your new concerns e.g.
○ Reduce shuffled data (moved)
○ Reduce data written to/read from disk
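
One common way to reduce shuffled data is to pre-aggregate on each partition before anything crosses the network. The sketch below imitates this for a word count with plain Python lists standing in for partitions; the function names are hypothetical and no cluster framework is involved.

```python
from collections import Counter

# Two hypothetical partitions, each living on a different machine.
partitions = [
    ["big", "data", "big"],
    ["data", "big", "big"],
]


def shuffle_without_combiner(partitions):
    """Send one (word, 1) record per occurrence across the network."""
    return [(word, 1) for part in partitions for word in part]


def shuffle_with_combiner(partitions):
    """Pre-aggregate on each partition, then send one record per distinct word."""
    records = []
    for part in partitions:
        records.extend(Counter(part).items())
    return records


def reduce_counts(records):
    """Final aggregation on the receiving side."""
    totals = Counter()
    for word, n in records:
        totals[word] += n
    return dict(totals)


naive = shuffle_without_combiner(partitions)
combined = shuffle_with_combiner(partitions)

print(len(naive), len(combined))  # number of records moved in each case
print(reduce_counts(naive) == reduce_counts(combined))
```

Both variants give the same final counts; only the number of records that have to be moved differs.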

General advice and comments
● Don’t move to big data settings if you don’t have to
● Don’t hesitate to start it if you feel like … it’s a lot of fun! :)
● For people who intend to do research in relation to big data
○ “I have an idea, I just need to implement it” becomes
○ “I just have an idea, I need to implement it”
○ Two phases instead of one:
■ 1. Make it work on a single machine
■ 2. Make it work on a cluster, then optimize
○ But it’s a lot of fun … still!
● Can all that fade away? Yes, as anything can, but it is unlikely any time soon

Wrap-up
1. Big Data is a Way of thinking not a Domain
2. Big Data is Data Management in the back
3. Think big, think distributed
4. Adopt an “Optimizer” way of thinking
Questions?
