For interested students, and to some extent for the generally curious, here are experience-based guidelines on the topic of Big Data.
How to get started in Big Data for master's students
1. How to get started in
Big Data for Masters
Mohamed Nadjib Mami
24 March 2018
2. 1. Big Data is a way of thinking not a Domain
- It is a Situation
- It is a Way of thinking
- It is an Adaptation
- It is not a Domain
- It is not a Specialty
- It is not only Big in size
Limitations of traditional systems => the 5 Vs
- Size of computational data => Volume
- Speed of flowing data => Velocity
- Formats of data => Variety
- Quality/trustworthiness of data => Veracity
- Importance of data => Value
3. 2. Big Data is Data Management in the back
Source: DAMA-DMBOK2 Framework 2014
It is all about interacting with data
Maintain & control
4. 2. Big Data is Data Management in the back
Take a Data Management class; most importantly:
Relational algebra and database, ACID properties
SQL query language (focus on join and aggregation queries)
NOSQL, CAP theorem, BASE properties
Batch vs. stream vs. interactive processing
Lambda vs. Kappa architectures
Data Lake vs. Data Warehouse concepts
5. 2. Big Data is Data Management in the back
Relational model
The basics of basics ... the past, present (& future?)
Data modeled in the form of relations
Algebra: project, select, join, aggregate, union, intersect...
Data stored in an RDBMS as tables, tuples, attributes...
ACID properties => guarantee DB integrity
Atomicity: apply all ops or nothing
Consistency: changes respect constraints
Isolation: parallel changes do not interfere
Durability: no committed change is lost
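Atomicity can be tried out with Python's built-in sqlite3 module. This is a minimal sketch: the accounts table, the names and the amounts are made up for illustration. The transfer credits one account first, then fails on the debit because of a CHECK constraint, and the whole transaction is rolled back.

```python
import sqlite3

# In-memory database; table and rows are hypothetical examples.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst as one all-or-nothing unit."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst))
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src))
        return True
    except sqlite3.IntegrityError:
        return False


# The debit would make alice's balance negative, so the credit to bob
# that already ran inside the same transaction is undone as well.
ok = transfer(conn, "alice", "bob", 500)
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
```

After the failed transfer, both balances are unchanged: either all operations apply, or none do.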
6. 2. Big Data is Data Management in the back
SQL: Structured Query Language
Declarative Query Language for Structured data (tables)
Aka. relational query language
Implements the relational algebra functions
(You should) Focus on JOIN and AGGREGATION
JOIN is the basis of querying
AGGREGATE is the basis of data analytics
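A small sketch of a query that combines JOIN and AGGREGATION, run through Python's built-in sqlite3 module. The students and grades tables and their contents are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Two hypothetical tables linked by students.id = grades.student_id.
conn.executescript("""
CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE grades (student_id INTEGER, course TEXT, grade REAL);
INSERT INTO students VALUES (1, 'Amina'), (2, 'Karim');
INSERT INTO grades VALUES (1, 'DB', 16.0), (1, 'ML', 14.0), (2, 'DB', 12.0);
""")

# JOIN pairs each student with their grades;
# GROUP BY + COUNT/AVG aggregates the joined rows per student.
rows = conn.execute("""
SELECT s.name, COUNT(*) AS n_courses, AVG(g.grade) AS avg_grade
FROM students AS s
JOIN grades AS g ON g.student_id = s.id
GROUP BY s.name
ORDER BY s.name
""").fetchall()

for name, n_courses, avg_grade in rows:
    print(name, n_courses, avg_grade)
```

The query is declarative: it states which result is wanted and leaves the join and grouping strategy to the database engine.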
7. 2. Big Data is Data Management in the back
NOSQL (aka. non-relational) = Not Only SQL
New application needs => new DB management systems
Scalable and scale-out solutions (distributed)
Representations other than relational/SQL
Flexible schema
Not only SQL?
Similar syntaxes to SQL are used
CQL (Cassandra Query Language)
8. 2. Big Data is Data Management in the back
NOSQL (aka. non-relational) = Not Only SQL
Quick lookups (hash, dictionary)
Query semi-structured data
Query flexible-schema tables
Query highly interconnected data
A mix of the above (multi-model)
SQL & NOSQL = friends not foes (complementary)
9. 2. Big Data is Data Management in the back
NOSQL (aka. non-relational) = Not Only SQL
Key-value (Simplest NOSQL model)
Encode all data in the form of (key : value) pairs
Large distributed dictionaries/hash tables
Access: HTTP requests, API, etc.
Riak, Redis, Voldemort, Dynamo
Example pairs (key : value): 105 : abd, 106 : azb, 107 : tvu, 108 : lol
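The key-value idea can be sketched as a dictionary whose keys are spread over several nodes by hashing. This toy class is an assumption made for illustration and does not reflect how Riak, Redis, Voldemort or Dynamo are implemented; the "nodes" are plain in-memory dicts.

```python
class TinyKeyValueStore:
    """Toy key-value store: each key is hashed onto one of a few 'nodes'."""

    def __init__(self, n_nodes=3):
        # Each node is just a dict here; a real system would use separate machines.
        self.nodes = [{} for _ in range(n_nodes)]

    def _node_for(self, key):
        # Hash partitioning: the key alone decides which node holds the pair.
        return self.nodes[hash(key) % len(self.nodes)]

    def put(self, key, value):
        self._node_for(key)[key] = value

    def get(self, key, default=None):
        return self._node_for(key).get(key, default)


store = TinyKeyValueStore()
for key, value in [(105, "abd"), (106, "azb"), (107, "tvu"), (108, "lol")]:
    store.put(key, value)

print(store.get(107))   # value stored under key 107
print(store.get(999))   # unknown key falls back to the default (None)
```

A lookup only needs the key: the hash points directly at the node holding the value, which is what makes this model quick for simple reads and writes.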
10. 2. Big Data is Data Management in the back
NOSQL (aka. non-relational) = Not Only SQL
Encode data in the form of semi-structured documents
Commonly in a JSON-like format
Access: HTTP requests, API, etc.
MongoDB, CouchDB, Couchbase
"FirstName": "AAA",
"LastName": "BBB",
11. 2. Big Data is Data Management in the back
NOSQL (aka. non-relational) = Not Only SQL
Store data in columns (vs. rows in RDBMS)
Optimized for analytical queries OLAP
Based on column families
Like RDBMS tables but with unfixed schema
Cassandra, HBase, Accumulo, Bigtable
12. 2. Big Data is Data Management in the back
NOSQL (aka. non-relational) = Not Only SQL
Model data in the form of graphs (edges and vertices)
Optimal for storing highly interconnected, graph-shaped data
Query data by traversal
Neo4j, InfiniteGraph, Neptune
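Querying by traversal can be illustrated with a breadth-first search over an adjacency list. This is a plain-Python sketch, not the query language of any graph database; the small "follows" graph and the function name within_hops are made up for the example.

```python
from collections import deque

# Hypothetical directed graph: vertex -> list of vertices it points to.
graph = {
    "ann": ["bob", "cat"],
    "bob": ["dan"],
    "cat": ["dan"],
    "dan": ["eve"],
    "eve": [],
}


def within_hops(graph, start, max_hops):
    """Return {vertex: distance} for every vertex reachable in <= max_hops edges."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        vertex = queue.popleft()
        if distance[vertex] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbour in graph.get(vertex, []):
            if neighbour not in distance:
                distance[neighbour] = distance[vertex] + 1
                queue.append(neighbour)
    return distance


reachable = within_hops(graph, "ann", 2)
print(reachable)
```

The traversal follows edges directly from vertex to vertex, where a relational system would need one self-join per hop.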
13. 2. Big Data is Data Management in the back
NOSQL and distributed systems (network, shared-data)
CAP theorem for designing distributed systems
Consistency: returns the latest results
Availability: has to return a result, even a stale one
Partition tolerance: tolerates lost or delayed messages between nodes
In the presence of P, choose between C and A (tradeoff)
C: query errors or times out as the requested data is n/a
A: query returns out-of-date results
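The C-versus-A choice under a partition can be mimicked with a toy two-replica cluster. This is a deliberately simplified sketch under stated assumptions: the class names, the "CP"/"AP" mode flag and the replication logic are invented for illustration and do not model any real database.

```python
class Replica:
    def __init__(self):
        self.value = None


class TinyCluster:
    """Two replicas; `partitioned` simulates a broken link between them."""

    def __init__(self, mode):
        assert mode in ("CP", "AP")
        self.mode = mode
        self.primary = Replica()
        self.secondary = Replica()
        self.partitioned = False

    def write(self, value):
        self.primary.value = value
        if not self.partitioned:
            # Replication only succeeds while the link is up.
            self.secondary.value = value

    def read_from_secondary(self):
        if self.partitioned and self.mode == "CP":
            # Favour consistency: refuse to answer rather than risk stale data.
            raise TimeoutError("cannot confirm the latest value during a partition")
        # Favour availability: answer with whatever is stored locally.
        return self.secondary.value


def demo(mode):
    cluster = TinyCluster(mode)
    cluster.write("v1")          # replicated to both nodes
    cluster.partitioned = True   # the link between the nodes breaks
    cluster.write("v2")          # reaches the primary only
    try:
        return cluster.read_from_secondary()
    except TimeoutError:
        return "error"


print(demo("AP"))  # stale value
print(demo("CP"))  # error instead of a stale value
```

In AP mode the read succeeds but returns the older value; in CP mode the read fails, matching the two outcomes listed above.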
14. 2. Big Data is Data Management in the back
NOSQL and distributed systems (network, shared-data)
CAP theorem for designing distributed systems
Too simplistic, but good for learning the basics
PACELC extends CAP
P(A|C)E(L|C) = if P, choose A or C; Else, choose L (latency) or C
15. 2. Big Data is Data Management in the back
NOSQL and distributed systems (network, shared-data)
BASE of NOSQL (contrasting ACID of RDBMS)
Suggested by the same person as the CAP theorem (Eric Brewer)
Basically available: guarantees CAP Availability
Soft state: system state may change over time
Eventual consistency: system will become consistent over time
16. 2. Big Data is Data Management in the back
Batch vs. stream vs. interactive processing
Batch: actions applied periodically to data in bulk
Example: Extract-Transform-Load (ETL)
Real-time: computation applied to streams as data arrives
Example: analyse weather sensor data
Example: Machine Learning algorithms
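The difference between batch and stream processing can be sketched with the same computation done both ways: an average temperature. The readings are invented sample values, and the two function names are hypothetical.

```python
def batch_mean(readings):
    """Batch: wait until all data is collected, then compute once."""
    return sum(readings) / len(readings)


def stream_means(readings):
    """Stream: update the result as each reading arrives, keeping only
    a counter and the current mean in memory."""
    count = 0
    mean = 0.0
    for value in readings:
        count += 1
        mean += (value - mean) / count  # incremental mean update
        yield mean


temperatures = [20.0, 22.0, 21.0, 25.0]  # hypothetical sensor readings

print(batch_mean(temperatures))          # one answer at the end
print(list(stream_means(temperatures)))  # an up-to-date answer after each event
```

Both approaches agree once all readings are in; the streaming version simply has an answer available at every step without storing the full history.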
17. 2. Big Data is Data Management in the back
Lambda vs. Kappa architectures
Lambda architecture
Three layers: batch, speed, and serving
Source: MapR - Lambda Architecture
18. 2. Big Data is Data Management in the back
Lambda vs. Kappa architectures
Kappa architecture
Batch layer omitted => batch is a special case of stream
Source: O'Reilly: Applying the Kappa architecture in the telco industry
19. 2. Big Data is Data Management in the back
Data Warehouse can be implemented on top of Data Lake
Data Lake | Data Warehouse
Repository of raw data in its original form | A well-structured data repository
Append-only, read-only | Read and write
Schema-on-read (no predefined schema) | Schema-on-write (well-predefined schema)
ELT (Extract, Load, Transform) | ETL (Extract, Transform, Load)
Open to any access tools incl. DWH tools | BI and OLAP tools and standards
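Schema-on-read and schema-on-write can be contrasted in a few lines. This is a sketch under assumptions: the two-field schema (id, temp), the function names and the sample lines are all hypothetical, and plain Python lists stand in for the lake and the warehouse.

```python
import json

SCHEMA = {"id": int, "temp": float}  # hypothetical predefined schema


def write_to_warehouse(table, record):
    """Schema-on-write: validate against the schema BEFORE storing."""
    for field, expected_type in SCHEMA.items():
        if not isinstance(record.get(field), expected_type):
            raise ValueError(f"field {field!r} does not match the schema")
    table.append({field: record[field] for field in SCHEMA})


def write_to_lake(lake, raw_line):
    """Data lake: store the raw data as is, with no checks."""
    lake.append(raw_line)


def read_from_lake(lake):
    """Schema-on-read: interpret the raw lines only when they are queried."""
    for line in lake:
        try:
            record = json.loads(line)
            yield {"id": int(record["id"]), "temp": float(record["temp"])}
        except (ValueError, KeyError, TypeError):
            continue  # lines that do not fit this reader's schema are skipped


lake = []
for raw in ['{"id": 1, "temp": "21.5"}', "not json", '{"id": 2}']:
    write_to_lake(lake, raw)   # everything is accepted

warehouse = []
write_to_warehouse(warehouse, {"id": 1, "temp": 21.5})   # accepted
# write_to_warehouse(warehouse, {"id": 2, "temp": "hot"}) would raise ValueError

clean = list(read_from_lake(lake))
print(clean)
```

The lake accepts every line and defers interpretation to the reader, while the warehouse rejects non-conforming records at load time.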
20. 3. Think big, think distributed
Adaptation: now we deal with cluster-wide, large-scale data
New essential factors come into play:
Movement (aka shuffling) of large data
Reading and writing of large data
MUST-know: fault tolerance, replication, high availability, distributed file systems, in addition to the previous concepts
Advice: learn them from Hadoop (HDFS) and Apache Spark
21. 4. Adopt an Optimizer way of thinking
History: my code works!
Now: my code works fast
Slow code ~= non-working code
How fast does my app get the job done? (performance)
How much output does my app generate? (throughput)
Tuning and optimization are your new concerns, e.g.:
Reduce shuffled data (moved)
Reduce data written to/read from disk
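One common way to reduce shuffled data is to pre-aggregate on each node before anything is moved (often called a combiner or map-side aggregation). The sketch below simulates this for a word count in plain Python; the two partitions and the function names are hypothetical, and no cluster framework is involved.

```python
from collections import Counter

# Each inner list stands for the data held by one node (hypothetical sample).
partitions = [["a", "b", "a", "a"], ["b", "b", "c"]]


def shuffle_without_combiner(partitions):
    """Every single (word, 1) record is sent across the network."""
    return [(word, 1) for part in partitions for word in part]


def shuffle_with_combiner(partitions):
    """Each node first sums its own counts, then sends one record per word."""
    shuffled = []
    for part in partitions:
        shuffled.extend(Counter(part).items())
    return shuffled


def reduce_counts(records):
    """Final aggregation on the receiving side."""
    totals = Counter()
    for word, count in records:
        totals[word] += count
    return dict(totals)


naive = shuffle_without_combiner(partitions)
combined = shuffle_with_combiner(partitions)

print(len(naive), len(combined))   # number of records moved in each case
print(reduce_counts(naive) == reduce_counts(combined))
```

Both variants produce the same final counts, but the pre-aggregated one moves fewer records between nodes, and the gap widens as the data grows.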
22. General advice and comments
Don't move to big data settings if you don't have to
Don't hesitate to start if you feel like it's a lot of fun! :)
For people who intend to do research related to big data:
"I have an idea, I just need to implement it" becomes
"I just have an idea, I need to implement it"
Two phases instead of one:
1. Make it work on your single machine
2. Make it work on your cluster, then optimize
But it's still a lot of fun!
Can all that fade away? Yes, as anything can, but it is unlikely any time soon
23. Wrap-up
1. Big Data is a Way of thinking not a Domain
2. Big Data is Data Management in the back
3. Think big, think distributed
4. Adopt an Optimizer way of thinking