1
Introduction to Big Data
Dr. Amira Abdelatey
2
How to manage very large amounts of data and extract
value and knowledge from them
3
Introduction
- Big data refers to extremely large and diverse collections
  of structured, unstructured, and semi-structured data that
  continue to grow exponentially over time. These datasets
  are so large and complex in volume, velocity, and
  variety that traditional data management systems cannot
  store, process, or analyze them.
- The amount and availability of data are growing rapidly,
  spurred on by advances in digital technology such as
  connectivity, mobility, the Internet of Things (IoT), and
  artificial intelligence (AI).
- Big data tools are emerging to help companies collect,
  process, and analyze data at the speed needed to gain the
  most value from it.
4
Introduction
What is Big Data?
What makes data “Big Data”?
5
Big Data Definition
- No single standard definition…
“Big Data” is data whose scale, diversity,
and complexity require new architecture,
techniques, algorithms, and analytics to
manage it and extract value and hidden
knowledge from it…
6
Characteristics of Big Data:
1-Volume (Scale)
Exponential increase in
collected/generated data
- Data Volume
  - 44x increase from 2009 to 2020
  - From 0.8 zettabytes (ZB) to 35 ZB
- Data volume is increasing exponentially
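The volume figures on this slide can be checked with a little arithmetic. The sketch below takes the 0.8 ZB starting point and the 44x factor from the slide; the constant yearly growth rate is an assumption made here for illustration, and the variable names are hypothetical.

```python
# Figures taken from the slide: 0.8 ZB in 2009, a 44x increase by 2020.
start_zb = 0.8
growth_factor = 44
years = 2020 - 2009  # 11 years

# Projected volume in 2020 (the slide rounds this to 35 ZB).
end_zb = start_zb * growth_factor

# Assumption (not from the slide): growth compounds at a constant annual rate.
annual_rate = growth_factor ** (1 / years) - 1

print(f"Projected 2020 volume: {end_zb:.1f} ZB")
print(f"Implied annual growth: {annual_rate:.0%}")
```

Under that assumption, a 44x increase over 11 years corresponds to growth of roughly 41% per year.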
7
Characteristics of Big Data:
2-Variety (Complexity)
- Different types of data
  - Relational Data (Tables/Transaction/Legacy Data)
  - Text Data (Web)
  - Semi-structured Data (XML)
  - Graph Data
    - Social Network, Semantic Web (RDF), …
  - Streaming Data (stream vs. static)
    - You can only scan the data once
- A single application can be generating/collecting many
  types of data
- Big Public Data (online, weather, finance, etc.)
To extract knowledge, all these types of data need to be linked together
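The constraint that streaming data can only be scanned once shapes the algorithms that can be used on it. As a minimal sketch, the function below computes a count, mean, and variance in a single pass using Welford's method. The function name `running_stats` and the sample values are hypothetical and not part of the lecture.

```python
from typing import Iterable, Tuple


def running_stats(stream: Iterable[float]) -> Tuple[int, float, float]:
    """Return (count, mean, population variance) in a single pass (Welford's method)."""
    count = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in stream:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)
    variance = m2 / count if count else 0.0
    return count, mean, variance


# A generator can be consumed only once, much like a real data stream.
readings = (float(v) for v in [2, 4, 4, 4, 5, 5, 7, 9])
count, mean, variance = running_stats(readings)
print(count, mean, variance)
```

Nothing is stored except three running values, so memory use stays constant however long the stream is.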
8
Characteristics of Big Data:
3-Velocity (Speed)
- Data is being generated fast and needs to be processed fast
- Online Data Analytics
- Late decisions → missing opportunities
- Examples
  - E-Promotions: based on your current location, your purchase history, and what you
    like → send promotions right now for the store next to you
  - Healthcare monitoring: sensors monitor your activities and body → any
    abnormal measurements require immediate reaction
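The healthcare monitoring example can be illustrated with a small sketch that reacts to each reading as it arrives instead of waiting for a batch. The heart-rate thresholds, the `alerts` function, and the sample readings are all hypothetical values chosen for illustration.

```python
from typing import Iterable, Iterator, Tuple

# Hypothetical safe range for a heart-rate sensor, in beats per minute.
LOW_BPM, HIGH_BPM = 50, 120


def alerts(readings: Iterable[Tuple[int, int]]) -> Iterator[str]:
    """Yield an alert as soon as a (timestamp, bpm) reading falls outside the range."""
    for timestamp, bpm in readings:
        if bpm < LOW_BPM or bpm > HIGH_BPM:
            yield f"t={timestamp}s: abnormal heart rate {bpm} bpm"


# Made-up sample readings: (seconds since start, beats per minute).
sample = [(0, 72), (1, 75), (2, 131), (3, 74), (4, 45)]
found = list(alerts(sample))
for message in found:
    print(message)
```

Because `alerts` is a generator, each alert is available the moment its reading is processed, which is the property the velocity requirement asks for.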
9
Real-time/Fast Data
Social media and networks (all of us are generating data)
Scientific instruments (collecting all sorts of data)
Mobile devices (tracking all objects all the time)
Sensor technology and networks (measuring all kinds of data)
- Progress and innovation are no longer hindered by the ability to collect data
- But by the ability to manage, analyze, summarize, visualize, and discover knowledge
  from the collected data in a timely manner and in a scalable fashion
10
Real-Time Analytics/Decision Requirement
Customer: Influence Behavior
- Product Recommendations that are Relevant & Compelling
- Friend Invitations to join a Game or Activity that expands business
- Preventing Fraud as it is Occurring & preventing more proactively
- Learning why Customers Switch to competitors and their offers, in time to Counter
- Improving the Marketing Effectiveness of a Promotion while it is still in Play
11
Some Make It 4 V’s
12
Harnessing Big Data
- OLTP: Online Transaction Processing (DBMSs)
- OLAP: Online Analytical Processing (Data Warehousing)
- RTAP: Real-Time Analytics Processing (Big Data Architecture & Technology)
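The difference between the first two workloads can be sketched with Python's built-in `sqlite3` module. This is only an illustration: an in-memory SQLite database stands in for both a transactional DBMS and a data warehouse, the `sales` table and its rows are made up, and RTAP is not covered.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, region TEXT, amount REAL)")

# OLTP-style work: many small transactions, each touching a few rows.
for region, amount in [("north", 10.0), ("south", 25.0), ("north", 5.0)]:
    with conn:  # commits on success, rolls back on error
        conn.execute("INSERT INTO sales (region, amount) VALUES (?, ?)", (region, amount))

# OLAP-style work: one read-only query that aggregates over many rows.
totals = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(totals)
conn.close()
```

In practice the two workloads usually run on separate systems, because row-at-a-time writes and large aggregate scans favor different storage layouts.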
13
The Model Has Changed…
- The Model of Generating/Consuming Data has Changed
Old Model: A few companies generate data; all others consume data
New Model: All of us generate data, and all of us consume data
14
What’s driving Big Data
- Ad-hoc querying and reporting
- Data mining techniques
- Structured data, typical sources
- Small to mid-size datasets
- Optimizations and predictive analytics
- Complex statistical analysis
- All types of data, and many sources
- Very large datasets
- More of a real-time
15
Cloud Computing
- The cloud is a distributed collection of servers that host software
  and infrastructure, and it is accessed over the Internet
- IT resources provided as a service
  - Compute, storage, databases, queues
- Clouds leverage economies of scale of commodity hardware
  - Cheap storage, high-bandwidth networks & multicore processors
  - Geographically distributed data centers
- Offerings from Microsoft, Amazon, Google, …
16
wikipedia:Cloud Computing
17
Benefits
- Cost & management
  - Economies of scale, “out-sourced” resource management
- Reduced time to deployment
  - Ease of assembly, works “out of the box”
- Scaling
  - On-demand provisioning, co-locate data and compute
- Reliability
  - Massive, redundant, shared resources
- Sustainability
  - Hardware not owned
18
Types of Cloud Computing
- Public Cloud: Computing infrastructure is hosted at the
  vendor’s premises.
- Private Cloud: Computing architecture is dedicated to the
  customer and is not shared with other organizations.
- Hybrid Cloud: Organizations host some critical, secure
  applications in private clouds. Less critical applications
  are hosted in the public cloud.
  - Cloud bursting: the organization uses its own infrastructure for normal
    usage, but the cloud is used for peak loads.
- Community Cloud
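The cloud bursting idea can be sketched as a simple capacity split. The function name `split_load`, the capacity of 1000 requests, and the request counts are hypothetical; a real deployment would make this decision inside a load balancer or autoscaler rather than in application code.

```python
from typing import Tuple


def split_load(requests: int, on_prem_capacity: int) -> Tuple[int, int]:
    """Return (handled on premises, burst to public cloud) for one time window."""
    on_prem = min(requests, on_prem_capacity)
    burst = requests - on_prem
    return on_prem, burst


# Normal usage fits on the organization's own infrastructure.
normal = split_load(requests=800, on_prem_capacity=1000)
# A peak load overflows, and only the excess goes to the public cloud.
peak = split_load(requests=1500, on_prem_capacity=1000)
print(normal, peak)
```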
19
Classification of Cloud Computing
based on Service Provided
- Infrastructure as a Service (IaaS)
  - Offering hardware-related services using the principles of cloud computing. These could
    include storage services (database or disk storage) or virtual servers.
  - Amazon EC2, Amazon S3, Rackspace Cloud Servers, and Flexiscale.
- Platform as a Service (PaaS)
  - Offering a development platform on the cloud.
  - Google App Engine, Microsoft Azure.
- Software as a Service (SaaS)
  - Offering a complete software product on the cloud. Users
    can access a software application hosted by the cloud vendor
    on a pay-per-use basis. This is a well-established sector.
  - Salesforce.com’s offering in the online Customer Relationship
    Management (CRM) space, Google’s Gmail, Microsoft’s
    Hotmail, and Google Docs.
20
Topics overview
Section
- Postgres with Python
- ETL with Python
- Framework for big data (Python Spark)
- Data modelling
- Data warehouse
- Dimensional data modeling
- ETL
- The power of Spark
- Big data processing pipeline
- Data wrangling with Spark
- Natural Language Processing
- Association rule mining
