Big Data - HDInsight and Power BI
Prasad Prabhu
WHAT IS BIG DATA?
INFO IN TABULAR FORMAT
ROWS & COLUMNS
DEFINED SCHEMA
PRIMARY KEY
RELATIONSHIPS
FOREIGN KEY
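A minimal sketch of what these terms mean in practice, using Python's built-in sqlite3 module. The table and column names (customer, sales_order) and the sample rows are assumptions made up for illustration; they do not come from the slides.

```python
import sqlite3

# In-memory database; all table and column names below are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled

# Defined schema: every row must fit these columns and types.
conn.executescript("""
CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,   -- primary key: uniquely identifies a row
    name        TEXT NOT NULL
);
CREATE TABLE sales_order (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customer(customer_id),  -- foreign key
    amount      REAL NOT NULL
);
""")

conn.execute("INSERT INTO customer VALUES (1, 'Contoso')")
conn.execute("INSERT INTO sales_order VALUES (10, 1, 250.0)")


def insert_orphan_order(connection):
    """Try to insert an order for a customer that does not exist."""
    try:
        connection.execute("INSERT INTO sales_order VALUES (11, 99, 75.0)")
    except sqlite3.IntegrityError:
        return False  # rejected: the relationship is enforced on write
    return True


# The relationship lets us join the two tables.
rows = conn.execute(
    "SELECT c.name, o.amount "
    "FROM sales_order AS o JOIN customer AS c ON c.customer_id = o.customer_id"
).fetchall()
```

The point of the sketch is that structure is checked when data is written: a row that breaks the schema or the relationship is refused.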
How do we analyze this data?
TRADITIONAL DW/BI ENVIRONMENT 
Data Warehouse 
ETL 
ERP / CRM
EVOLUTION OF DATA
[Figure: data sources grouped by category, with volume plotted against velocity and variety, and storage cost per GB over time]
ERP / CRM: Sales Pipeline, Payables, Payroll, Inventory, Contacts, Deal Tracking
WEB 2.0: Mobile, Advertising, eCommerce, Collaboration, Digital Marketing, Search Marketing, Web Logs, Recommendations
Internet of things: Wikis / Blogs, Audio / Video, Log Files, Text / Image, Social Sentiment, Data Market Feeds, eGov Feeds, Weather, Click Stream, Sensors / RFID / Devices, Spatial & GPS Coordinates
Volume axis: Gigabytes (10^9), Terabytes (10^12), Petabytes (10^15), Exabytes (10^18)
Storage/GB: 1980 $190,000; 1990 $9,000; 2000 $15; 2010 $0.07
DATA IS GROWING
90% of the world's data has been created in the last 2 years
Source: SINTEF
Source: IBM
3 Vs OF BIG DATA
VOLUME 
(Size) 
VARIETY 
(Structure) 
VELOCITY 
(Speed)
How do we handle this massive amount of data, which arrives in different forms and at speed?
TOMORROW'S DW/BI ENVIRONMENT
Business Critical 
Data Warehouse 
ETL 
New data sources
WHAT IS HADOOP? 
Apache Hadoop is an open source system to reliably store and process a LOT of information across many commodity computers.
It began life as an open source implementation of Google's MapReduce and GFS papers. It is now used at many major web companies at massive scale (1000s of nodes, PBs of storage).
Key attributes:
 • Open source
 • Highly scalable
 • Runs on commodity hardware
 • Redundant and reliable (replication protects against data loss)
 • Batch processing centric, using the MapReduce processing paradigm
2 CORE COMPONENTS OF HADOOP 
Distributed Processing 
(MapReduce) 
Distributed Storage 
(HDFS)
HADOOP IS JUST A FILE SYSTEM
[Diagram: a head node and five data nodes storing a file]
HADOOP IS JUST A FILE SYSTEM
[Diagram: a head node and five data nodes; the file is replicated 3 times across the data nodes]
Read Optimised & Failure Tolerant
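The "replicated 3 times" idea can be sketched with a toy model. This is an illustration only: the round-robin placement, the node names, and both helper functions are assumptions, and real HDFS uses its own placement policy.

```python
from itertools import combinations


def place_blocks(num_blocks, nodes, replication=3):
    """Toy placement: put each block on `replication` distinct nodes, round-robin."""
    placement = {}
    for block in range(num_blocks):
        placement[block] = [
            nodes[(block + offset) % len(nodes)] for offset in range(replication)
        ]
    return placement


def readable_after_failure(placement, failed_nodes):
    """True if every block still has at least one replica on a surviving node."""
    failed = set(failed_nodes)
    return all(
        any(node not in failed for node in replicas)
        for replicas in placement.values()
    )


# Five data nodes, as in the diagram; the names are hypothetical.
nodes = [f"data-node-{i}" for i in range(1, 6)]
placement = place_blocks(num_blocks=4, nodes=nodes)

# With three copies of every block, losing any two nodes leaves the file readable.
survives_any_two = all(
    readable_after_failure(placement, pair) for pair in combinations(nodes, 2)
)
```

In this model the file only becomes unreadable when all three nodes holding some block fail at once, which is the sense in which replication makes storage failure tolerant.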
MAP + REDUCE = EXTRACT, LOAD + TRANSFORM 
[Diagram: four raw data inputs each feed a mapper (MAP); the mappers' intermediate data feeds a single reducer (REDUCE), which produces the output]
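The data flow in this diagram can be sketched in a few lines of plain Python. This is a single-process illustration of the paradigm, not Hadoop's API; the function names and the sample lines are assumptions.

```python
from collections import defaultdict


def mapper(line):
    """Map: turn one raw input line into (key, value) pairs."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group all values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reducer(key, values):
    """Reduce: collapse all values for one key into a single result."""
    return key, sum(values)


# Hypothetical raw data; in Hadoop each line could sit on a different node.
raw_data = ["big data is big", "data is growing"]

mapped = [pair for line in raw_data for pair in mapper(line)]
output = dict(reducer(key, values) for key, values in shuffle(mapped).items())
```

Each mapper call needs only its own line, which is why the map step can run on many machines in parallel before the reducer combines the results.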
MAP REDUCE ANALOGY: BLOGGER ANALYSIS
Hi John,
As you know, we are building the blogging platform blogger2.com, and I need some statistics. Across all blogs ever written on blogger.com, I need to find out how many times one-character words occur (like 'a', 'I'), how many times two-character words occur (like 'be', 'is'), and so on, up to ten-character words. The report should look like this:
 • Occurrence of one-character words: around 937688399933
 • Occurrence of two-character words: around 23388383830753434
 • ... and so on up to 10
I know it's a really big job, so I will assign all 50,000 employees working in our company to work with you on this for a week. I am going on vacation for a week, and it's really important that I have this when I return.
Good luck.
Regards,
CEO
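One way John might organize the work, sketched in Python: each employee tallies word lengths for their own share of the blogs (the map step), and the tallies are then added together (the reduce step). The function names, the punctuation handling, and the two sample blogs are assumptions for illustration.

```python
from collections import Counter


def count_word_lengths(blog_text, max_length=10):
    """Map step: one employee tallies word lengths for one blog."""
    counts = Counter()
    for word in blog_text.split():
        letters = word.strip(".,!?;:'\"()")  # ignore surrounding punctuation
        if 1 <= len(letters) <= max_length:
            counts[len(letters)] += 1
    return counts


def combine(tallies):
    """Reduce step: add up the tallies handed in by every employee."""
    total = Counter()
    for tally in tallies:
        total.update(tally)
    return total


# Hypothetical sample: two blogs, each handled by a different employee.
blogs = ["I am a blogger", "Be kind, it is a big job"]
totals = combine(count_word_lengths(blog) for blog in blogs)

for length in sorted(totals):
    print(f"Occurrence of {length}-character words: {totals[length]}")
```

No employee needs to see anyone else's blogs, so adding more employees shortens the job, which is the point of the analogy.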
THE ECOSYSTEM
[Diagram: Query (Hive), Distributed Processing (MapReduce), Distributed Storage (HDFS), ODBC]
Legend: Red = Core Hadoop; Blue = Data processing; Purple = Microsoft integration points and value adds; Orange = Data Movement
HADOOP SOLUTIONS
INTRODUCING HDINSIGHT
 • HDInsight is Microsoft's 100% Apache-compatible Hadoop distribution
 • Available as a Microsoft Azure service
 • Develop in .NET and Java
 • Built on Hortonworks Data Platform (HDP)
 • Can be automated with PowerShell and the command line
 • Gives organizations new insights into previously untouched unstructured data, while connecting to widely used BI tools such as Excel
HDINSIGHT ARCHITECTURE
DEMO
RUNNING A MAP REDUCE JOB
USE C# - WORD COUNT
CONTINUED..
RUN SQL-LIKE COMMANDS USING HIVEQL
COMMON SCENARIOS
SENSOR DATA IN NFL
CLICKSTREAM & HEATMAP
USING EXCEL TO CONNECT TO HDINSIGHT
POWER BI = POWER PIVOT + POWER QUERY + POWER MAP
NATURAL LANGUAGE USING POWER BI
SUMMARY
 • Data is growing and is not necessarily structured
 • Storage is really cheap
 • We need systems that enforce structure on read rather than on write
 • Don't just validate data: analyze it to find patterns, perform exploratory analysis, and predict outcomes
 • Find ways to make big data simpler for business users, empowering them so that the business can make more informed decisions
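Enforcing structure on read rather than on write can be sketched as follows. Raw records are stored exactly as they arrive, and each reader applies the fields it cares about only when it queries. The record contents, field names, and the read_with_schema helper are assumptions for illustration.

```python
import json

# Raw events are stored as they arrive; nothing is validated on write.
RAW_EVENTS = [
    '{"device": "sensor-1", "temp_c": 21.5}',
    '{"user": "alice", "page": "/home", "ms": 120}',
    "not valid json at all",
]


def read_with_schema(raw_lines, fields):
    """Apply a schema at read time: yield only records that have every field."""
    for line in raw_lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # unreadable for this query, but still kept in storage
        if isinstance(record, dict) and all(field in record for field in fields):
            yield {field: record[field] for field in fields}


# Two different "schemas" applied to the same stored data.
sensor_rows = list(read_with_schema(RAW_EVENTS, ["device", "temp_c"]))
click_rows = list(read_with_schema(RAW_EVENTS, ["user", "page"]))
```

The same raw store answers both questions, and a new question later needs only a new field list, not a reload of the data.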
[Diagram: Data, Hadoop, Analytics]
Q&A
REFERENCE LINKS
http://azure.microsoft.com/bigdata
http://www.microsoft.com/powerbi
Sign up for a 30-day free trial
THANK YOU
