Big Data is a hot topic that has drawn the attention of the IT industry globally. The term describes the exponential growth and availability of data, both structured and unstructured. Big data may become as important to business and society as the Internet has. More accurate analyses may lead to more confident decision making, and better decisions can mean greater operational efficiency, lower costs and reduced risk.
This presentation covers the why, what and how of big data and explores some of Microsoft's big data solutions, the HDInsight Azure service and Power BI, providing insights into the world of big data.
16. WHAT IS HADOOP?
Apache Hadoop is an open source system to reliably store and process a LOT of information across many commodity computers
Began life as an open source implementation of Google's MapReduce and GFS papers. Now used at many major web companies at massive scale (1000s of nodes, PBs of storage)
Key attributes:
Open source
Highly scalable
Runs on commodity hardware
Redundant and reliable (no data loss)
Batch processing centric, using the Map-Reduce processing paradigm
18. HADOOP IS JUST A FILE SYSTEM
Diagram: one Head Node above five Data Nodes, with a File stored on the Data Nodes.
19. HADOOP IS JUST A FILE SYSTEM
Diagram: one Head Node above five Data Nodes, with the File replicated 3 times across the Data Nodes.
Read optimised and failure tolerant
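To make the "replicated 3 times" point concrete, here is a toy sketch in Python. It assumes a simple round-robin placement over five hypothetical nodes (`node1` to `node5`); real HDFS uses its own placement policy, so this only illustrates why three replicas tolerate node failures.

```python
from itertools import combinations

REPLICATION = 3  # replication factor from the slide
DATA_NODES = ["node1", "node2", "node3", "node4", "node5"]  # hypothetical names


def place_blocks(num_blocks, nodes=DATA_NODES, replication=REPLICATION):
    """Assign each block to `replication` distinct nodes, round-robin.

    This is an illustrative policy only, not the one HDFS uses.
    """
    placement = {}
    for block in range(num_blocks):
        placement[block] = {
            nodes[(block + i) % len(nodes)] for i in range(replication)
        }
    return placement


def readable_after_failure(placement, failed):
    """True if every block still has at least one replica on a live node."""
    failed = set(failed)
    return all(replicas - failed for replicas in placement.values())


placement = place_blocks(num_blocks=8)

# Check every possible pair of failed nodes.
survives_two = all(
    readable_after_failure(placement, failed)
    for failed in combinations(DATA_NODES, 2)
)
# Check every possible triple of failed nodes.
survives_three = all(
    readable_after_failure(placement, failed)
    for failed in combinations(DATA_NODES, 3)
)
print(survives_two, survives_three)  # prints "True False"
```

With three replicas, any two nodes can fail at once and every block is still readable. Losing all three nodes that hold the same block is the case replication alone does not cover.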
20. MAP + REDUCE = EXTRACT, LOAD + TRANSFORM
Diagram, MAP stage: four Raw Data inputs, each read by its own Mapper, each Mapper producing Data.
Diagram, REDUCE stage: a single Reducer combines the Data from all Mappers into one Output.
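The dataflow in the diagram can be sketched in plain Python as below. This is a single-process illustration, not Hadoop code: the log lines, `mapper` and `reducer` are hypothetical, and the comments follow one reading of the slide title (map as extract and load, reduce as transform).

```python
from functools import reduce

# Four partitions of raw data, one per mapper (hypothetical log lines).
raw_partitions = [
    ["GET /home", "GET /about"],
    ["POST /login", "GET /home"],
    ["GET /home"],
    ["POST /login", "GET /contact"],
]


def mapper(lines):
    """Map: extract a (key, 1) pair from each raw line, keyed by URL path."""
    return [(line.split()[1], 1) for line in lines]


def reducer(totals, pair):
    """Reduce: fold one (key, count) pair into the running totals."""
    key, count = pair
    totals[key] = totals.get(key, 0) + count
    return totals


# Map stage: each partition is processed independently.
# Hadoop would run these on different nodes in parallel.
mapped = [mapper(partition) for partition in raw_partitions]

# Reduce stage: a single reducer combines every mapper's output.
all_pairs = [pair for pairs in mapped for pair in pairs]
output = reduce(reducer, all_pairs, {})
print(output)  # {'/home': 3, '/about': 1, '/login': 2, '/contact': 1}
```

Because each mapper only sees its own partition, the map stage scales by adding machines. Only the much smaller mapper output has to travel to the reducer.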
21. MAP REDUCE ANALOGY: BLOGGER ANALYSIS
Hi John,
As you know, we are building the blogging platform blogger2.com, and I need some statistics. Across all blogs ever written on blogger.com, I need to know how many times one-character words occur (like 'a', 'I'), how many times two-character words occur (like 'be', 'is'), and so on, up to how many times ten-character words occur.
Occurrence of one-character words: around 937688399933
Occurrence of two-character words: around 23388383830753434
... and so on up to ten
I know it's a really big job, so I will assign all 50,000 employees working in our company to work with you on this for a week. I am going on vacation for a week, and it's really important that I have this when I return.
Good luck.
Regards,
CEO
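The CEO's request maps naturally onto map and reduce steps. The sketch below is a minimal single-machine version in Python, assuming three hypothetical blog posts and whitespace-separated words with no punctuation handling. Each "employee" tallies one blog, and the tallies are then added up.

```python
from collections import Counter

MAX_LENGTH = 10  # the CEO asks for word lengths 1 through 10

# Hypothetical blog posts; in the analogy each employee gets a share of them.
blogs = [
    "I am a blogger",
    "It is a big job",
    "Be on vacation",
]


def count_word_lengths(text):
    """Map step: one employee tallies word lengths in one blog."""
    lengths = (len(word) for word in text.split())
    return Counter(n for n in lengths if 1 <= n <= MAX_LENGTH)


def combine(partial_counts):
    """Reduce step: add up every employee's tally."""
    total = Counter()
    for counts in partial_counts:
        total += counts
    return total


totals = combine(count_word_lengths(blog) for blog in blogs)
for length in range(1, MAX_LENGTH + 1):
    print(f"Occurrence of {length} character words: {totals[length]}")
```

The tallies are small and independent of each other, so the work can be split across any number of people (or machines), with only the final addition done in one place.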
22. THE ECOSYSTEM
Query (Hive)
Distributed Processing (MapReduce)
Distributed Storage (HDFS)
ODBC
Legend:
Red = Core Hadoop
Blue = Data processing
Purple = Microsoft integration points and value adds
Orange = Data movement
25. INTRODUCING HDINSIGHT
HDInsight is Microsoft's 100% Apache compatible Hadoop distribution
Available as a Microsoft Azure service
Develop in .NET and Java
Built on Hortonworks Data Platform (HDP)
Can be automated with PowerShell and Command Line
Empowers organizations with new insights on previously untouched unstructured data, while connecting to the most widely used BI tools on the planet
39. SUMMARY
Growing data, not necessarily structured
Storage is really cheap
Need systems that enforce structure on read, not on write
Don't just validate: analyze and find patterns, perform exploratory analysis, predict outcomes
Find ways to make big data simpler for business users and empower them, so that the business can make more informed decisions
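As a small illustration of enforcing structure on read instead of on write, the Python sketch below keeps raw records untouched and applies a schema only when a query reads them. The records, the field names `user` and `amount`, and the function `read_with_schema` are all hypothetical.

```python
import json

# Raw records are stored exactly as they arrive; nothing is validated on write.
raw_store = [
    '{"user": "ann", "amount": "12.50", "country": "IN"}',
    '{"user": "bob", "amount": 7}',
    'not even json',
]


def read_with_schema(lines):
    """Apply structure at read time; skip records that don't fit this query."""
    for line in lines:
        try:
            record = json.loads(line)
            yield {
                "user": str(record["user"]),
                "amount": float(record["amount"]),
            }
        except (ValueError, KeyError, TypeError):
            continue  # the raw line stays in storage for other readers


rows = list(read_with_schema(raw_store))
print(rows)  # [{'user': 'ann', 'amount': 12.5}, {'user': 'bob', 'amount': 7.0}]
```

A different analysis could read the same raw store with a different schema, for example one that keeps `country`, without rewriting any stored data.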