Big Data Analytics
Module 1
Introduction to Big Data
Data
• Data is a set of values that represent a concept or concepts. It can be raw
information, such as numbers or text, or it can be more complex, such as images,
graphics, or videos.
Characteristics of Data
Composition: deals with the structure of data, that is, the sources of data, its types, and
its nature (whether it is static or real-time streaming).
Condition: deals with the state of the data, that is, “Can one use this data as is for
analysis?” or “Does it require cleansing for further enhancement and enrichment?”
Context: deals with questions such as “Where has this data been generated?” and “Why
was this data generated?”
In simple terms, the characteristics of data include:
• Accuracy
• Completeness
• Consistency
• Timeliness
• Validity
• Uniqueness
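Several of these characteristics can be measured directly. The following is a minimal Python sketch, not a standard method: the records, the field names, the 60-day freshness threshold, and the simple email rule are all assumptions made up for illustration.

```python
from datetime import date

# Hypothetical customer records; None marks a missing value.
records = [
    {"id": 1, "email": "a@example.com", "updated": date(2024, 1, 5)},
    {"id": 2, "email": None, "updated": date(2024, 2, 10)},
    {"id": 2, "email": "b@example.com", "updated": date(2024, 2, 10)},
    {"id": 3, "email": "not-an-email", "updated": date(2023, 12, 1)},
]


def completeness(rows, field):
    """Fraction of rows in which `field` is present (not None)."""
    return sum(r[field] is not None for r in rows) / len(rows)


def uniqueness(rows, field):
    """Number of distinct `field` values divided by the number of rows."""
    return len({r[field] for r in rows}) / len(rows)


def validity(rows, field, is_valid):
    """Fraction of non-missing values that pass the `is_valid` rule."""
    values = [r[field] for r in rows if r[field] is not None]
    return sum(is_valid(v) for v in values) / len(values)


def timeliness(rows, field, as_of, max_age_days):
    """Fraction of rows updated within `max_age_days` of `as_of`."""
    return sum((as_of - r[field]).days <= max_age_days for r in rows) / len(rows)


def looks_like_email(value):
    # Deliberately simple rule for illustration, not a full email validator.
    return "@" in value and "." in value.split("@")[-1]


report = {
    "completeness(email)": completeness(records, "email"),
    "uniqueness(id)": uniqueness(records, "id"),
    "validity(email)": validity(records, "email", looks_like_email),
    "timeliness(updated)": timeliness(records, "updated", date(2024, 3, 1), 60),
}
print(report)
```

Each score is a fraction between 0 and 1, so a low value points to the characteristic that needs attention before analysis.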
Characteristics of Big Data
The characteristics of big data include the following. [Figure not reproduced.]
Evolution of Big Data
• 1970s and before – Mainframes: basic data storage; data is structured.
• 1980s and 1990s – Relational databases: data has both structure and relationships.
• 2000s and beyond – Structured, unstructured, and multimedia data, driven by the
World Wide Web.
Several milestones in the evolution of big data are described below:
Data Warehousing:
In the 1990s, data warehousing emerged as a solution to store and analyze large
volumes of structured data.
Hadoop:
Hadoop was introduced in 2006 by Doug Cutting and Mike Cafarella. It is an open-source
framework that provides distributed storage and processing of large datasets.
NoSQL Databases:
Around 2009, the term NoSQL came into wide use for non-relational databases, which
provide a flexible way to store and retrieve unstructured data.
At present, companies widely use technologies such as cloud computing and machine
learning to reduce maintenance and infrastructure costs and to extract useful insights
from big data.
Challenges with Big Data
• Data Volume: Managing and Storing Massive Amounts of Data
• Data Variety: Handling Diverse Data Types
• Data Velocity: Processing Data in Real-Time
• Data Veracity: Ensuring Data Quality and Accuracy
• Data Security and Privacy: Protecting Sensitive Information
• Data Integration: Combining Data from Multiple Sources
• Data Analytics: Extracting Valuable Insights
• Data Governance: Establishing Policies and Standards
Data Warehouse Environment
Traditional Business Intelligence versus Big Data
Importance of Big Data
• Enhanced Decision-Making (analyzing vast amounts of data to discover new patterns
and trends)
• Understanding Consumer Behavior (for recommendations)
• Competitive Advantage (competitor analysis, market trends)
• Innovation and New Opportunities (revealing gaps in existing products or services)
• Efficiency and Cost Reduction (optimizing processes to reduce waste and improve
resource allocation)
• Improved Risk Management (advanced modelling and simulation)
• Enhanced Public Services (traffic management and disease control)
• Better Workforce Insights (employee engagement, performance, and retention)
• AI and Machine Learning (more accurate predictions)
• Advancements in Research (academia, healthcare, etc.)
Big Data Technologies
Big data technologies can be categorized into four main types:
• Data storage
• Data mining
• Data analytics
• Data visualization
1. Data storage
Data storage technologies fetch, store, and manage big data. Two commonly used tools
are Hadoop and MongoDB.
Hadoop:
• It is one of the most widely used big data tools.
• It is an open-source software platform that speeds up processing by distributing data
and computation across a cluster of machines.
• The framework is designed to tolerate faults and to process all data formats.
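Hadoop's processing layer is built around the MapReduce programming model. The sketch below imitates that model in plain Python on a single machine to show the map, shuffle, and reduce steps. It is an illustration only and does not use any Hadoop API; the function names and input chunks are made up for the example.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map step: emit a (word, 1) pair for every word in one input chunk."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: combine the values for one key into a single result."""
    return key, sum(values)


# Each string stands in for a block of a file stored on a different node.
chunks = ["big data needs storage", "big data needs processing"]

mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
print(counts)
```

In a real cluster the map and reduce steps run in parallel on many machines, and a failed task is rerun on another node, which is where the fault tolerance comes from.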
MongoDB:
• It is a NoSQL document database that stores large volumes of data as JSON-like
documents made up of field-value pairs.
• It is one of the most popular big data databases because it can manage and store
unstructured data.
2. Data mining
Data mining extracts useful patterns and trends from raw data. Big data technologies
such as RapidMiner and Presto can turn unstructured and structured data into usable
information.
RapidMiner:
• RapidMiner is a data mining tool that can be used to build predictive models.
• It is used for processing and preparing data, and for building machine learning and
deep learning models.
Presto:
• Presto is an open-source query engine originally developed by Facebook to run
analytic queries against its large datasets. It is now widely available.
• A single Presto query can combine data from multiple sources within an organization
and perform analytics on them.
3. Data analytics
In big data analytics, technologies are used to clean and transform data into information
that can drive business decisions. In this next step (after data mining), users apply
algorithms, models, and predictive analytics using tools such as Spark and Splunk.
Spark:
• Spark is a popular big data tool for data analysis because it is fast and efficient at
running applications.
• Spark supports a wide variety of data analytics tasks and queries.
Splunk:
• Splunk is another popular big data analytics tool for deriving insights from large
datasets. It can generate graphs, charts, reports, and dashboards.
• Splunk also enables users to incorporate artificial intelligence (AI) into data outcomes.
4. Data visualization
Finally, big data technologies can be used to create clear visualizations from the data. In
data-oriented roles, data visualization is a useful skill for presenting recommendations
on business profitability and operations to stakeholders, telling an impactful story with
a simple graph.
Tableau:
• Tableau is a very popular data visualization tool because its drag-and-drop interface
makes it easy to create pie charts, bar charts, box plots, Gantt charts, and more.
• It is a secure platform that allows users to share visualizations and dashboards in real
time.
Looker:
• Looker is a business intelligence (BI) tool used to make sense of big data analytics and
then share those insights with other teams.
• Charts, graphs, and dashboards can be configured with a query, such as monitoring
weekly brand engagement through social media analytics.
What kind of technologies are we looking toward to meet the challenges posed by big data?
1. The first requirement is cheap and abundant storage.
2. Fast processors for quick processing of big data.
3. Open source.
4. Advanced analysis.
5. Resource allocation arrangements.
Data Science
• Data science is the science of extracting knowledge from data.
• It is a science of drawing out hidden patterns amongst data using statistical and
mathematical techniques.
• It is a multidisciplinary approach that combines principles and practices from the fields
of mathematics, statistics, artificial intelligence, and computer engineering to analyze
large amounts of data.
• This analysis helps data scientists to ask and answer questions like what happened,
why it happened, what will happen, and what can be done with the results.
The basic business acumen skills required are:
1. Understanding of Domain
2. Business Strategy
3. Problem Solving
4. Communication
Responsibilities of Data Scientist
• Prepares and integrates large and varied datasets
• Applies business domain knowledge to provide context
• Models and analyzes data to comprehend and interpret relationships, patterns, and trends
• Communicates and presents the findings and results
In simple words, the responsibilities of a data scientist include:
• Data Management
• Applying Analytical Techniques
• Communicating with the Stakeholders
Soft State and Eventual Consistency
Soft state refers to a system design principle where the state of a system or its data is
allowed to change over time, even without direct user interaction.
Eventual consistency is a consistency model used in distributed systems where updates
to a data item are propagated asynchronously across nodes, so replicas may briefly
disagree but converge to the same value once updates stop.
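The behaviour described above can be sketched with a small in-memory simulation. This is a toy model under stated assumptions, not how any particular database works: the Replica and Cluster classes are hypothetical, writes go to the first replica, and conflicts are resolved by keeping the highest version number.

```python
from collections import deque


class Replica:
    """One node holding its own copy of the data."""

    def __init__(self, name):
        self.name = name
        self.store = {}  # key -> (version, value)

    def apply(self, key, version, value):
        # Keep the update only if it is newer than what this replica has.
        current = self.store.get(key)
        if current is None or version > current[0]:
            self.store[key] = (version, value)

    def read(self, key):
        entry = self.store.get(key)
        return entry[1] if entry else None


class Cluster:
    """Accepts writes on the first replica and replicates them later."""

    def __init__(self, names):
        self.replicas = [Replica(name) for name in names]
        self.pending = deque()  # updates not yet delivered to other replicas
        self.version = 0

    def write(self, key, value):
        self.version += 1
        primary, *others = self.replicas
        primary.apply(key, self.version, value)
        for replica in others:
            self.pending.append((replica, key, self.version, value))

    def propagate(self):
        # Deliver every queued update; this stands in for asynchronous replication.
        while self.pending:
            replica, key, version, value = self.pending.popleft()
            replica.apply(key, version, value)

    def reads(self, key):
        return [replica.read(key) for replica in self.replicas]


cluster = Cluster(["a", "b", "c"])
cluster.write("cart", ["book"])
before = cluster.reads("cart")  # replicas disagree until propagation runs
cluster.propagate()
after = cluster.reads("cart")   # all replicas now return the same value
print(before, after)
```

Between the write and the call to propagate, a read from replica "b" or "c" returns stale data. That temporary disagreement is the price paid for accepting writes without waiting for every node.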
Role / Elements of Big Data Ecosystem
The elements of the big data ecosystem include:
1. Sensing
2. Collection
3. Wrangling
4. Analysis
5. Storage
1. Sensing
Sensing refers to the process of identifying data sources for your project and evaluating
the quality of their data.
This evaluation includes asking such questions as:
• Is the data accurate?
• Is the data recent and up to date?
• Is the data complete?
• Is the data valid? Can it be trusted?
Key pieces of the data ecosystem leveraged in this stage include:
• Internal data sources: Spreadsheets and other resources that originate from within
your organization.
• External data sources: Databases, spreadsheets, and websites that originate from
outside your organization.
• Software: Custom software that exists for the sole purpose of data sensing.
• Algorithms: A set of steps or rules that automates the process of evaluating data for
accuracy and completeness before it’s used.
2. Collection
Once a potential data source has been identified, data must be collected. Data collection
can be completed through manual or automated processes.
Key pieces of the data ecosystem leveraged in this stage include:
• Various programming languages: These include R, Python, SQL, and JavaScript.
• Code packages and libraries: Existing code that’s been written and tested, which allows
data scientists to write programs more quickly and efficiently.
• APIs (application programming interfaces): Software interfaces designed to interact
with other applications and extract data.
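As a rough illustration of automated collection, the Python sketch below gathers records from two sources and puts them in one common shape. Both inputs are in-memory strings standing in for a spreadsheet export and an API response; the field names and the "orders" key are assumptions for the example, and only standard-library modules are used.

```python
import csv
import io
import json

# Stand-ins for real sources: a CSV export and the JSON body of an API response.
csv_export = "order_id,amount\n1001,25.50\n1002,13.00\n"
api_response = '{"orders": [{"order_id": 1003, "amount": 8.75}]}'


def normalize(order_id, amount):
    """Give every record the same fields and types, whatever its source."""
    return {"order_id": int(order_id), "amount": float(amount)}


def collect_from_csv(text):
    """Read rows from CSV text using the header line as field names."""
    reader = csv.DictReader(io.StringIO(text))
    return [normalize(row["order_id"], row["amount"]) for row in reader]


def collect_from_json(text):
    """Read records from a JSON payload that lists them under 'orders'."""
    payload = json.loads(text)
    return [normalize(o["order_id"], o["amount"]) for o in payload["orders"]]


orders = collect_from_csv(csv_export) + collect_from_json(api_response)
print(orders)
```

In practice the strings would come from a file or an HTTP request. Keeping the parsing in small functions makes it easy to add another source later.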
3. Wrangling
• Data wrangling is a set of processes designed to transform raw data into a more usable
format.
• Depending on the quality of the data in question, it may involve merging multiple
datasets, identifying and filling gaps in data, deleting unnecessary or incorrect data,
and “cleaning” and structuring data for future analysis.
Key pieces of the data ecosystem leveraged in this stage include:
• Algorithms: A series of steps or rules to be followed to solve a problem.
• Various programming languages: These include R, Python, SQL, and JavaScript, and
can be used to write algorithms.
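The wrangling steps named above (deleting incorrect data, filling gaps, and merging datasets) can be sketched in a few lines of Python. The two datasets, the rule that a negative amount is incorrect, and the "Unknown" default are all assumptions chosen for illustration.

```python
# Hypothetical raw datasets with typical problems: a missing city
# and an order with an impossible (negative) amount.
customers = [
    {"customer_id": 1, "name": "Asha", "city": "Pune"},
    {"customer_id": 2, "name": "Ben", "city": None},
    {"customer_id": 3, "name": "Chen", "city": "Delhi"},
]
orders = [
    {"customer_id": 1, "amount": 120.0},
    {"customer_id": 2, "amount": -5.0},
    {"customer_id": 2, "amount": 40.0},
    {"customer_id": 3, "amount": 75.0},
]


def wrangle(customers, orders, default_city="Unknown"):
    """Return one cleaned, merged row per valid order."""
    # Delete incorrect data: drop orders with a negative amount.
    valid_orders = [o for o in orders if o["amount"] >= 0]

    # Fill gaps: replace a missing city with a default value.
    by_id = {
        c["customer_id"]: {**c, "city": c["city"] or default_city}
        for c in customers
    }

    # Merge datasets: attach customer details to each remaining order.
    return [
        {**by_id[o["customer_id"]], "amount": o["amount"]}
        for o in valid_orders
        if o["customer_id"] in by_id
    ]


clean = wrangle(customers, orders)
for row in clean:
    print(row)
```

The input lists are left unchanged and a new list is returned, so the raw data stays available if a cleaning rule turns out to be wrong.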
4. Analysis
• After raw data has been inspected and transformed into a readily usable state, it can
be analyzed.
Key pieces of the data ecosystem leveraged in this stage include:
• Algorithms: A series of steps or rules to be followed to solve a problem.
• Various programming languages: These include R, Python, SQL, and JavaScript, and
can be used to write algorithms.
5. Storage
• Throughout all of the data life cycle stages, data must be stored in a way that’s both
secure and accessible.
Key pieces of the data ecosystem leveraged in this stage include:
• Cloud-based storage solutions: These allow an organization to store data off-site and
access it remotely.
• On-site servers: These give organizations a greater sense of control over how data is
stored and used.
• Other storage media: These include hard drives, USB devices, CD-ROMs, and floppy
disks.


  • 1. Big Data Analytics Module 1 Introduction to Big Data
  • 2. Data ? Data is a set of values that represent a concept or concepts. It can be raw information, such as numbers or text, or it can be more complex, such as images, graphics, or videos.
  • 3. Characteristics of Data Composition: deals with structure of data, that is, the sources of data, the types, and the nature of the data as to whether it is static or real-time streaming. Condition: The condition of data deals with the state of the data that is “can one use this data as is for analysis?” or “Does it require cleansing for further enhancement and enrichment?” Context: deals with “Where has this data been generated?”, “Why was this data generated?” and so on. In simple terms, characteristics of data includes ? Accuracy ? Completeness ? Consistency ? Timeliness ? Validity ? Uniqueness
  • 4. Characteristics of Big Data The characteristics of big data includes,
  • 5. Evolution of Big Data ? 1970s and before – Mainframe: Basic Data Storage, Data has a structure. ? 1980s and 1990s – Relational Databases: It has a structure and relationship of the data. ? 2000s and beyond – Structured, Unstructured and Multimedia data in the form of WWW. There are a lot of milestones in the evolution of Big Data which are described below: Data Warehousing: In the 1990s, data warehousing emerged as a solution to store and analyze large volumes of structured data. Hadoop: Hadoop was introduced in 2006 by Doug Cutting and Mike Cafarella. Distributed storage medium and large data processing are provided by Hadoop, and it is an open-source framework.
  • 6. Evolution of Big Data NoSQL Databases: In 2009, NoSQL databases were introduced, which provide a flexible way to store and retrieve unstructured data. At present, technologies like cloud computing, machine learning are widely used by companies for reducing the maintenance cost and infrastructure cost and also to get the proper insights from the big data effectively.
  • 7. Challenges with Big Data ? Data Volume: Managing and Storing Massive Amounts of Data ? Data Variety: Handling Diverse Data Types ? Data Velocity: Processing Data in Real-Time ? Data Veracity: Ensuring Data Quality and Accuracy ? Data Security and Privacy: Protecting Sensitive Information ? Data Integration: Combining Data from Multiple Sources ? Data Analytics: Extracting Valuable Insights ? Data Governance: Establishing Policies and Standards
  • 11. Importance of Big Data ? Enhanced Decision-Making (vast amounts of data, discovering new patterns and trends) ? Understanding Consumer Behavior (for recommendations) ? Competitive Advantage (Competitor analysis, market trends) ? Innovation and New Opportunities (reveals gaps in existing products or services) ? Efficiency and Cost Reduction (optimize processes for reducing waste and improve resource allocation) ? Improved Risk Management (advanced modelling and simulation) ? Enhanced Public Services (traffic management and disease control) ? Better Workforce Insights (employee engagement, performance, and retention) ? AI and Machine Learning (predict accurately) ? Advancements in Research (academics, healthcare etc.,)
  • 12. Big Data Technologies Big data technologies can be categorized into four main types: ? Data storage, ? Data mining, ? Data analytics and ? Data visualization.
  • 13. Big Data Technologies 1. Data Storage: Big data technology that deals with data storage has the capability to fetch, store, and manage big data. Two commonly used tools are Hadoop and MongoDB. Hadoop: ? It is the most widely used big data tool. ? It is an open-source software platform which allows for faster data processing. ? The framework is designed to reduce bugs or faults and process all data formats. MongoDB: ? It is a NoSQL database that can be used to store large volumes of data using key-value pairs. ? It is a most popular big data databases because it can manage and store unstructured data.
  • 14. Big Data Technologies 2. Data mining Data mining extracts the useful patterns and trends from the raw data. Big data technologies such as Rapidminer and Presto can turn unstructured and structured data into usable information. Rapidminer: ? Rapidminer is a data mining tool that can be used to build predictive models. ? It is used for processing and preparing data, and building machine and deep learning models. Presto: ? Presto is an open-source query engine that was originally developed by Facebook to run analytic queries against their large datasets. Now, it is available widely. ? One query on Presto can combine data from multiple sources within an organization and perform analytics on them.
  • 15. Big Data Technologies
    3. Data analytics
    In big data analytics, technologies are used to clean and transform data into information that can drive business decisions. In this step, which follows data mining, users run algorithms, build models, and perform predictive analytics using tools such as Spark and Splunk.
    Spark:
    - Spark is a popular big data tool for data analysis because it runs applications quickly and efficiently, largely by processing data in memory.
    - Spark supports a wide variety of data analytics tasks and queries.
    Splunk:
    - Splunk is another popular big data analytics tool for deriving insights from large datasets. It can generate graphs, charts, reports, and dashboards.
    - Splunk also enables users to incorporate artificial intelligence (AI) into data outcomes.
  • 16. Big Data Technologies
    4. Data visualization
    Finally, big data technologies can be used to create effective visualizations from the data. In data-oriented roles, data visualization is a valuable skill for presenting recommendations on business profitability and operations to stakeholders, and for telling an impactful story with a simple graph.
    Tableau:
    - Tableau is a very popular data visualization tool because its drag-and-drop interface makes it easy to create pie charts, bar charts, box plots, Gantt charts, and more.
    - It is a secure platform that allows users to share visualizations and dashboards in real time.
    Looker:
    - Looker is a business intelligence (BI) tool used to make sense of big data analytics and then share those insights with other teams.
    - Charts, graphs, and dashboards can be configured with a query, for example to monitor weekly brand engagement through social media analytics.
  • 17. What kinds of technologies are we looking toward to meet the challenges posed by big data?
    1. Cheap and abundant storage.
    2. Fast processors for quick processing of big data.
    3. Open-source technologies.
    4. Advanced analysis.
    5. Resource allocation arrangements.
  • 18. Data Science
    - Data science is the science of extracting knowledge from data.
    - It draws out hidden patterns in data using statistical and mathematical techniques.
    - It is a multidisciplinary approach that combines principles and practices from mathematics, statistics, artificial intelligence, and computer engineering to analyze large amounts of data.
    - This analysis helps data scientists ask and answer questions such as what happened, why it happened, what will happen, and what can be done with the results.
    The basic business acumen skills required are:
    1. Understanding of the domain
    2. Business strategy
    3. Problem solving
    4. Communication
  • 19. Responsibilities of a Data Scientist
    - Prepares and integrates large and varied datasets
    - Applies business domain knowledge to provide context
    - Models and analyzes data to comprehend and interpret relationships, patterns, and trends
    - Communicates and presents the findings and results
    In simple words, the responsibilities of a data scientist include:
    - Data management
    - Applying analytical techniques
    - Communicating with the stakeholders
  • 21. Soft State and Eventual Consistency
    - Soft state: a system design principle in which the state of a system or its data is allowed to change over time, even without direct user interaction.
    - Eventual consistency: a consistency model used in distributed systems in which updates to a data item are propagated asynchronously across nodes. Replicas may briefly disagree, but they converge to the same value once updates stop arriving.
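The asynchronous propagation described above can be illustrated with a small in-memory simulation. This is only a sketch under simplifying assumptions: the `Replica` and `Cluster` classes are hypothetical names invented for the example, delivery is triggered manually rather than by a network, and there is no conflict resolution (the last delivered update wins).

```python
from collections import deque


class Replica:
    """One node holding its own copy of a key-value store."""

    def __init__(self, name):
        self.name = name
        self.data = {}


class Cluster:
    """Toy cluster: a write lands on one replica and reaches the others later."""

    def __init__(self, names):
        self.replicas = {n: Replica(n) for n in names}
        self.pending = deque()  # updates waiting to be delivered

    def write(self, node, key, value):
        # Apply locally right away, then queue the update for every other node.
        self.replicas[node].data[key] = value
        for other in self.replicas:
            if other != node:
                self.pending.append((other, key, value))

    def read(self, node, key):
        return self.replicas[node].data.get(key)

    def propagate(self):
        # Deliver all queued updates; a real system does this in the background.
        while self.pending:
            node, key, value = self.pending.popleft()
            self.replicas[node].data[key] = value


cluster = Cluster(["a", "b", "c"])
cluster.write("a", "stock", 5)
stale = cluster.read("b", "stock")   # replica "b" has not seen the update yet
cluster.propagate()
fresh = cluster.read("b", "stock")   # after delivery, the replicas agree
```

Between `write` and `propagate`, a read from another replica returns old data; that window is the cost of accepting writes without waiting for every node.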
  • 22. Role / Elements of Big Data Ecosystem
    The elements of the big data ecosystem include:
    1. Sensing
    2. Collection
    3. Wrangling
    4. Analysis
    5. Storage
  • 23. Role / Elements of Big Data Ecosystem
    1. Sensing
    Sensing refers to the process of identifying and evaluating data sources for your project. This evaluation includes asking questions such as:
    - Is the data accurate?
    - Is the data recent and up to date?
    - Is the data complete?
    - Is the data valid?
    - Can it be trusted?
    Key pieces of the data ecosystem leveraged in this stage include:
    - Internal data sources: Spreadsheets and other resources that originate from within the organization.
    - External data sources: Databases, spreadsheets, and websites that originate from outside the organization.
    - Software: Custom software that exists for the sole purpose of data sensing.
    - Algorithms: A set of steps or rules that automates the process of evaluating data for accuracy and completeness before it is used.
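The kind of algorithm mentioned above, one that evaluates a source before it is used, might look like the following minimal sketch. The field names in `REQUIRED_FIELDS`, the 30-day freshness threshold, the rule that amounts must be non-negative, and the sample records are all assumptions made for illustration.

```python
from datetime import date

REQUIRED_FIELDS = ("id", "amount", "recorded_on")  # hypothetical schema


def evaluate_source(records, today, max_age_days=30):
    """Return the share of records that are complete, valid, and recent."""
    total = len(records)
    if total == 0:
        return {"completeness": 0.0, "validity": 0.0, "recency": 0.0}
    # Complete: every required field is present and not None.
    complete = [r for r in records
                if all(r.get(f) is not None for f in REQUIRED_FIELDS)]
    # Valid (assumed rule): the amount is a non-negative number.
    valid = [r for r in complete
             if isinstance(r["amount"], (int, float)) and r["amount"] >= 0]
    # Recent: recorded within the allowed number of days.
    recent = [r for r in complete
              if (today - r["recorded_on"]).days <= max_age_days]
    return {
        "completeness": len(complete) / total,
        "validity": len(valid) / total,
        "recency": len(recent) / total,
    }


sample = [
    {"id": 1, "amount": 20.0, "recorded_on": date(2024, 5, 1)},
    {"id": 2, "amount": None, "recorded_on": date(2024, 5, 2)},   # incomplete
    {"id": 3, "amount": -4.0, "recorded_on": date(2024, 5, 3)},   # invalid amount
    {"id": 4, "amount": 7.5, "recorded_on": date(2023, 1, 1)},    # stale
]
report = evaluate_source(sample, today=date(2024, 5, 10))
```

Low scores in `report` would suggest that the source needs cleansing, or should not be used at all.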
  • 24. Role / Elements of Big Data Ecosystem
    2. Collection
    Once a potential data source has been identified, data must be collected. Data collection can be completed through manual or automated processes.
    Key pieces of the data ecosystem leveraged in this stage include:
    - Various programming languages: These include R, Python, SQL, and JavaScript.
    - Code packages and libraries: Existing code that has been written and tested, allowing data scientists to write programs more quickly and efficiently.
    - APIs (Application Programming Interfaces): Software interfaces that let programs interact with other applications and extract data.
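As a rough sketch of automated collection through an API, the snippet below parses one page of a JSON response into records. The payload layout (`results`, `next_page`) and the field names are hypothetical, and the string stands in for a response body that a real collector would fetch over HTTP.

```python
import json

# Stand-in for the body of an API response (hypothetical layout and fields).
payload = '{"results": [{"id": 1, "likes": 10}, {"id": 2, "likes": 4}], "next_page": null}'


def collect(raw):
    """Parse one response page; return its records and the next-page marker."""
    body = json.loads(raw)
    records = [{"id": item["id"], "likes": item["likes"]}
               for item in body["results"]]
    return records, body["next_page"]


records, next_page = collect(payload)
```

A collector would typically keep requesting pages until `next_page` comes back empty.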
  • 25. Role / Elements of Big Data Ecosystem
    3. Wrangling
    - Data wrangling is a set of processes designed to transform raw data into a more usable format.
    - Depending on the quality of the data in question, it may involve merging multiple datasets, identifying and filling gaps in data, deleting unnecessary or incorrect data, and “cleaning” and structuring data for future analysis.
    Key pieces of the data ecosystem leveraged in this stage include:
    - Algorithms: A series of steps or rules to be followed to solve a problem.
    - Various programming languages: These include R, Python, SQL, and JavaScript, and can be used to write algorithms.
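The wrangling steps listed above (deleting incorrect data, filling gaps, merging datasets) can be sketched with the standard library alone. The sample datasets, the rule that a negative amount is incorrect, and the choice to fill gaps with the mean are assumptions for illustration; other fill strategies may suit real data better.

```python
from statistics import mean

# Made-up sample data for the sketch.
customers = [
    {"customer_id": 1, "city": "Pune"},
    {"customer_id": 2, "city": "Delhi"},
]
orders = [
    {"customer_id": 1, "amount": 120.0},
    {"customer_id": 2, "amount": None},    # gap to fill
    {"customer_id": 1, "amount": -50.0},   # incorrect value to delete
    {"customer_id": 2, "amount": 80.0},
]


def wrangle(customers, orders):
    # Delete incorrect rows (assumed rule: amounts cannot be negative).
    kept = [o for o in orders if o["amount"] is None or o["amount"] >= 0]
    # Fill gaps with the mean of the known amounts.
    known = [o["amount"] for o in kept if o["amount"] is not None]
    fill = mean(known) if known else 0.0
    # Merge the two datasets on customer_id.
    city_by_id = {c["customer_id"]: c["city"] for c in customers}
    return [
        {
            "customer_id": o["customer_id"],
            "city": city_by_id.get(o["customer_id"]),
            "amount": fill if o["amount"] is None else o["amount"],
        }
        for o in kept
    ]


clean = wrangle(customers, orders)
```

Incorrect rows are removed before the mean is computed so that they do not distort the fill value.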
  • 26. Role / Elements of Big Data Ecosystem
    4. Analysis
    - After raw data has been inspected and transformed into a readily usable state, it can be analyzed.
    Key pieces of the data ecosystem leveraged in this stage include:
    - Algorithms: A series of steps or rules to be followed to solve a problem.
    - Various programming languages: These include R, Python, SQL, and JavaScript, and can be used to write algorithms.
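Once data is in a usable state, even a very simple algorithm can surface a pattern. The sketch below groups records and computes summary statistics per group; the `sales` rows and region names are invented for the example.

```python
from collections import defaultdict
from statistics import mean, median

# Invented, already-cleaned sample rows: (region, sale amount).
sales = [
    ("north", 120.0),
    ("south", 80.0),
    ("north", 100.0),
    ("south", 60.0),
    ("north", 140.0),
]


def summarize(rows):
    """Group amounts by region and compute count, mean, and median."""
    groups = defaultdict(list)
    for region, amount in rows:
        groups[region].append(amount)
    return {
        region: {"count": len(values), "mean": mean(values), "median": median(values)}
        for region, values in groups.items()
    }


summary = summarize(sales)
```

Comparing the per-region figures in `summary` is a basic form of the "what happened" question that analysis sets out to answer.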
  • 27. Role / Elements of Big Data Ecosystem
    5. Storage
    - Throughout all of the data life cycle stages, data must be stored in a way that is both secure and accessible.
    Key pieces of the data ecosystem leveraged in this stage include:
    - Cloud-based storage solutions: These allow an organization to store data off-site and access it remotely.
    - On-site servers: These give organizations a greater sense of control over how data is stored and used.
    - Other storage media: These include hard drives, USB devices, CD-ROMs, and floppy disks.