
Big Data Analytics
Module 1
Introduction to Big Data

Data
• Data is a set of values that represent a concept or concepts. It can be raw
information, such as numbers or text, or it can be more complex, such as images,
graphics, or videos.

Characteristics of Data
Composition: deals with the structure of data, that is, the sources of data, the types, and
the nature of the data, such as whether it is static or real-time streaming.
Condition: deals with the state of the data, that is, “Can one use this data as is for
analysis?” or “Does it require cleansing for further enhancement and enrichment?”
Context: deals with “Where has this data been generated?”, “Why was this data
generated?” and so on.
In simple terms, the characteristics of data include:
• Accuracy
• Completeness
• Consistency
• Timeliness
• Validity
• Uniqueness
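
Some of these characteristics can be measured directly. The sketch below is a minimal illustration, assuming a small hypothetical list of customer records; the field names and the age rule are made up for the example. It expresses completeness, uniqueness, and validity as simple ratios between 0 and 1.

```python
# Hypothetical records; None marks a missing value.
records = [
    {"id": 1, "email": "a@example.com", "age": 34},
    {"id": 2, "email": None, "age": 29},
    {"id": 2, "email": "b@example.com", "age": -5},
    {"id": 3, "email": "c@example.com", "age": 41},
]


def completeness(rows, field):
    """Share of rows where the field is present (not None)."""
    return sum(r[field] is not None for r in rows) / len(rows)


def uniqueness(rows, field):
    """Share of distinct values among all values of the field."""
    values = [r[field] for r in rows]
    return len(set(values)) / len(values)


def validity(rows, field, rule):
    """Share of rows whose value passes an assumed business rule."""
    return sum(rule(r[field]) for r in rows) / len(rows)


email_completeness = completeness(records, "email")  # 3 of 4 emails present
id_uniqueness = uniqueness(records, "id")            # 3 distinct ids out of 4
age_validity = validity(records, "age", lambda a: 0 <= a <= 120)
```

Scores well below 1 suggest the data needs cleansing before analysis.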

Characteristics of Big Data
The characteristics of big data include volume, variety, velocity, and veracity (each is
discussed under Challenges with Big Data).

Evolution of Big Data
• 1970s and before – Mainframe: basic data storage; data has a structure.
• 1980s and 1990s – Relational databases: data has a structure and relationships between
data items.
• 2000s and beyond – Structured, unstructured and multimedia data, driven by the World
Wide Web (WWW).
Some of the milestones in the evolution of Big Data are described below:
Data Warehousing:
In the 1990s, data warehousing emerged as a solution to store and analyze large
volumes of structured data.
Hadoop:
Hadoop was introduced in 2006 by Doug Cutting and Mike Cafarella. It is an open-source
framework that provides distributed storage and large-scale data processing.

NoSQL Databases:
Around 2009, NoSQL databases came into wide use; they provide a flexible way to store and
retrieve unstructured data.
At present, companies widely use technologies such as cloud computing and machine learning
to reduce maintenance and infrastructure costs and to draw useful insights from big data.

Challenges with Big Data
• Data Volume: Managing and Storing Massive Amounts of Data
• Data Variety: Handling Diverse Data Types
• Data Velocity: Processing Data in Real-Time
• Data Veracity: Ensuring Data Quality and Accuracy
• Data Security and Privacy: Protecting Sensitive Information
• Data Integration: Combining Data from Multiple Sources
• Data Analytics: Extracting Valuable Insights
• Data Governance: Establishing Policies and Standards

Traditional Business Intelligence versus Big Data

Importance of Big Data
• Enhanced Decision-Making (analyzing vast amounts of data to discover new patterns and
trends)
• Understanding Consumer Behavior (for recommendations)
• Competitive Advantage (competitor analysis, market trends)
• Innovation and New Opportunities (reveals gaps in existing products or services)
• Efficiency and Cost Reduction (optimizing processes to reduce waste and improve
resource allocation)
• Improved Risk Management (advanced modelling and simulation)
• Enhanced Public Services (traffic management and disease control)
• Better Workforce Insights (employee engagement, performance, and retention)
• AI and Machine Learning (more accurate predictions)
• Advancements in Research (academia, healthcare, etc.)

Big Data Technologies
Big data technologies can be categorized into four main types:
• Data storage
• Data mining
• Data analytics
• Data visualization

1. Data Storage:
Big data technology that deals with data storage has the capability to fetch, store, and
manage big data. Two commonly used tools are Hadoop and MongoDB.
Hadoop:
• It is one of the most widely used big data tools.
• It is an open-source software platform that speeds up data processing by spreading
storage and computation across a cluster of machines.
• The framework is designed to tolerate faults and can process structured,
semi-structured and unstructured data.
MongoDB:
• It is a NoSQL database that stores large volumes of data as JSON-like documents made up
of field-value pairs.
• It is one of the most popular big data databases because it can manage and store
unstructured data.
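
To make the idea of schema-flexible storage concrete, here is a minimal sketch in plain Python. It is not MongoDB's API: the `collection` list and the `find` helper are hypothetical stand-ins that only mimic storing records as field-value pairs and querying them by example.

```python
# Each record is a set of field-value pairs; records in the same
# collection do not need to share the same fields.
collection = [
    {"_id": 1, "type": "tweet", "text": "big data!", "likes": 12},
    {"_id": 2, "type": "image", "url": "img/cat.png", "tags": ["cat"]},
    {"_id": 3, "type": "tweet", "text": "hello", "likes": 3},
]


def find(docs, query):
    """Return records whose fields equal every field-value pair in query."""
    return [
        d for d in docs
        if all(k in d and d[k] == v for k, v in query.items())
    ]


tweets = find(collection, {"type": "tweet"})
tweet_ids = [d["_id"] for d in tweets]
```

Because no fixed schema is enforced, the image record can carry `url` and `tags` fields that the tweet records lack.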

2. Data mining
Data mining extracts useful patterns and trends from raw data. Big data
technologies such as RapidMiner and Presto can turn unstructured and structured data
into usable information.
RapidMiner:
• RapidMiner is a data mining tool that can be used to build predictive models.
• It is used for processing and preparing data, and for building machine learning and
deep learning models.
Presto:
• Presto is an open-source query engine that was originally developed by Facebook to
run analytic queries against its large datasets. It is now widely available.
• A single Presto query can combine data from multiple sources within an organization
and perform analytics on them.
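
The idea of one query spanning several sources can be illustrated without a Presto cluster. The sketch below is an analogy, not Presto itself: it uses Python's built-in `sqlite3` module and attaches a second in-memory database so that a single SQL statement joins tables living in two separate databases. The database alias `crm` and the table and column names are hypothetical.

```python
import sqlite3

# The main database plays the role of a sales system, and the attached
# one plays the role of a separate customer system.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS crm")

conn.execute("CREATE TABLE main.orders (customer_id INTEGER, amount REAL)")
conn.execute("CREATE TABLE crm.customers (id INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO main.orders VALUES (?, ?)",
    [(1, 120.0), (2, 35.5), (1, 80.0)],
)
conn.executemany(
    "INSERT INTO crm.customers VALUES (?, ?)",
    [(1, "Asha"), (2, "Ravi")],
)

# One query that reads from both sources and aggregates the result.
rows = conn.execute(
    """
    SELECT c.name, SUM(o.amount) AS total
    FROM main.orders AS o
    JOIN crm.customers AS c ON c.id = o.customer_id
    GROUP BY c.name
    ORDER BY c.name
    """
).fetchall()
conn.close()
```

A federated engine applies the same principle across different storage systems rather than two databases of the same kind.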

3. Data analytics
In big data analytics, technologies are used to clean and transform data into information
that can be used to drive business decisions. This next step (after data mining) is where
users apply algorithms, models, and predictive analytics using tools such as Spark and
Splunk.
Spark:
• Spark is a popular big data tool for data analysis because it is fast and efficient at
running applications.
• Spark supports a wide variety of data analytics tasks and queries.
Splunk:
• Splunk is another popular big data analytics tool for deriving insights from large
datasets. It can generate graphs, charts, reports, and dashboards.
• Splunk also enables users to incorporate artificial intelligence (AI) into their analyses.
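
As a small illustration of what predictive analytics means at its simplest, the sketch below fits a straight-line trend to a handful of data points with ordinary least squares and extrapolates one step ahead. It is plain Python, not Spark or Splunk, and the weekly sales figures are invented for the example.

```python
# Hypothetical weekly sales figures.
weeks = [1, 2, 3, 4, 5]
sales = [10.0, 12.0, 14.0, 16.0, 18.0]


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


slope, intercept = fit_line(weeks, sales)
forecast_week_6 = slope * 6 + intercept  # predicted sales for the next week
```

Big data tools run this kind of model fitting over datasets far too large for one machine, but the underlying step of learning parameters from past data and applying them to new inputs is the same.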

4. Data visualization
Finally, big data technologies can be used to create clear visualizations from the data. In
data-oriented roles, data visualization is a useful skill for presenting recommendations
on business profitability and operations to stakeholders, telling an impactful story
with a simple graph.
Tableau:
• Tableau is a very popular data visualization tool because its drag-and-drop interface
makes it easy to create pie charts, bar charts, box plots, Gantt charts, and more.
• It is a secure platform that allows users to share visualizations and dashboards in real
time.
Looker:
• Looker is a business intelligence (BI) tool used to make sense of big data analytics and
then share those insights with other teams.
• Charts, graphs, and dashboards can be configured with a query, for example to monitor
weekly brand engagement through social media analytics.
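
Behind a chart such as weekly brand engagement sits an aggregation query. The sketch below, which assumes a hypothetical list of dated engagement events, groups interactions by ISO week and prints a rough text bar chart. BI tools do the same grouping and then render the result graphically.

```python
from collections import Counter
from datetime import date

# Hypothetical engagement events: (day, number of interactions).
events = [
    (date(2024, 1, 1), 40),
    (date(2024, 1, 3), 25),
    (date(2024, 1, 9), 60),
    (date(2024, 1, 16), 30),
]

# Sum interactions per (ISO year, ISO week).
weekly = Counter()
for day, interactions in events:
    year, week, _ = day.isocalendar()
    weekly[(year, week)] += interactions

# One bar per week; each '#' stands for five interactions.
for (year, week), total in sorted(weekly.items()):
    print(f"{year}-W{week:02d} {'#' * (total // 5)} {total}")
```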

What kinds of technologies are we looking toward to meet the challenges posed by big
data?
1. Cheap and abundant storage.
2. Fast processors for quick processing of big data.
3. Open-source software.
4. Advanced analysis.
5. Resource allocation arrangements.

Data Science
• Data science is the science of extracting knowledge from data.
• It is the science of drawing out hidden patterns in data using statistical and
mathematical techniques.
• It is a multidisciplinary approach that combines principles and practices from the fields
of mathematics, statistics, artificial intelligence, and computer engineering to analyze
large amounts of data.
• This analysis helps data scientists to ask and answer questions such as what happened,
why it happened, what will happen, and what can be done with the results.
The basic business acumen skills required are:
1. Understanding of Domain
2. Business Strategy
3. Problem Solving
4. Communication

Responsibilities of Data Scientist
• Prepares and integrates large and varied datasets
• Applies business domain knowledge to provide context
• Models and analyses data to comprehend and interpret relationships, patterns and trends
• Communicates and presents the findings and results
In simple words, the responsibilities of a data scientist include:
• Data Management
• Applying Analytical Techniques
• Communicating with the Stakeholders


Soft State and Eventual Consistency
Soft state is a system design principle in which the state of a system or its data is
allowed to change over time, even without direct user interaction.
Eventual consistency is a consistency model used in distributed systems in which updates
to a data item are propagated asynchronously across nodes, so replicas may differ for a
while but converge to the same value once updates stop.
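
Asynchronous propagation can be simulated in a few lines. The sketch below is a toy model, not a real distributed database. The `Replica` class, the inbox queue, and the last-writer-wins rule based on a version number are all assumptions made for illustration.

```python
from collections import deque


class Replica:
    """A node holding one data item as a (version, value) pair."""

    def __init__(self, name):
        self.name = name
        self.version, self.value = 0, None
        self.inbox = deque()  # updates received but not yet applied

    def apply_pending(self):
        # Keep only the newest version seen (assumed last-writer-wins policy).
        while self.inbox:
            version, value = self.inbox.popleft()
            if version > self.version:
                self.version, self.value = version, value


replicas = [Replica("a"), Replica("b"), Replica("c")]

# A client writes to replica "a"; the update is queued for the others
# instead of being applied to them immediately.
replicas[0].version, replicas[0].value = 1, "blue"
for other in replicas[1:]:
    other.inbox.append((1, "blue"))

before = [r.value for r in replicas]  # replicas disagree for a while
for r in replicas:
    r.apply_pending()                 # propagation catches up
after = [r.value for r in replicas]   # all replicas hold the same value
```

Between the write and the propagation step, a read from replica "b" or "c" returns stale data. That temporary disagreement is what the model accepts in exchange for not blocking the write.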

Role / Elements of Big Data Ecosystem
The elements of the big data ecosystem include:
1. Sensing
2. Collection
3. Wrangling
4. Analysis
5. Storage

1. Sensing
Sensing refers to the process of identifying and evaluating data sources for your project.
This evaluation includes asking such questions as:
• Is the data accurate?
• Is the data recent and up to date?
• Is the data complete? Is the data valid? Can it be trusted?
Key pieces of the data ecosystem leveraged in this stage include:
• Internal data sources: Spreadsheets and other resources that originate from within
your organization.
• External data sources: Databases, spreadsheets, and websites that originate from outside
your organization.
• Software: Custom software that exists for the sole purpose of data sensing.
• Algorithms: A set of steps or rules that automates the process of evaluating data for
accuracy and completeness before it is used.

2. Collection
Once a potential data source has been identified, data must be collected. Data collection
can be completed through manual or automated processes.
Key pieces of the data ecosystem leveraged in this stage include:
• Various programming languages: These include R, Python, SQL, and JavaScript.
• Code packages and libraries: Existing code that has been written and tested and allows
data scientists to write programs more quickly and efficiently.
• APIs (Application Programming Interfaces): Interfaces that let programs interact
with other applications and extract data.
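
An automated collection step often amounts to reading several formats into one common shape. The sketch below assumes two hypothetical inputs, a CSV export and a JSON body of the kind an API might return. Both are inlined as strings so the example runs without files or network access.

```python
import csv
import io
import json

# Stand-ins for a downloaded CSV file and an API response body.
csv_text = "id,city\n1,Pune\n2,Delhi\n"
api_body = '{"results": [{"id": 3, "city": "Mumbai"}]}'

# Parse each source with the matching standard-library module.
rows = list(csv.DictReader(io.StringIO(csv_text)))
rows += json.loads(api_body)["results"]

# CSV values arrive as strings, so normalise ids to integers.
collected = [{"id": int(r["id"]), "city": r["city"]} for r in rows]
```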

3. Wrangling
• Data wrangling is a set of processes designed to transform raw data into a more usable
format.
• Depending on the quality of the data in question, it may involve merging multiple
datasets, identifying and filling gaps in data, deleting unnecessary or incorrect data,
and “cleaning” and structuring data for future analysis.
• Algorithms: A series of steps or rules to be followed to solve a problem.
• Various programming languages: These include R, Python, SQL, and JavaScript, and
can be used to write algorithms.
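
The wrangling operations named above (merging datasets, deleting incorrect data, filling gaps) can be sketched as follows. The two datasets, the rule that negative spend is an error, and the choice to fill gaps with the mean are all assumptions for the example. Real projects pick these rules per dataset.

```python
from statistics import mean

# Two hypothetical raw datasets keyed by customer id.
profiles = [{"id": 1, "city": "Pune"}, {"id": 2, "city": "Delhi"},
            {"id": 3, "city": "Mumbai"}]
purchases = [{"id": 1, "spend": 200.0}, {"id": 2, "spend": None},
             {"id": 3, "spend": -50.0}]

# Merge: join the two datasets on the shared id.
spend_by_id = {p["id"]: p["spend"] for p in purchases}
merged = [{**p, "spend": spend_by_id.get(p["id"])} for p in profiles]

# Delete incorrect data: negative spend is treated as an error here.
merged = [r for r in merged if r["spend"] is None or r["spend"] >= 0]

# Fill gaps: replace missing spend with the mean of the known values.
known = [r["spend"] for r in merged if r["spend"] is not None]
fill = mean(known)
cleaned = [{**r, "spend": fill if r["spend"] is None else r["spend"]}
           for r in merged]
```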

4. Analysis
• After raw data has been inspected and transformed into a readily usable state, it can
be analyzed.
• Algorithms: A series of steps or rules to be followed to solve a problem.
• Various programming languages: These include R, Python, SQL, and JavaScript, and
can be used to write algorithms.
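
Once data is in a usable state, a first pass at analysis is often descriptive. The sketch below uses Python's `statistics` module on a small hypothetical dataset to compute overall and per-group summaries.

```python
from collections import defaultdict
from statistics import mean, median

# Hypothetical cleaned records.
cleaned = [
    {"city": "Pune", "spend": 200.0},
    {"city": "Pune", "spend": 100.0},
    {"city": "Delhi", "spend": 50.0},
]

# Overall summaries.
overall_mean = mean(r["spend"] for r in cleaned)
overall_median = median(r["spend"] for r in cleaned)

# Group spend by city, then average each group.
by_city = defaultdict(list)
for r in cleaned:
    by_city[r["city"]].append(r["spend"])
city_means = {city: mean(values) for city, values in by_city.items()}
```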

5. Storage
• Throughout all of the data life cycle stages, data must be stored in a way that is both
secure and accessible.
• Cloud-based storage solutions: These allow an organization to store data off-site and
access it remotely.
• On-site servers: These give organizations a greater sense of control over how data is
stored and used.
• Other storage media: These include hard drives, USB devices, CD-ROMs, and floppy
disks.
