Data Exploration and
Transformation
Structured data
o Structured data is data whose elements are addressable for effective analysis.
o It has been organized into a formatted repository, typically a database.
o It covers all data that can be stored in an SQL database, in tables with rows and columns.
o Its elements have relational keys and can easily be mapped into pre-designed fields.
o It is the most commonly processed kind of data in software development and the simplest to manage.
o Example: Relational data.
Example of Structured Data
Figure 1 shows customer data of Your Model Car in a
spreadsheet as an example of structured data. The tabular form and
inherent structure make this type of data analysis-ready: for example, a
computer can filter the table for customers living in the
USA (the data is machine-readable).
Typically, structured data is stored in spreadsheets (e.g. Excel files)
or in relational databases. These formats also happen to be fairly
human-readable, as Figure 1 shows, though that is not always
the case. Another common storage format for structured
data is the comma-separated values (CSV) file. Figure 2 shows
structured data in CSV format.
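As a minimal sketch of what "machine-readable" means here, the following Python snippet filters CSV records for customers living in the USA. The customer names, column names and the helper `customers_in` are hypothetical and are not taken from Figure 1 or Figure 2.

```python
import csv
import io

# Hypothetical customer records in CSV form (not the data from the figures).
CSV_TEXT = """customer_name,city,country
Example Toys,Nantes,France
Sample Models,Las Vegas,USA
Demo Wheels,San Francisco,USA
"""

def customers_in(country, csv_text=CSV_TEXT):
    """Return the rows whose 'country' column equals the given country."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row for row in reader if row["country"] == country]

usa_customers = customers_in("USA")
for row in usa_customers:
    print(row["customer_name"], "-", row["city"])
```

Because every row has the same named fields, the filter is a one-line condition on the `country` column.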
Pros and Cons of structured data
Pros of structured data
There are three key benefits of structured data:
1. Easily used by machine learning algorithms
2. Easily used by business users
3. Increased access to more tools
Cons of structured data
The cons of structured data center on a
lack of data flexibility. Here are some potential
drawbacks of using structured data:
1. A predefined purpose limits use
2. Limited storage options
Structured data tools
OLAP: Performs high-speed, multidimensional data analysis from unified, centralized
data stores.
SQLite: Implements a self-contained, serverless, zero-configuration, transactional
relational database engine.
MySQL: Embeds data into mass-deployed software, particularly mission-critical,
heavy-load production systems.
PostgreSQL: Supports SQL and JSON querying as well as high-tier programming
languages (C/C++, Java, Python, etc.).
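To illustrate how one of these tools is used, here is a small sketch with SQLite through Python's standard `sqlite3` module. The table name, columns and rows are assumptions made up for this example. The database lives in memory, so nothing is written to disk.

```python
import sqlite3

# An in-memory database: self-contained and serverless, no configuration needed.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers ("
    "id INTEGER PRIMARY KEY, name TEXT NOT NULL, country TEXT NOT NULL)"
)

# Hypothetical rows, inserted with parameter placeholders.
conn.executemany(
    "INSERT INTO customers (name, country) VALUES (?, ?)",
    [("Example Toys", "France"), ("Sample Models", "USA"), ("Demo Wheels", "USA")],
)
conn.commit()

# Query the structured data with SQL.
rows = conn.execute(
    "SELECT name FROM customers WHERE country = ? ORDER BY name", ("USA",)
).fetchall()
usa_names = [name for (name,) in rows]
print(usa_names)
conn.close()
```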
Unstructured data
o Unstructured data is data that is not organized in a predefined manner or does not have a
predefined data model, so it is not a good fit for a mainstream relational database.
o Alternative platforms exist for storing and managing unstructured data. It is
increasingly prevalent in IT systems and is used by organizations in a variety of business
intelligence and analytics applications. Examples: Word documents, PDFs, text files, media logs.
o The vast majority of all data created today is unstructured. Just think of all the text, chat, video
and audio content that is generated every day around the world! Unstructured data is typically
easy for humans to consume (e.g. images, videos and PDF documents). But because the data lacks
organization, it is very cumbersome, or even impossible, for a computer to make
sense of it without further processing.
Unstructured data examples
There is a plethora of examples of unstructured data. Just think of any image (e.g. jpeg), video
(e.g. mp4), song (e.g. mp3), document (e.g. PDF or docx) or similar file type. The image
below shows just one concrete example of unstructured data: a product image and description
text. Even though this type of data might be easy for humans to consume, it has no predefined
organization and is therefore difficult for machines to analyze and interpret.
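As a rough sketch of why free text is harder for machines, the snippet below pulls two fields out of a product description using regular expressions. The description text, the field names and the patterns are all assumptions invented for this illustration. Real descriptions vary far more, which is why this kind of extraction tends to be brittle.

```python
import re

# Hypothetical free-text product description (unstructured).
DESCRIPTION = (
    "Classic 1:18 scale model car, red, die-cast metal. "
    "Price: $49.99. Ships in 3 days."
)

def extract_fields(text):
    """Try to recover a price and a scale from free text; None if not found."""
    price = re.search(r"\$(\d+(?:\.\d{2})?)", text)
    scale = re.search(r"\b(\d+:\d+)\s+scale\b", text)
    return {
        "price_usd": float(price.group(1)) if price else None,
        "scale": scale.group(1) if scale else None,
    }

fields = extract_fields(DESCRIPTION)
print(fields)
```

If the wording changes only slightly (for example "USD 49.99"), the pattern no longer matches and the field comes back as `None`. A structured table avoids this by construction.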
Pros and cons of unstructured data
Pros of unstructured data
As there are pros and cons of structured data,
unstructured data also has strengths and
weaknesses for specific business needs. Some
of its benefits include:
1. Freedom of the native format
2. Faster accumulation rates
3. Data lake storage
Cons of unstructured data
There are also cons to using unstructured data.
It requires specific expertise and specialized
tools in order to be used to its fullest potential.
1. Requires data science expertise
2. Specialized tools
Unstructured data tools
MongoDB: Uses flexible documents to process data for cross-platform applications
and services.
DynamoDB: Delivers single-digit millisecond performance at any scale via built-in
security, in-memory caching and backup and restore.
Hadoop: Provides distributed processing of large data sets using simple
programming models and no formatting requirements.
Azure: Enables agile cloud computing for creating and managing apps through
Microsoft's data centers.
Quantitative and Qualitative data
Qualitative data
Qualitative data is descriptive and conceptual. Qualitative data can be categorized based on
traits and characteristics.
Qualitative data is non-statistical and is typically unstructured in nature. This data isn't
necessarily measured using the hard numbers used to develop graphs and charts. Instead, it is
categorized based on properties, attributes, labels, and other identifiers.
Qualitative data can be used to ask the question "why". It is investigative and is often
open-ended until further research is conducted. Data generated from qualitative research is used
for theorizing, interpretation, developing hypotheses, and forming initial understandings.
Qualitative data can be generated through:
 Texts and documents
 Audio and video recordings
 Images and symbols
 Interview transcripts and focus groups
 Observations and notes
Pros and cons of Qualitative data
Pros
Better understanding
Provides Explanation
Better Identification of behavior patterns
Cons
Limited reach
Time Consuming
Possibility of Bias
Quantitative data
In contrast to qualitative data, quantitative data is statistical and is typically structured in
nature, meaning it is more rigid and defined. This type of data is measured using numbers
and values, which makes it a more suitable candidate for data analysis.
Whereas qualitative data is open for exploration, quantitative data is much more concise and
close-ended. It can be used to ask the questions "how much" or "how many", followed by
conclusive information.
Quantitative data can be generated through:
Tests
Experiments
Surveys
Market reports
Metrics
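A short sketch of how quantitative data answers "how much" and "how many" directly. The metric (`daily_orders`) and its values are hypothetical.

```python
import statistics

# Hypothetical metric: number of orders received on five consecutive days.
daily_orders = [12, 15, 9, 20, 14]

how_many_days = len(daily_orders)               # "how many" observations
total_orders = sum(daily_orders)                # "how much" in total
mean_orders = statistics.mean(daily_orders)     # average per day
median_orders = statistics.median(daily_orders) # middle value

print(how_many_days, total_orders, mean_orders, median_orders)
```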
Pros and Cons of Quantitative data
Pros
Specific
High Reliability
Easy communication
Existing support
Cons
Limited Options
High Complexity
Requires Expertise
Four Levels of data Measurement
The way a set of data is measured is called its level of measurement. Correct
statistical procedures depend on a researcher being familiar with levels of
measurement. Not every statistical operation can be used with every set of data.
Data can be classified into four levels of measurement. They are (from lowest to
highest level):
1) Nominal level
2) Ordinal level
3) Interval level
4) Ratio level
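The idea that not every operation fits every level can be sketched as a small lookup. The operation names and the helper `is_meaningful` are assumptions chosen for this illustration. The mapping follows the usual textbook convention that each level supports everything the levels below it support.

```python
# Levels from lowest to highest, as listed above.
LEVELS = ["nominal", "ordinal", "interval", "ratio"]

# Lowest level at which each operation is conventionally considered meaningful.
MIN_LEVEL = {
    "count": "nominal",
    "mode": "nominal",
    "rank": "ordinal",
    "median": "ordinal",
    "difference": "interval",
    "mean": "interval",
    "ratio": "ratio",
}

def is_meaningful(operation, level):
    """True if the operation is meaningful for data at the given level."""
    return LEVELS.index(level) >= LEVELS.index(MIN_LEVEL[operation])

print(is_meaningful("median", "nominal"))  # False: categories have no order
print(is_meaningful("mean", "interval"))   # True: differences are defined
```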
Nominal Level
Data that is measured using a nominal scale is qualitative. Categories, colors,
names, labels and favorite foods, along with yes or no responses, are examples
of nominal level data. Nominal scale data are not ordered. Nominal scale data
cannot be used in calculations, apart from counting how often each category occurs.
Example:
1. Classifying people according to their favorite food, like pizza, spaghetti, and
sushi. Putting pizza first and sushi second is not meaningful.
2. Smartphone companies are another example of nominal scale data. Some
examples are Sony, Motorola, Nokia, Samsung and Apple. This is just a list
and there is no agreed-upon order. Some people may favor Apple but that is a
matter of opinion.
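A minimal sketch of what can be done with nominal data: counting categories and finding the most frequent one. The survey responses are made up for this example.

```python
from collections import Counter

# Hypothetical survey responses: favorite food of six people.
favorite_foods = ["pizza", "sushi", "pizza", "spaghetti", "pizza", "sushi"]

counts = Counter(favorite_foods)
most_common_food, most_common_count = counts.most_common(1)[0]

print(counts["pizza"], counts["sushi"], counts["spaghetti"])
print(most_common_food)
```

Counting works because it never relies on an order or a distance between categories. Sorting the labels or averaging them would not be meaningful.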
Ordinal Level
Data that is measured using an ordinal scale is similar to nominal scale data, with one
big difference: ordinal scale data can be ordered. Like nominal scale data, ordinal
scale data cannot be used in calculations, because the distances between values are not defined.
Example:
1. A list of the top five national parks in the United States. The parks can be ranked
from one to five, but we cannot measure differences
between the data.
2. A cruise survey where the responses to questions about the cruise are "excellent",
"good", "satisfactory", and "unsatisfactory". These responses are ordered from the most
desired response to the least desired. But the differences between two pieces of data
cannot be measured.
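The cruise survey can be sketched as follows: the responses can be sorted and a middle response found, using only their order. The list of responses is hypothetical, and the integer ranks exist only to express the order. Their differences carry no meaning.

```python
# Ordered scale, from least to most desired response.
SCALE = ["unsatisfactory", "satisfactory", "good", "excellent"]
RANK = {label: position for position, label in enumerate(SCALE)}

# Hypothetical survey responses (odd count, so there is a single middle value).
responses = ["good", "excellent", "unsatisfactory", "good", "satisfactory"]

# Sorting uses only the order of the categories.
ordered = sorted(responses, key=RANK.__getitem__)
median_response = ordered[len(ordered) // 2]

print(ordered)
print(median_response)
```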
Interval Scale Level
Data that is measured using the interval scale is similar to ordinal level data because it has a definite
ordering, but in addition the differences between data values can be measured.
The data does not, however, have a true starting point (an absolute zero).
Temperature scales like Celsius (C) and Fahrenheit (F) are measured using the interval scale. In both
temperature measurements, 40° is equal to 100° minus 60°. Differences make sense. But 0 degrees is not a true starting point
because, in both scales, 0 is not the absolute lowest temperature. Temperatures like -10° F and -15° C exist
and are colder than 0.
Interval level data can be used in calculations, but ratios cannot be compared. 80° C is not four times as
hot as 20° C (nor is 80° F four times as hot as 20° F). There is no meaning to the ratio of 80 to 20 (or four to
one).
Example:
1. Monthly income of 2000 part-time students in Texas (strictly speaking ratio-level, since an income of 0 is a true zero)
2. Highest daily temperature in Odessa
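A short worked sketch of why ratios are not meaningful on an interval scale: the same two temperatures give different ratios depending on the unit. The Kelvin conversion is added here only for comparison and is not part of the original slides. Kelvin has an absolute zero, so its ratios are physically meaningful.

```python
def celsius_to_fahrenheit(c):
    return c * 9 / 5 + 32

def celsius_to_kelvin(c):
    return c + 273.15

hot_c, mild_c = 80.0, 20.0
hot_f, mild_f = celsius_to_fahrenheit(hot_c), celsius_to_fahrenheit(mild_c)

# Differences convert consistently between the two scales (factor 9/5).
diff_c = hot_c - mild_c
diff_f = hot_f - mild_f

# Ratios do not: the answer depends on where each scale puts its zero.
ratio_c = hot_c / mild_c
ratio_f = hot_f / mild_f
ratio_k = celsius_to_kelvin(hot_c) / celsius_to_kelvin(mild_c)

print(diff_c, diff_f)
print(ratio_c, round(ratio_f, 3), round(ratio_k, 3))
```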
Ratio Scale Level
Data that is measured using the ratio scale takes care of the ratio problem and gives you the most
information. Ratio scale data is like interval scale data, but it has a true 0 point and ratios can be calculated.
You will not have a negative value in ratio scale data.
For example, four multiple choice statistics final exam scores are 80, 68, 20 and 92 (out of a possible 100
points, given that the exams are machine-graded). The data can be put in order from lowest to highest:
20, 68, 80, 92. There are no negative final exam scores, as the lowest possible score is 0 points.
The differences between the data have meaning. The score 92 is more than the score 68 by 24 points.
Ratios can be calculated. The smallest possible score is 0. So 80 is four times 20. If one student scores 80 points
and another student scores 20 points, the higher-scoring student has earned four times as many points as the
lower-scoring student.
Example:
1. Weight of 200 cancer patients in the past 5 months
2. Height of 549 newborn babies
3. Diameter of 150 donuts
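The exam-score example can be checked with a few lines of Python. This is only a sketch of the arithmetic described above, and the variable names are chosen for illustration.

```python
# The four exam scores from the example (out of a possible 100 points).
scores = [80, 68, 20, 92]

ordered = sorted(scores)   # ordering is meaningful
difference = 92 - 68       # differences are meaningful
ratio = 80 / 20            # ratios are meaningful because 0 is a true zero

print(ordered)
print(difference)
print(ratio)
```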
Data Cleaning
Data cleaning is the process of preparing data for analysis by removing or
modifying data that is incorrect, incomplete, irrelevant, duplicated, or improperly
formatted.
Such data is usually not necessary or helpful when it comes to analyzing data
because it may hinder the process or provide inaccurate results. There are several
methods for cleaning data, depending on how it is stored and on the answers
being sought.
Data cleaning is not simply about erasing information to make space for new
data, but rather finding a way to maximize a data set's accuracy without
necessarily deleting information.
How do you clean data?
Step 1: Remove duplicate or irrelevant observations
Step 2: Fix structural errors
Step 3: Filter unwanted outliers
Step 4: Handle missing data
Step 5: Validate
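The five steps above can be sketched with plain Python. Everything in this example is an assumption made for illustration: the records, the `COUNTRY_FIXES` mapping, the plausible age range, and the choice to fill missing ages with the median. Real projects often use a library such as pandas and pick thresholds and imputation rules to suit their data.

```python
import statistics

# Hypothetical raw records; None marks a missing value.
RAW = [
    {"name": "Ann", "country": "usa", "age": 34},
    {"name": "Ann", "country": "usa", "age": 34},      # exact duplicate
    {"name": "Bob", "country": "USA ", "age": None},   # stray space, missing age
    {"name": "Cara", "country": "U.S.A.", "age": 29},  # inconsistent label
    {"name": "Dan", "country": "USA", "age": 31},
    {"name": "Eve", "country": "USA", "age": 420},     # implausible outlier
]

COUNTRY_FIXES = {"usa": "USA", "u.s.a.": "USA"}

def clean(records, min_age=0, max_age=120):
    # Step 1: remove exact duplicates, keeping the first occurrence.
    seen, rows = set(), []
    for rec in records:
        key = (rec["name"], rec["country"], rec["age"])
        if key not in seen:
            seen.add(key)
            rows.append(dict(rec))  # copy so the input is not mutated

    # Step 2: fix structural errors (stray whitespace, inconsistent labels).
    for row in rows:
        country = row["country"].strip()
        row["country"] = COUNTRY_FIXES.get(country.lower(), country)

    # Step 3: filter implausible outliers; rows with a missing age stay for now.
    rows = [r for r in rows if r["age"] is None or min_age <= r["age"] <= max_age]

    # Step 4: handle missing data by filling in the median of the known ages.
    median_age = statistics.median(r["age"] for r in rows if r["age"] is not None)
    for row in rows:
        if row["age"] is None:
            row["age"] = median_age

    # Step 5: validate the result before analysis.
    assert all(min_age <= r["age"] <= max_age for r in rows)
    assert all(r["country"] == r["country"].strip() for r in rows)
    return rows

cleaned = clean(RAW)
for row in cleaned:
    print(row)
```

Note that only one record (the outlier) is dropped outright besides the duplicate. The record with the missing age is repaired rather than deleted, in line with the goal of maximizing accuracy without necessarily deleting information.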


Editor's Notes

  • #2: Structured data, typically categorized as quantitative data, is highly organized and easily decipherable by machine learning algorithms. Developed by IBM in 1974, structured query language (SQL) is the programming language used to manage structured data. By using a relational (SQL) database, business users can quickly input, search and manipulate structured data.
  • #4: Pros
    Easily used by machine learning (ML) algorithms: The specific and organized architecture of structured data eases manipulation and querying of ML data.
    Easily used by business users: Structured data does not require an in-depth understanding of different types of data and how they function. With a basic understanding of the topic relative to the data, users can easily access and interpret the data.
    Accessible by more tools: Since structured data predates unstructured data, there are more tools available for using and analyzing structured data.
    Cons
    Limited usage: Data with a predefined structure can only be used for its intended purpose, which limits its flexibility and usability.
    Limited storage options: Structured data is generally stored in data storage systems with rigid schemas (e.g., data warehouses). Therefore, changes in data requirements necessitate an update of all structured data, which leads to a massive expenditure of time and resources.
  • #8: Pros
    Native format: Unstructured data, stored in its native format, remains undefined until needed. Its adaptability increases file formats in the database, which widens the data pool and enables data scientists to prepare and analyze only the data they need.
    Fast accumulation rates: Since there is no need to predefine the data, it can be collected quickly and easily.
    Data lake storage: Allows for massive storage and pay-as-you-use pricing, which cuts costs and eases scalability.
    Cons
    Requires expertise: Due to its undefined/non-formatted nature, data science expertise is required to prepare and analyze unstructured data. This is beneficial to data analysts but alienates unspecialized business users who may not fully understand specialized data topics or how to utilize their data.
    Specialized tools: Specialized tools are required to manipulate unstructured data, which limits product choices for data managers.