Data Quality: the holy grail for a Data Fluent Organization

This deck discusses the importance of data quality for organisations. It notes that many organisations struggle with issues like being unable to perform root cause analysis on failures due to poor-quality data. It defines data quality as the degree to which data fulfills its intended purpose, discusses common data quality issues such as missing context, non-uniform definitions, and underestimating impact, outlines a conceptual data quality framework, and covers ensuring data quality across the data pipeline from collection to downstream use.
2. A little bit about me
Balvinder Khurana, Data Architect
Balvinder has 15 years of experience in building large-scale custom software and big data platform solutions for complicated client problems. She has extensive experience in the analysis, design, architecture, and development of web-based enterprise systems and analytical systems using Agile practices like Scrum and XP.
Balvinder currently works as a Data Architect and Global Data Community Lead for Thoughtworks.
3. We often hear organisations complaining about the same things
"We are not able to do the root cause analysis (RCA) of failures with the available data."
"We do not know if we can monetize our data."
"Our assortment team doesn't trust the data our platform is providing, and they are still using their old Excel-based mechanism to do assortment planning."
"Often our POS systems go down and we lose an entire chunk of data."
"We can't use the data we have to build a credit scoring algorithm, since our existing data has many income groups missing."
5. Systemic Data Quality Issues
Addressing data quality issues late in the process
A lot of time gets spent addressing data quality issues in downstream systems or data platforms, as opposed to in source systems.
Missing context
As we move downstream, context gets lost, and addressing some data quality issues leads to further data quality issues.
Non-uniform definitions
The remediation of data quality issues is often not agreed upon across teams and the organisation, which leads to trust issues in the underlying data.
Point solutions
Data quality gets looked at from the lens of the viewer, causing myopic solutions that are tactical in nature but don't address the root cause.
Lack of strategy
Data quality is addressed tactically, not as an integrated process or framework across the entire ecosystem of products and platforms of an organisation.
Under-estimating the impact
Data quality issues not only affect downstream systems such as BI/predictive dashboards; they are a big reason for teams losing trust in the data platform and hence become an impediment to change management.
6. DEFINING DATA QUALITY
Data quality refers to the ability of a given set of data to fulfill an intended purpose.
It is the degree to which a set of inherent characteristics fulfill the requirements of a system, determine the data's fitness for use, and ensure its conformance to those requirements.
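To make "fitness for use" concrete, here is a minimal, hypothetical Python sketch: the same dataset can be fit for one purpose (a trend dashboard) and unfit for another (a pricing algorithm), depending on the threshold each use case demands. The field names and thresholds are illustrative assumptions, not from the deck.

```python
# Hypothetical example: "fitness for use" depends on the intended purpose.
records = [
    {"article_id": "A1", "price": 19.99},
    {"article_id": "A2", "price": None},   # missing price
    {"article_id": "A3", "price": 4.50},
]

def completeness(rows, field):
    """Fraction of rows where `field` is present (not None)."""
    return sum(r[field] is not None for r in rows) / len(rows)

price_completeness = completeness(records, "price")            # 2/3 ~ 0.67
print(f"price completeness: {price_completeness:.0%}")
# Assumed thresholds: a dashboard tolerates gaps, a pricing algorithm does not.
print("fit for trend dashboard (>= 60%):", price_completeness >= 0.60)   # True
print("fit for pricing algorithm (>= 99%):", price_completeness >= 0.99)  # False
```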
10. Big Data ecosystems bring in additional complexities
Volume: How do we have comprehensive data quality control for PBs of data?
Variety: How do we cater to multiple types of data - structured, semi-structured, and unstructured?
Velocity: How do we have a data quality measure in time to cater for high velocity?
Veracity: How do we handle inherent impreciseness and uncertainty?
12. How do we validate the success of our solution? Example: Pricing for a Retailer
How do we validate and measure the correctness of the prices we recommend?
How do we validate the accuracy of our analytics?
How do we provide more transparency into data quality at every transformation stage in the data pipeline for the development teams?
How do we establish trust in the data and insights that we are provisioning to our business teams?
How do we enable teams to discover and use the data that is being collected in various systems?
How do we ensure legal and regulatory compliance?
Who is responsible for ensuring data quality within various systems?
13. Data Quality Framework
(Diagram: a conceptual data quality framework spanning source systems, the data platform, and downstream systems)
Baseline data quality / sensible defaults: rules authoring, a rules execution engine, and KPIs and dashboards.
Intermediate data quality: metrics definitions along the critical path.
Fit-for-purpose data quality per consumer: reports/alerts, ML algorithms, ad-hoc analysis, preventive and corrective action, and other downstream systems.
Data Discovery: an interface to enable quick discovery of, and navigation to, the right dataset.
Metadata Service: metadata ingestion/updating via APIs such as Business Glossary, Technical Metadata, Lineage, etc.
Repository & Indexing Service: metadata repositories, e.g. schemas/relations/lineage/indexing services.
Ownership of DQ: owners and SLAs/SLOs/SLIs to ensure data quality for each layer, including the business process.
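As a rough illustration of the "rules authoring", "rules execution engine", and "KPIs and dashboards" pieces above, here is a minimal sketch; the declarative rule format, the column names, and the pass-rate KPI are assumptions made for illustration, not an actual framework API.

```python
from typing import Callable

# Rules authoring: declarative (rule name, column, predicate) triples.
RULES: list[tuple[str, str, Callable]] = [
    ("price_not_negative", "price", lambda v: v is None or v >= 0),
    ("category_in_domain", "category", lambda v: v in {"apparel", "grocery", "home"}),
]

def run_rules(rows, rules):
    """Rules execution engine: compute a pass-rate KPI per rule."""
    return {
        name: sum(predicate(row.get(column)) for row in rows) / len(rows)
        for name, column, predicate in rules
    }

rows = [
    {"price": 19.99, "category": "apparel"},
    {"price": -5.00, "category": "toys"},
]
# The resulting pass rates would feed the "KPIs and dashboards" layer.
print(run_rules(rows, RULES))  # {'price_not_negative': 0.5, 'category_in_domain': 0.5}
```

Keeping authoring declarative and execution generic is what would let the same engine serve baseline, intermediate, and fit-for-purpose checks.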
14. Domain Data Products and Data Quality
(Diagram: source systems feeding domain data products, each with its own quality rules, consumed by a dynamic pricing algorithm)
Source systems: Legacy Data Warehouse, Modern Pricing System, POS / Online Sales, Surveys / Web Crawlers.
Domain data products and their quality rules:
Article - fixed values of Category.
Article Price - the price can not be negative; a unique price point per article per channel.
Sales - the total amount can be negative (returns).
Competitor Prices - multiple price points per article per channel.
Consumption: a Dynamic Pricing algorithm combines Article, Price & Sales data (there should be no outliers in price) and feeds Reports.
Each data product should be: Discoverable, Addressable, Self-describing, Trustworthy, Interoperable, Secure.
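A small pandas sketch of how the Article Price rules above might be checked; the column names and sample data are assumed for illustration.

```python
import pandas as pd

# Assumed shape of the Article Price data product.
article_price = pd.DataFrame({
    "article_id": ["A1", "A1", "A1", "A2"],
    "channel":    ["online", "store", "online", "online"],
    "price":      [19.99, 18.99, 21.99, -3.00],
})

# Rule: Article Price can not be negative.
negative = article_price[article_price["price"] < 0]

# Rule: unique price point per article per channel.
dupes = article_price[article_price.duplicated(["article_id", "channel"], keep=False)]

print("negative prices:\n", negative)                # the A2/online/-3.00 row
print("duplicate article+channel prices:\n", dupes)  # the two A1/online rows
```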
15. Data Quality across the pipeline
Pre-ETL validations: format, consistency, completeness, domain, timeliness.
Post-ETL & pre-simulation validations: metadata, data transformation, data completeness, business-specific rules, scope, joins, data copy.
Simulation validations: model validation, implementation, computation.
Aggregation validations: hierarchy, data scope, summarized values.
UI validations: representation, format, intuitiveness.
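As one possible reading of the pre-ETL stage above, here is a sketch of format, completeness, domain, and timeliness checks on a single record; the record shape, field names, and the 24-hour freshness threshold are assumptions.

```python
import re
from datetime import datetime, timedelta, timezone

def validate_pre_etl(record, now=None):
    """Return the list of pre-ETL validation errors for one record."""
    now = now or datetime.now(timezone.utc)
    errors = []
    if not re.fullmatch(r"A\d+", record.get("article_id", "")):   # format
        errors.append("format: article_id must be 'A' + digits")
    if record.get("price") is None:                               # completeness
        errors.append("completeness: missing price")
    if record.get("channel") not in {"online", "store"}:          # domain
        errors.append("domain: unknown channel")
    if now - record["event_time"] > timedelta(hours=24):          # timeliness
        errors.append("timeliness: record older than 24h")
    return errors

rec = {"article_id": "A42", "price": 9.99, "channel": "kiosk",
       "event_time": datetime.now(timezone.utc) - timedelta(hours=30)}
print(validate_pre_etl(rec))
# ['domain: unknown channel', 'timeliness: record older than 24h']
```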
#11: Volume: a comprehensive data quality assessment is not possible at this scale; data quality measures are approximate and defined in terms of probabilities and confidence intervals. Have a clear metric and metric definition for data quality.
Variety: data is also being collected from external sources: 1) data sets from the internet and mobile internet; 2) data from the Internet of Things; 3) data collected by various industries; 4) scientific, experimental, and observational data.
Velocity: we need data quality measures that are relevant as well as feasible: sampling, data quality on the fly, and structural validations instead of semantic ones.
Veracity: how do we ensure the trustworthiness of the source of the data? Otherwise, such data might skew your data quality report.
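One way to read "approximate measures defined in terms of probability and confidence intervals" is to estimate a rule's pass rate from a random sample instead of scanning the full dataset. A stdlib-only sketch with a normal-approximation 95% interval; the sample size and synthetic data are assumptions.

```python
import math
import random

def sampled_pass_rate(population, predicate, n=10_000, z=1.96):
    """Estimate a rule's pass rate from a random sample, with a ~95% CI."""
    sample = random.sample(population, n)
    p = sum(predicate(x) for x in sample) / n
    half = z * math.sqrt(p * (1 - p) / n)   # normal approximation
    return p, (p - half, p + half)

# Synthetic stand-in for a dataset too large to scan exhaustively.
population = [random.random() for _ in range(1_000_000)]
rate, (lo, hi) = sampled_pass_rate(population, lambda v: v <= 0.98)
print(f"estimated pass rate: {rate:.4f} (95% CI {lo:.4f}-{hi:.4f})")
```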
#13: The client is a huge retailer and has reached out to you to help them price their entire assortment of articles based on the data points they collect: the demand for each product, the competitor price for the same product, and whether the product has any seasonal value.
SLO/SLA/governance teams.
Business teams are losing trust in data.
How do I ascertain my data quality?
How much should we invest in data quality assurance?
Untrustworthy results or inaccurate insights from analytics were due to a lack of quality in the data fed into systems such as AI and machine learning.
#14: Data Quality framework: a hierarchical data quality framework from the perspective of data users, consisting of big data quality dimensions, quality characteristics, and quality indexes.
ROI of data quality.
Define, Measure, Analyze, Design/Improve, and Verify/Control.
#15: Data users and data providers are often different organizations with very different goals and operational procedures. Thus, it is no surprise that their notions of data quality are very different. In many cases, the data providers have no clue about the business use cases of data users (data providers might not even care about them, unless they are getting paid for the data). This disconnect between data source and data use is one of the prime reasons behind data quality issues.
#17: Plan:
The planning (or designing) phase consists of defining the scope and business need, identifying stakeholders, clarifying business rules for data, and identifying business processes. The outcome of the planning phase should clearly communicate the objectives of the DQ work to relevant senior management as well as other stakeholders.
Assess:
This phase measures the existing data with respect to business policies, data standards, and business practices. Profiling is a key component of this phase, and of course a lot has been written about profiling and assessment.
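Profiling in the Assess phase can start very small; a minimal pandas sketch of per-column completeness, distinct counts, and numeric ranges (the dataset and columns are assumed for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "article_id": ["A1", "A2", "A2", None],
    "price": [19.99, None, 4.50, 7.00],
})

numeric = df.select_dtypes("number")
profile = pd.DataFrame({
    "completeness": df.notna().mean(),   # fraction of non-null values per column
    "distinct": df.nunique(),            # distinct non-null values per column
    "min": numeric.min(),                # numeric columns only
    "max": numeric.max(),
})
print(profile)
```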
Analyze:
Typically, we use both quantitative and qualitative analytical techniques to do a gap analysis between where the data quality should be, based on what's defined in the planning phase, and where the data quality actually is.
Pilot:
There may be variations in how different organizations deal with the Pilot and Deploy phases, but we recommend a piloting phase that focuses on the specific actions needed to improve the data quality. The piloting phase might also identify any business processes that need to be adjusted to improve data quality on a sustaining basis.
Deploy:
Based on the outcomes of the pilot phase, the Deploy phase should focus on both business and technical solutions to improve data quality. The tendency of many organizations is to focus on technical solutions only and ignore business solutions; in our opinion, that is a major mistake.
Maintain:
It is very important to make sure that processes and control mechanisms are put in place to maintain the data quality efforts on an ongoing basis. Data Governance will play an important role in making sure that data quality is maintained for a sustaining program.