DATA QUALITY
THE HOLY GRAIL OF A DATA FLUENT ORGANIZATION
Balvinder Khurana
A little bit about me…
Balvinder Khurana, Data Architect
Balvinder has 15 years of experience building large-scale custom software and big data platform solutions for complex client problems. She has extensive experience in the analysis, design, architecture, and development of web-based enterprise systems and analytical systems, using Agile practices such as Scrum and XP. She currently works as a Data Architect and Global Data Community Lead at Thoughtworks.
We often hear organisations complaining about the same things:
• We are not able to do root cause analysis (RCA) of failures with the available data.
• We do not know if we can monetize our data.
• Our assortment team doesn't trust the data our platform is providing, and they are still using their old Excel-based mechanism to do assortment planning.
• Often our POS systems go down and we lose an entire chunk of data.
• We can't use the data we have to build a credit scoring algorithm, since our existing data has many income groups missing.
Garbage in,
Garbage Out!
Your analysis is only as good
as the underlying data.
Systemic Data Quality Issues
Addressing data quality issues late in the process
A lot of time is spent addressing data quality issues in downstream systems or data platforms rather than in the source systems.
Missing context
As we move downstream, context gets lost, and addressing some data quality issues creates further data quality issues.
Non-uniform definitions
The redressal of data quality issues is often not agreed upon across teams and across the organisation, which leads to trust issues in the underlying data.
Point solutions
Data quality gets looked at through the lens of the viewer, producing myopic solutions that are tactical in nature but don't address the root cause.
Lack of strategy
Data quality is addressed tactically, not as an integrated process or framework across the entire ecosystem of products and platforms in an organisation.
Under-estimating the impact
Data quality issues not only affect downstream systems such as BI and predictive dashboards; they are a big reason teams lose trust in the data platform and hence become an impediment to change management.
DEFINING DATA QUALITY
Data quality refers to the ability of a given set of
data to fulfill an intended purpose.
It is the degree to which a set of inherent characteristics fulfils the
requirements of a system, determines the data's fitness for use, and
ensures its conformance to those requirements.
Multidimensionality in Data Quality
Data quality spans many dimensions: Uniqueness, Integrity, Consistency, Trustworthiness, Standardisation, Usability, Availability, Reliability, Relevance, and Class Balance. At its core are five dimensions:
• Accuracy
• Integrity
• Consistency
• Completeness
• Auditability
Each dimension breaks down further into tenets such as Accessibility, Timeliness, Authorization, Fitness, Value, Freshness, Documentation, Credibility, Metadata, Statistical Bias, Readability, Definability, Referenceability, Reproducibility, and Interoperability.
Tenets of Data Quality
• Availability: Accessibility, Timeliness, Authorization
• Usability: Documentation, Credibility, Metadata, Statistical Bias, Readability
• Reliability: Accuracy, Integrity, Consistency, Completeness, Auditability
• Relevance: Fitness, Value, Freshness
• Standardisation: Definability, Referenceability, Reproducibility, Interoperability
Big Data ecosystems bring in additional complexities
• Volume: how do we have comprehensive data quality control for petabytes of data?
• Variety: how do we cater to multiple types of data: structured, semi-structured, and unstructured?
• Velocity: how do we have a data quality measure in time to cater for high-velocity data?
• Veracity: how do we handle inherent impreciseness and uncertainty?
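Because exact checks over petabytes are rarely feasible, quality measures at this scale become statistical estimates: a proportion plus a margin of error, as the speaker notes suggest. A minimal illustrative Python sketch (the `records` list and `estimate_completeness` helper are hypothetical, not part of the deck) estimates a completeness score from a random sample with a confidence interval:

```python
import math
import random

def estimate_completeness(records, field, sample_size=10_000, z=1.96):
    """Estimate the fraction of records with a non-null `field` from a
    random sample, with a ~95% confidence interval (normal approximation).

    At petabyte scale an exact scan is impractical, so the quality
    measure is approximate: a proportion plus a margin of error.
    """
    sample = random.sample(records, min(sample_size, len(records)))
    n = len(sample)
    complete = sum(1 for r in sample if r.get(field) is not None)
    p = complete / n
    margin = z * math.sqrt(p * (1 - p) / n)
    return p, (max(0.0, p - margin), min(1.0, p + margin))

# Hypothetical usage: 1 in 20 records is missing its price.
records = [{"price": 9.99 if i % 20 else None} for i in range(1_000_000)]
p, (lo, hi) = estimate_completeness(records, "price")
print(f"completeness ~ {p:.3f} (95% CI {lo:.3f} to {hi:.3f})")
```

The same sampling approach extends to other tenets (validity, domain conformance) when velocity or volume rules out a full scan.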
Modern Data Platforms - A Conceptual view
Example - Pricing for a Retailer
• How do we validate the success of our solution?
• How do we validate and measure the correctness of the prices we recommend?
• How do we validate our analytics accuracy?
• How do we provide more transparency into data quality at every transformation stage in the data pipeline for the development teams?
• How do we establish trust in the data and insights that we provision to our business teams?
• How do we enable teams to discover and use the data that is being collected in various systems?
• How do we ensure legal and regulatory compliance?
• Who is responsible for ensuring data quality within various systems?
Data Quality Framework
Data quality is applied in layers across the platform: baseline data quality / sensible defaults at ingestion, intermediate data quality for downstream systems, and fit-for-purpose data quality for each consumer: KPIs and dashboards, reports/alerts, ML algorithms, ad-hoc analysis, and preventive and corrective action. This is supported by rules authoring, a rules execution engine, metrics definitions, and identified critical paths.
• Data Discovery: an interface to enable quick discovery of and navigation to the right dataset
• Metadata Service: metadata ingestion/updates via APIs, covering the business glossary, technical metadata, lineage, etc.
• Repository & Indexing Service: metadata repositories, e.g. schemas, relations, lineage, and indexing services
• Ownership of DQ: owners and SLAs/SLOs/SLIs to ensure data quality for each layer, including the business process
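To make the "rules authoring" and "rules execution engine" split concrete, here is a minimal Python sketch under stated assumptions: the `Rule` dataclass, `RULES` list, and `run_rules` engine are hypothetical names, and in a real platform rules would likely be authored in YAML or through a UI rather than inline:

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class Rule:
    name: str
    field: str
    check: Callable[[Any], bool]  # returns True when a record passes

# Rules authoring: rules are declared as data, separately from the engine,
# so analysts can add or change checks without touching pipeline code.
RULES = [
    Rule("price_not_negative", "price", lambda v: v is not None and v >= 0),
    Rule("category_known", "category",
         lambda v: v in {"apparel", "grocery", "electronics"}),
]

def run_rules(records, rules):
    """Rules execution engine: evaluate each rule over a batch and return
    a pass rate per rule, which can feed KPIs and dashboards."""
    return {
        rule.name:
            sum(1 for r in records if rule.check(r.get(rule.field))) / len(records)
        for rule in rules
    }

records = [{"price": 10.0, "category": "grocery"},
           {"price": -1.0, "category": "unknown"}]
print(run_rules(records, RULES))  # {'price_not_negative': 0.5, 'category_known': 0.5}
```

The pass-rate metrics produced here are what the framework's metrics definitions and dashboards would consume for each layer's owner.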
Domain Data Products and Data Quality
Each domain data product in the pricing example carries its own quality rules:
• Article (legacy data warehouse): fixed set of values for Category
• Article Price (modern pricing system): price cannot be negative; unique price point per article per channel
• Sales (POS / online sales): total amount can be negative (returns)
• Competitor Prices (surveys / web crawlers): multiple price points per article per channel
The dynamic pricing algorithm consumes article, price, and sales data (there should be no outliers in price) and feeds reports.
Each data product should be: Discoverable, Addressable, Self-describing, Trustworthy, Interoperable, Secure.
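These per-product rules translate directly into executable checks. A minimal pandas sketch, assuming a hypothetical `prices` table with `article_id`, `channel`, and `price` columns, and using a simple 1.5×IQR heuristic as one possible reading of the "no outliers in price" rule:

```python
import pandas as pd

def check_article_prices(prices: pd.DataFrame) -> dict:
    """Encode the slide's quality rules for a hypothetical article-price
    data product with columns: article_id, channel, price."""
    issues = {}
    # Article price cannot be negative.
    issues["negative_prices"] = int((prices["price"] < 0).sum())
    # Unique price point per article per channel.
    issues["duplicate_price_points"] = int(
        prices.duplicated(subset=["article_id", "channel"]).sum()
    )
    # "No outliers in price": flag values outside 1.5 * IQR as a heuristic.
    q1, q3 = prices["price"].quantile([0.25, 0.75])
    iqr = q3 - q1
    outliers = (prices["price"] < q1 - 1.5 * iqr) | (prices["price"] > q3 + 1.5 * iqr)
    issues["price_outliers"] = int(outliers.sum())
    return issues

prices = pd.DataFrame({
    "article_id": [1, 1, 2, 3],
    "channel": ["online", "online", "store", "store"],
    "price": [9.99, 11.99, -5.00, 999.00],
})
print(check_article_prices(prices))
```

Note that the same check is not universal: a negative amount is an error for Article Price but perfectly valid for Sales (returns), which is exactly why rules belong to the data product rather than to a generic pipeline.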
Data Quality across the pipeline
• PRE-ETL VALIDATIONS: Format, Consistency, Completeness, Domain, Timeliness
• POST-ETL & PRE-SIMULATION VALIDATIONS: Metadata, Data Transformation, Data Completeness, Business-specific, Scope, Joins, Data copy
• SIMULATION VALIDATIONS: Model Validation, Implementation, Computation
• AGGREGATION VALIDATIONS: Hierarchy, Data Scope, Summarized values
• UI VALIDATIONS: Representation, Format, Intuitiveness
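As an illustration of the pre-ETL stage only, here is a hedged Python/pandas sketch of format, completeness, domain, and timeliness checks on an incoming batch; the schema (`sale_id`, `amount`, `channel`, `sold_at`), the allowed channel values, and the freshness threshold are all invented for the example:

```python
import pandas as pd

def pre_etl_validations(batch: pd.DataFrame, max_age_hours: int = 24) -> list[str]:
    """Pre-ETL checks on an incoming sales batch. Returns a list of
    failure descriptions; an empty list means the batch may proceed."""
    failures = []
    # Format: required columns must be present.
    required = {"sale_id", "amount", "channel", "sold_at"}
    if missing := required - set(batch.columns):
        failures.append(f"format: missing columns {sorted(missing)}")
        return failures
    # Completeness: no nulls in key fields.
    if batch[["sale_id", "amount"]].isna().any().any():
        failures.append("completeness: nulls in sale_id/amount")
    # Domain: channel drawn from a fixed set of values.
    if not batch["channel"].isin({"online", "store"}).all():
        failures.append("domain: unknown channel values")
    # Timeliness: the batch must be fresh enough.
    age = pd.Timestamp.now(tz="UTC") - pd.to_datetime(batch["sold_at"], utc=True).max()
    if age > pd.Timedelta(hours=max_age_hours):
        failures.append(f"timeliness: newest record is {age} old")
    return failures

batch = pd.DataFrame({"sale_id": [1], "amount": [9.99], "channel": ["online"],
                      "sold_at": [pd.Timestamp.now(tz="UTC")]})
print(pre_etl_validations(batch))  # [] -> batch may proceed to ETL
```

The later stages (post-ETL, simulation, aggregation, UI) would follow the same pattern but validate transformations, model outputs, roll-ups, and presentation rather than raw inputs.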
Data quality assessment flow
1. Define the goals of data collection
2. Determine quality dimensions
3. Determine indicators/KPIs
4. Formulate the evaluation baseline
5. Collect the data
6. Assess data quality and generate a data quality report
7. Satisfy goals?
• Yes: proceed to data analysis and data mining, data cleaning, and output of results and data
• No: run a quick pilot to improve data quality, set new goals, and repeat the cycle
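The "satisfy goals?" decision amounts to comparing measured indicators against the evaluation baseline formulated earlier. A minimal sketch, with invented indicator names and target values:

```python
def assess(indicators: dict[str, float], baseline: dict[str, float]):
    """Compare measured DQ indicators against the evaluation baseline
    and report which targets are met."""
    report = {
        name: {"score": score, "target": baseline[name], "ok": score >= baseline[name]}
        for name, score in indicators.items()
    }
    return all(entry["ok"] for entry in report.values()), report

# Invented indicator scores and baseline targets.
indicators = {"completeness": 0.97, "consistency": 0.88}
baseline = {"completeness": 0.95, "consistency": 0.90}
satisfied, report = assess(indicators, baseline)
print("satisfy goals?", satisfied)  # False -> run a quick pilot, set new goals
```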
Operationalising Data Quality
Identify → Quantify → Prioritize → Mitigate
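One lightweight way to move from Quantify to Prioritize is to score each identified issue by impact versus remediation effort, a toy sketch with invented issues and scores:

```python
# Invented backlog of identified and quantified issues (scores 1-10).
issues = [
    {"name": "income groups missing in credit data", "impact": 9, "effort": 5},
    {"name": "POS outages losing sales records", "impact": 8, "effort": 8},
    {"name": "duplicate price points per channel", "impact": 4, "effort": 2},
]

# Prioritize by a simple impact-to-effort ratio; mitigate top items first.
for issue in sorted(issues, key=lambda i: i["impact"] / i["effort"], reverse=True):
    print(f'{issue["impact"] / issue["effort"]:.1f}  {issue["name"]}')
```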
Thank You!
Reach out to us:
@Balvinder

Editor's Notes

• #8: 1 min
• #9: 1 min
• #11: Volume: a comprehensive data quality assessment is not possible; data quality measures are approximate, defined in terms of probability and confidence intervals. Have a clear metric and metric definition for data quality. Variety: data is also being collected from external sources: 1) data sets from the internet and mobile internet; 2) data from the Internet of Things; 3) data collected by various industries; 4) scientific experimental and observational data. Velocity: we need data quality measures that are relevant as well as feasible: sampling, data quality on the fly, structural validations instead of semantic ones. Veracity: how do we ensure the trustworthiness of the source of data? Otherwise, such data might skew your data quality report.
• #13: The client is a huge retailer and has reached out to you to help them price their entire assortment of articles based on the data points they collect: what is the demand for a product, what is the competitor price for the same product, does the product have any seasonal value. SLO/SLA/governance teams. Business teams are losing trust in the data. How do I ascertain my data quality? How much should I invest in data quality assurance? Untrustworthy results or inaccurate insights from analytics were due to a lack of quality in the data fed into systems such as AI and machine learning.
• #14: Data quality framework: a hierarchical data quality framework from the perspective of data users. This framework consists of big data quality dimensions, quality characteristics, and quality indexes. ROI of data quality. Define, Measure, Analyze, Design/Improve, and Verify/Control.
• #15: Data users and data providers are often different organizations with very different goals and operational procedures. Thus, it is no surprise that their notions of data quality are very different. In many cases, the data providers have no clue about the business use cases of the data users (data providers might not even care about them, unless they are getting paid for the data). This disconnect between data source and data use is one of the prime reasons behind data quality issues.
• #17: Plan: the planning (or designing) phase consists of defining scope and business need, identifying stakeholders, clarifying business rules for data, and identifying business processes. The outcome of the planning phase should clearly communicate the objectives of the DQ work to relevant senior management as well as other stakeholders. Assess: this phase measures the existing data with respect to business policies, data standards, and business practices. Profiling is a key component of this phase, and of course a lot has been written about profiling and assessment. Analyze: typically, we use both quantitative and qualitative analytical techniques to do a gap analysis between where the data quality should be, based on what's defined in the planning phase, and where the data quality actually is. Pilot: there may be variations in how different organizations deal with the Pilot and Deploy phases, but we recommend a piloting phase to focus on specific actions needed to improve the data quality. The piloting phase might also identify any business processes that need to be adjusted to improve data quality on a sustaining basis. Deploy: based on the outcomes of the pilot phase, the Deploy phase should focus on both business and technical solutions to improve data quality. The tendency of many organizations is to focus on technical solutions only and ignore business solutions, but in our opinion that is a major mistake. Maintain: it is very important to ensure that processes and control mechanisms are put in place to maintain the data quality efforts on an ongoing basis. Data governance will play an important role in making sure that data quality is maintained as a sustaining program.