It's Time to Start Caring About
Data Quality
Data Quality at Scale
Ignacio Elola
Everyone is talking about how
useful data is
data can save your business
data can save your life
2015 - Extract SF - Data Quality
but...
all that is only true if you have the
right data
data tend to be dirty and
unstructured
especially web data!
Let's start simple
I've created an extractor
I've passed it a bunch of queries (bulk)
and got a dataset
How can you QA this data?
eyeballing
eyeballing we can find anomalies
without having domain expertise
Quick summary:
- create extractors
- combine extractors
- schedule data extraction
What if we need to scale up?
if you have:
- more than ~3 data sources
- more than ~2 extractors per data source
- big volume of queries
- pre or post processing
you will need:
- people to create and maintain
extractors
- process to clean and validate
data
Data Quality
think about it pre and post data
extraction!
tips and tricks to increase data
quality
XPaths
//div[@id="priceBlock"]/table/tbody/tr/td[b/@class="priceLarge"]/b
better than
//*[@id="priceBlock"]/table/tbody/tr[2]/td[2]/b[1]
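Why the first form tends to be more robust can be sketched with a small experiment. The page fragments below are invented for illustration, and because Python's standard-library `xml.etree.ElementTree` supports only a subset of XPath, the class-based expression is an adapted equivalent rather than the exact one on the slide.

```python
import xml.etree.ElementTree as ET

# Hypothetical page fragments; the markup is invented for illustration.
ORIGINAL = """
<html><body><div id="priceBlock"><table><tbody>
<tr><td>List price:</td><td><b class="priceSmall">$25.00</b></td></tr>
<tr><td>Price:</td><td><b class="priceLarge">$19.99</b></td></tr>
</tbody></table></div></body></html>
"""

# The same page after the site adds a promotional row above the prices.
REDESIGNED = """
<html><body><div id="priceBlock"><table><tbody>
<tr><td>Promo:</td><td><b class="promo">Free shipping</b></td></tr>
<tr><td>List price:</td><td><b class="priceSmall">$25.00</b></td></tr>
<tr><td>Price:</td><td><b class="priceLarge">$19.99</b></td></tr>
</tbody></table></div></body></html>
"""

# Position-based: depends on the price sitting in row 2, cell 2.
POSITIONAL = ".//*[@id='priceBlock']/table/tbody/tr[2]/td[2]/b[1]"
# Class-based: depends only on the element being marked as the large price.
BY_CLASS = ".//div[@id='priceBlock']/table/tbody/tr/td/b[@class='priceLarge']"


def extract(markup, xpath):
    """Return the text of the first node matching xpath, or None."""
    root = ET.fromstring(markup.strip())
    node = root.find(xpath)
    return node.text if node is not None else None


if __name__ == "__main__":
    for label, page in (("original", ORIGINAL), ("redesigned", REDESIGNED)):
        print(label, extract(page, POSITIONAL), extract(page, BY_CLASS))
```

On the redesigned fragment the position-based expression silently returns the list price instead of the sale price, which is harder to spot than an outright failure.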
Regex
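A minimal sketch of the kind of cleanup a regex can do on an extracted value, here turning a raw price string into a number. The pattern, the function name `parse_price` and the sample strings are assumptions for illustration, and the pattern assumes US-style formatting (comma for thousands, dot for decimals).

```python
import re

# First run of digits, with optional thousands commas and decimal part.
NUMBER_RE = re.compile(r"\d[\d,]*(?:\.\d+)?")


def parse_price(text):
    """Return the first number found in text as a float, or None."""
    match = NUMBER_RE.search(text or "")
    if match is None:
        return None
    return float(match.group(0).replace(",", ""))


if __name__ == "__main__":
    for raw in ("Price: $1,299.99 (incl. tax)", "$19.99", "N/A"):
        print(repr(raw), "->", parse_price(raw))
```

Returning `None` for values with no number keeps bad cells visible instead of turning them into zeros.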
More at:
http://support.import.io/knowledgebase/articles/341182-xpaths-regex
http://www.w3schools.com/xsl/xpath_intro.asp
Required column
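One way to read "required column" is as a rule that a row only counts if certain columns are filled. The sketch below applies that rule after extraction. The function names and sample rows are hypothetical, and this is not a description of how any particular extraction tool implements the feature.

```python
def is_blank(value):
    """Treat None and empty or whitespace-only strings as missing."""
    return value is None or (isinstance(value, str) and not value.strip())


def split_by_required(rows, required):
    """Split rows into (kept, rejected) by whether all required columns are filled."""
    kept, rejected = [], []
    for row in rows:
        if any(is_blank(row.get(column)) for column in required):
            rejected.append(row)
        else:
            kept.append(row)
    return kept, rejected


if __name__ == "__main__":
    rows = [
        {"name": "A", "price": "$1"},
        {"name": "B", "price": ""},
        {"name": "C"},
    ]
    kept, rejected = split_by_required(rows, ["price"])
    print(len(kept), "kept;", len(rejected), "rejected")
```

Keeping the rejected rows, rather than discarding them, makes it possible to review why they failed.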
measuring data quality
completeness
coverage
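The slides name these two metrics without defining them, so the sketch below uses assumed working definitions: completeness as the share of expected cells that are filled, and coverage as the share of submitted queries that returned at least one row. The `query` column name and the sample data are hypothetical.

```python
def is_blank(value):
    """Treat None and empty or whitespace-only strings as missing."""
    return value is None or (isinstance(value, str) and not value.strip())


def completeness(rows, columns):
    """Share of expected cells (rows x columns) that are filled, from 0 to 1."""
    total = len(rows) * len(columns)
    if total == 0:
        return 0.0
    filled = sum(
        1 for row in rows for column in columns if not is_blank(row.get(column))
    )
    return filled / total


def coverage(queries, rows, key="query"):
    """Share of distinct queries that produced at least one row, from 0 to 1."""
    wanted = set(queries)
    if not wanted:
        return 0.0
    answered = {row.get(key) for row in rows}
    return len(wanted & answered) / len(wanted)


if __name__ == "__main__":
    queries = ["a", "b", "c", "d"]
    rows = [
        {"query": "a", "name": "x", "price": "1"},
        {"query": "a", "name": "y", "price": ""},
        {"query": "b", "name": "z", "price": None},
    ]
    print("completeness:", completeness(rows, ["name", "price"]))
    print("coverage:", coverage(queries, rows))
```

Tracking both numbers per extractor and per run gives a baseline to compare against before scaling up.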
post-extraction data quality
improvements?
how we do it
Smart automation
anomaly detection
variance, variability, noise
normalization
confidence score
Human input
Transparency
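As an illustration of the anomaly detection item, the sketch below flags outliers in a numeric column with a median-based (modified z-score) test. This is a generic technique swapped in as an example, not the method used by the speaker's team. The 3.5 threshold is a commonly used rule of thumb and the sample prices are invented.

```python
from statistics import median


def flag_anomalies(values, threshold=3.5):
    """Return one boolean per value: True where the value looks like an outlier.

    Uses the modified z-score, 0.6745 * (x - median) / MAD, which is less
    distorted by extreme values than a mean and standard deviation test.
    """
    med = median(values)
    mad = median(abs(v - med) for v in values)  # median absolute deviation
    if mad == 0:
        # At least half the values equal the median; flag anything that differs.
        return [v != med for v in values]
    return [abs(0.6745 * (v - med) / mad) > threshold for v in values]


if __name__ == "__main__":
    prices = [19.99, 20.49, 19.79, 21.00, 1999.0, 20.10]
    for price, suspicious in zip(prices, flag_anomalies(prices)):
        print(price, "<- check" if suspicious else "")
```

Flagged values are candidates for the human input step rather than automatic deletions, since a true price can also be unusual.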
summary
Data Quality is essential
think about it from the very
beginning
develop a process to measure
data quality before scaling up
if you don't want to reinvent the
wheel - contact us!
Thank you
ignacio.elola@import.io
