Slides from Vancouver DevOpsDays talk.
Schema from Nothing: Transformation of JSON into PostgreSQL 9.4 SQL+JSONB Power Schema
Video is https://www.youtube.com/watch?v=JnNWn3BoAcM#t=10588
1. Schema From Nothing
Transformation of JSON into PostgreSQL
9.4 SQL+JSONB Power Schema
Christopher W.V. Hogue, Ph.D.
cwvhogue@gmail.com @cwvhogue
2. From JSON to Data Warehouse
- SQL Normalization:
  You stay in DevOpsLand
  and I show you how deep
  the Rabbit Hole goes.
- NoSQL Solution:
  This story ends.
  You wake up in your bed
  and believe whatever you
  want to believe.
3. From JSON to Data Warehouse
- 1st Normal Form
- 2nd Normal Form
- 3rd Normal Form
- BCNF
- Denormalization
- Star Schema
- Materialized Views
- Non-unique Indexes
Normalization:
Like a splinter in your
mind, driving you mad.
Schema:
A prison for your mind.
4. JSON as The Matrix.
5. Power Law → Power Schema
(figure: power-law distribution of keys, with a sparse edge and a repetitive edge)
6. Power Schema
- Sparse Table
  - Simple key-value store
  - Flexible, agile
  - Add anything you like, later.
7. Power Schema
- Sparse Table
  - Simple key-value store
- Main Table
  - SQL typed columns
  - JSONB columns
8. Power Schema
- Sparse Table
  - Simple key-value store
- Main Table
- Repetitive Table
  - Hashed lookup
  - Unique column combinations
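The three-table split above can be sketched in node.js. This is a minimal sketch under simplifying assumptions: it classifies keys by raw frequency, whereas the talk's actual classifier uses a Shannon-information heuristic, and the names `classifyKeys`, `sparseMax`, and `repetitiveMaxDistinct` are hypothetical, not the talk's tool API.

```javascript
// Sketch: split sampled JSON records into the three Power Schema tables.
// Rare keys go to the sparse (_s) table, low-entropy keys to the
// repetitive (_r) table, everything else to the main (_m) table.
// Thresholds are illustrative only.
function classifyKeys(records, sparseMax = 0.1, repetitiveMaxDistinct = 0.05) {
  const n = records.length;
  const seen = new Map(); // key -> { count, values: Set }
  for (const rec of records) {
    for (const [k, v] of Object.entries(rec)) {
      let s = seen.get(k);
      if (!s) seen.set(k, (s = { count: 0, values: new Set() }));
      s.count++;
      s.values.add(JSON.stringify(v));
    }
  }
  const tables = { _m: [], _r: [], _s: [] };
  for (const [k, s] of seen) {
    if (s.count / n <= sparseMax) tables._s.push(k);           // rare key -> sparse KV table
    else if (s.values.size / s.count <= repetitiveMaxDistinct)
      tables._r.push(k);                                       // few distinct values -> repetitive table
    else tables._m.push(k);                                    // everything else -> main table
  }
  return tables;
}
```

A key like `host` that repeats a handful of values lands in `_r`, a unique `id` in `_m`, and a key present in only a few records in `_s`.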
13. JSONB Implementation
- PostgreSQL 9.4b introduces JSONB
  - JSONB makes each SQL column behave like a NoSQL JSON document store.
- Blue Pill Solution:
  - Jam JSON into ONE JSONB column, done.
- Red Pill Solution:
  - Realize that PostgreSQL is strongly typed.
  - JSONB CASTs will fail at query time.
  - Detect types, and CAST at LOAD time.
14. JSONB Implementation
- PostgreSQL 9.4b introduces JSONB
  - JSONB makes each SQL column behave like a NoSQL JSON document store.
- Blue Pill Solution:
  - Jam JSON into ONE JSONB column, done.
- Red Pill Solution:
  - Know that PostgreSQL columns are strongly typed.
  - JSONB CASTs may fail at query time. {"time": "default"}
  - Detect types, and CAST before LOADing.
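The fail-early idea can be sketched as a load-time validator: reject a value that will not CAST to its inferred SQL type before it ever reaches the database, instead of discovering the bad CAST at query time. `castForLoad` is a hypothetical helper, not the talk's tool; the `{"time": "default"}` record from the slide is exactly the kind of value that slips through a JSONB-only load.

```javascript
// Fail-early sketch: validate and normalize a JSON value against its
// inferred SQL type at load time. A value like "default" in a timestamp
// column throws here, not in some later query's CAST.
function castForLoad(value, sqlType) {
  if (sqlType === "timestamptz") {
    const t = Date.parse(value);
    if (Number.isNaN(t))
      throw new Error(`load-time CAST failure: ${JSON.stringify(value)} is not a timestamp`);
    return new Date(t).toISOString();
  }
  if (sqlType === "bigint") {
    if (!/^-?\d+$/.test(String(value)))
      throw new Error(`load-time CAST failure: ${JSON.stringify(value)} is not an integer`);
    return String(value);
  }
  return String(value); // anything else falls back to text
}
```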
15. Inferring Types from JSON
- Detect column type information
  - Streaming Decision Tree & 'Meta' Finite State Machine (node.js)
- Use type information to
  - Generate SQL table structure and types
  - Convert JSON into TSV for fast loading
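The two outputs of the type-detection pass can be sketched as follows. The function and table names (`toCreateTable`, `toTsvRow`, `logs_m`) are illustrative, not the talk's actual node.js tool API, which does this with a streaming FSM.

```javascript
// Sketch: use inferred column types to emit (1) the SQL table
// declaration and (2) a COPY-ready TSV row.
function toCreateTable(name, cols) {
  const defs = Object.entries(cols).map(([c, t]) => `  ${c} ${t}`).join(",\n");
  return `CREATE TABLE ${name} (\n${defs}\n);`;
}

function toTsvRow(cols, rec) {
  // PostgreSQL's COPY text format spells NULL as \N.
  return Object.keys(cols)
    .map((c) => (rec[c] == null ? "\\N" : String(rec[c])))
    .join("\t");
}
```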
16. Detect Which Types?
- Numeric types
- VARCHAR
- Nested JSON
- IP addresses
- MAC addresses
- Timestamps
- Arrays (of the above)
Convert these to strict PostgreSQL types when possible prior to batch loading.
Especially JSON arrays!
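Per-value detection for the types listed above can be sketched with a few ordered pattern checks. The regexes are simplified stand-ins for the talk's streaming decision tree (real inputs need stricter validation, e.g. inet octet ranges), and `detectType` is a hypothetical name.

```javascript
// Sketch: map a single string value to a candidate PostgreSQL type.
// Order matters: more specific patterns are tried before falling back
// to varchar.
function detectType(s) {
  if (/^-?\d+$/.test(s)) return "bigint";
  if (/^-?\d+\.\d+$/.test(s)) return "double precision";
  if (/^(\d{1,3}\.){3}\d{1,3}$/.test(s)) return "inet";
  if (/^([0-9a-f]{2}:){5}[0-9a-f]{2}$/i.test(s)) return "macaddr";
  if (!Number.isNaN(Date.parse(s)) && /\d{4}-\d{2}-\d{2}/.test(s)) return "timestamptz";
  return "varchar";
}
```

Running every sampled value of a column through this and taking the least specific type seen gives the column's declared type.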
17. JSON ETL steps to TSV
- JSON
  - Sample & classify
  - Split into 3 tables: _m, _r, _s
  - Detect types in each table, each column.
- Tab Separated Value
  - PostgreSQL load form
  - Fast parser
  - Column-type rules
- Type detection step information used to generate painless TSV load form AND SQL table declaration.
18. JSON ETL steps to TSV
- JSON
  - Sample & classify
  - Split into 3 tables: _m, _r, _s
  - Detect types in each table, each column.
- Tab Separated Value
  - PostgreSQL load form
  - Fast parser
  - Column-type rules
- Type detection step information for
  - painless TSV load form
  - SQL table declaration
19. Load into PostgreSQL 9.4b
- COPY table_m FROM 'file.tsv';
- Query with SQL + JSONB operators, with fewer CASTs and validation errors at query time.
- A Tableau-, R-, ODBC-, and JDBC-ready analytical data source.
- Your JSON is now in SQL.
#2: Title: Schema from Nothing: Transformation of JSON into PostgreSQL 9.4 SQL+JSONB Power Schema.
Speaker Name: Christopher W.V. Hogue, Ph.D., Senior Data Scientist, Joyent Inc.
Abstract: The new PostgreSQL 9.4b JSONB feature allows us to re-imagine a strategy for transforming arbitrarily structured JSON key-value data into an SQL queryable form. An automated approach to a new normalization strategy has been developed which generates mixed columns of native SQL types and JSONB. This approach is implemented in a set of open source tools used to transform several JSON sets stored in key-value systems and combine them into an SQL data warehouse for cloud operations reporting at Joyent. The transformation handles the two edge cases of sparse and repetitive JSON data using a generic table structure called a Power Schema. Classification of JSON keys into the Power Schema columns uses a heuristic based on Shannon Information. The Power Schema table structure maintains the flexibility of JSON and schema-free input data by allowing storage of unanticipated JSON key-value pairs that may later appear in data. Base SQL data types are inferred with streaming tools that aggregate type and size information across the input JSON, and this can operate in Map/Reduce mode. A hierarchy of types promotes JSON singleton or array columns into native PostgreSQL types containing integers, floating point numbers, UUIDs, timestamps, MAC addresses, and IP addresses. These are preferred to JSON array forms according to a fail-early philosophy: base SQL types are validated and type cast at load time, whereas the contents of JSONB typed JSON arrays are not, leading to late failures at query time. The transformed reporting database integrates several separate JSON key-value stores and offers straightforward connectivity and query with SQL-based dashboarding and reporting tools, including R and Tableau.
Twitter Handle: @cwvhogue
#3: SQL Normalization is set up to deal with anomalies in transaction processes. Prevent data inconsistency.
Transactions done in JSON: machine generated logs, system updates, system state information, time series state, current state.
Conventionally, the data modeler should fully normalize, then denormalize for performance.
What to do when there is no a priori data model, no normalization, just piles of JSON?
#5: Toy example that is in fact generalizable: single characters, but it could be ANY type of data.
#6: Zipfian, Pareto, don't care what kind of power law it is.
In general collections of data have common varying elements, repetitive elements and rare elements.
#10: This effects a data compression scheme: removed tags, repeating values, and mostly-null columns...