際際滷

際際滷Share a Scribd company logo
Schema From Nothing
Transformation of JSON into PostgreSQL
9.4 SQL+JSONB Power Schema
Christopher W.V. Hogue, Ph.D.
cwvhogue@gmail.com @cwvhogue
From JSON to Data Warehouse
? SQL Normalization
You stay in DevOpsLand
And I show you how deep
The Rabbit Hole Goes.
? NoSQL Solution
This story ends.
You wake up in your bed
and believe whatever you
want to believe.
cwvhogue@gmail.com @cwvhogue
From JSON to Data Warehouse
? 1st Normal Form
? 2nd Normal Form
? 3rd Normal Form
? BCNF
? Denormalization
? Star Schema
? Materialized Views
? Non-unique Indexes
Normalization:
Like a splinter in your
mind, driving you mad.
Schema:
A prison for your mind.cwvhogue@gmail.com @cwvhogue
JSON as The Matrix.
cwvhogue@gmail.com @cwvhogue
Power Law C Power Schema
Sparse
Edge
Repetitive
Edge
cwvhogue@gmail.com @cwvhogue
Power Schema
? Sparse Table
C Simple key-value store
C Flexible, agile
C Add anything you like,
later.
cwvhogue@gmail.com @cwvhogue
Power Schema
? Sparse Table
C Simple key-value store
? Main Table
C SQL typed columns
C JSONB columns
cwvhogue@gmail.com @cwvhogue
Power Schema
? Sparse Table
C Simple key-value store
? Main Table
? Repetitive Table
C Hashed lookup
C Unique column
combinations
cwvhogue@gmail.com @cwvhogue
The Matrix, Reloaded
The Power Schema
Sparse Table
Main Table
Repetitive Table
Sample C Measure Information
cwvhogue@gmail.com @cwvhogue
Scalable Classifier Heuristic
cwvhogue@gmail.com @cwvhogue
Classifying JSON
cwvhogue@gmail.com @cwvhogue
JSONB Implementation
? PostgreSQL 9.4b introduces JSONB
C JSONB makes each SQL column behave like a
NoSQL JSON document store.
? Blue Pill Solution:
C Jam JSON into ONE JSONB column, done.
? Red Pill Solution:
C Realize that PostgreSQL is strongly typed.
C JSONB CASTs will fail at query time.
C Detect types, and CAST at LOAD Time.
cwvhogue@gmail.com @cwvhogue
JSONB Implementation
? PostgreSQL 9.4b introduces JSONB
C JSONB makes each SQL column behave like a
NoSQL JSON document store.
? Blue Pill Solution:
C Jam JSON into ONE JSONB column, done.
? Red Pill Solution:
C Know that PostgreSQL columns are strongly typed.
C JSONB CASTs may fail at query time. {^time ̄ : ^default ̄}
C Detect types, and CAST before LOADing.
cwvhogue@gmail.com @cwvhogue
Inferring Types from JSON
? Detect column type information
C Streaming Decision Tree &
`Meta¨ Finite State Machine (node.js)
? Use type information to
C Generate SQL table structure and types
C Convert JSON into TSV for fast loading
cwvhogue@gmail.com @cwvhogue
Detect Which Types?
? Numeric types
? VARCHAR
? Nested JSON
? IP addresses
? Mac addresses
? Timestamps
? Arrays (of the above)
Convert these to strict
PostgreSQL types when
possible prior to batch
loading.
Especially JSON arrays!
cwvhogue@gmail.com @cwvhogue
JSON ETL steps to TSV
? JSON
C Sample & Classify
C Split into 3 tables
? _m
? _r
? _s
C Detect types in each
table, each column.
? Tab Separated Value
C PostgreSQL load form
C Fast parser
C Column-type rules
? Type detection step
information used to
generate painless TSV
load form AND SQL
table declaration
cwvhogue@gmail.com @cwvhogue
JSON ETL steps to TSV
? JSON
C Sample & Classify
C Split into 3 tables
? _m
? _r
? _s
C Detect types in each
table, each column.
? Tab Separated Value
C PostgreSQL load form
C Fast parser
C Column-type rules
? Type detection step
information for
C painless TSV load form
C SQL table declaration
cwvhogue@gmail.com @cwvhogue
Load in to PostgreSQL 9.4b
? COPY table_m FROM ¨file.tsv¨;
? Query with SQL + JSONB operators with fewer
CASTS and validation errors at query time.
? A Tableau, R, ODBC, JDBC, ready
analytical data source.
? Your JSON is now in SQL.
cwvhogue@gmail.com @cwvhogue
Repo (In progress)
https://github.com/joyent/moray-etl-jsonb
Email: cwvhogue@gmail.com
Acknowledgements: @benr
cwvhogue@gmail.com @cwvhogue

More Related Content

141115 dev opsdays_jso_ntosql

  • 1. Schema From Nothing Transformation of JSON into PostgreSQL 9.4 SQL+JSONB Power Schema Christopher W.V. Hogue, Ph.D. cwvhogue@gmail.com @cwvhogue
  • 2. From JSON to Data Warehouse ? SQL Normalization You stay in DevOpsLand And I show you how deep The Rabbit Hole Goes. ? NoSQL Solution This story ends. You wake up in your bed and believe whatever you want to believe. cwvhogue@gmail.com @cwvhogue
  • 3. From JSON to Data Warehouse ? 1st Normal Form ? 2nd Normal Form ? 3rd Normal Form ? BCNF ? Denormalization ? Star Schema ? Materialized Views ? Non-unique Indexes Normalization: Like a splinter in your mind, driving you mad. Schema: A prison for your mind.cwvhogue@gmail.com @cwvhogue
  • 4. JSON as The Matrix. cwvhogue@gmail.com @cwvhogue
  • 5. Power Law C Power Schema Sparse Edge Repetitive Edge cwvhogue@gmail.com @cwvhogue
  • 6. Power Schema ? Sparse Table C Simple key-value store C Flexible, agile C Add anything you like, later. cwvhogue@gmail.com @cwvhogue
  • 7. Power Schema ? Sparse Table C Simple key-value store ? Main Table C SQL typed columns C JSONB columns cwvhogue@gmail.com @cwvhogue
  • 8. Power Schema ? Sparse Table C Simple key-value store ? Main Table ? Repetitive Table C Hashed lookup C Unique column combinations cwvhogue@gmail.com @cwvhogue
  • 9. The Matrix, Reloaded The Power Schema Sparse Table Main Table Repetitive Table
  • 10. Sample C Measure Information cwvhogue@gmail.com @cwvhogue
  • 13. JSONB Implementation ? PostgreSQL 9.4b introduces JSONB C JSONB makes each SQL column behave like a NoSQL JSON document store. ? Blue Pill Solution: C Jam JSON into ONE JSONB column, done. ? Red Pill Solution: C Realize that PostgreSQL is strongly typed. C JSONB CASTs will fail at query time. C Detect types, and CAST at LOAD Time. cwvhogue@gmail.com @cwvhogue
  • 14. JSONB Implementation ? PostgreSQL 9.4b introduces JSONB C JSONB makes each SQL column behave like a NoSQL JSON document store. ? Blue Pill Solution: C Jam JSON into ONE JSONB column, done. ? Red Pill Solution: C Know that PostgreSQL columns are strongly typed. C JSONB CASTs may fail at query time. {^time ̄ : ^default ̄} C Detect types, and CAST before LOADing. cwvhogue@gmail.com @cwvhogue
  • 15. Inferring Types from JSON ? Detect column type information C Streaming Decision Tree & `Meta¨ Finite State Machine (node.js) ? Use type information to C Generate SQL table structure and types C Convert JSON into TSV for fast loading cwvhogue@gmail.com @cwvhogue
  • 16. Detect Which Types? ? Numeric types ? VARCHAR ? Nested JSON ? IP addresses ? Mac addresses ? Timestamps ? Arrays (of the above) Convert these to strict PostgreSQL types when possible prior to batch loading. Especially JSON arrays! cwvhogue@gmail.com @cwvhogue
  • 17. JSON ETL steps to TSV ? JSON C Sample & Classify C Split into 3 tables ? _m ? _r ? _s C Detect types in each table, each column. ? Tab Separated Value C PostgreSQL load form C Fast parser C Column-type rules ? Type detection step information used to generate painless TSV load form AND SQL table declaration cwvhogue@gmail.com @cwvhogue
  • 18. JSON ETL steps to TSV ? JSON C Sample & Classify C Split into 3 tables ? _m ? _r ? _s C Detect types in each table, each column. ? Tab Separated Value C PostgreSQL load form C Fast parser C Column-type rules ? Type detection step information for C painless TSV load form C SQL table declaration cwvhogue@gmail.com @cwvhogue
  • 19. Load in to PostgreSQL 9.4b ? COPY table_m FROM ¨file.tsv¨; ? Query with SQL + JSONB operators with fewer CASTS and validation errors at query time. ? A Tableau, R, ODBC, JDBC, ready analytical data source. ? Your JSON is now in SQL. cwvhogue@gmail.com @cwvhogue
  • 20. Repo (In progress) https://github.com/joyent/moray-etl-jsonb Email: cwvhogue@gmail.com Acknowledgements: @benr cwvhogue@gmail.com @cwvhogue

Editor's Notes

  • #2: Title:?Schema?from Nothing: Transformation of JSON into PostgreSQL 9.4 SQL+JSONB Power?Schema. Speaker Name: Christopher W.V. Hogue, Ph.D., Senior Data Scientist, Joyent Inc. Abstract: The new PostgreSQL 9.4b JSONB feature allows us to re-imagine a strategy for transforming arbitrarily structured JSON key-value data into an SQL queryable form. An automated approach to a new normalization strategy has been developed which generates mixed columns of native SQL types and JSONB. This approach is implemented in a set of open source tools used to transform several JSON sets stored in key-value systems and combine them into an SQL data warehouse for cloud operations reporting at Joyent. The transformation handles the two edge cases of sparse and repetitive JSON data using a generic table structure called a Power?Schema. Classification of JSON keys into the Power?Schema?columns uses a heuristic based on Shannon Information. The Power?Schema?table structure maintains the flexibility of JSON and?schema-free input data by allowing storage of unanticipated JSON key-value pairs that may later appear in data. Base SQL data types are inferred with streaming tools that aggregate type and size information across the input JSON, and this can operate in Map/Reduce mode. A hierarchy of types promotes JSON singleton or array columns into native PostgreSQL types containing integers, floating point numbers, UUIDs, timestamps, mac addresses, and IP addresses. These are preferred to JSON array forms according to a fail-early philosophy: base SQL types are validated and type cast at load time, whereas the contents of JSONB typed JSON arrays are not - leading to late failures at query time. The transformed reporting database integrates several separate JSON key-value stores and offers straightforward connectivity and query with SQL-based dashboarding and reporting tools, including R and Tableau.? Twitter Handle: @cwvhogue
  • #3: SQL Normalization is set up to deal with anomalies in transaction processes. Prevent data inconsistency. Transactions done in JSON C machine generated logs, system updates, system state information, time series state, current state. Conventionally, the data modeler should fully normalize, then denormalize for performance. What to do when there is no a-priori data model, no normalization, just piles of JSON?
  • #4: SQL Normalization is set up to deal with anomalies in transaction processes. Prevent data inconsistency. Transactions done in JSON C machine generated logs, system updates, system state information, time series state, current state. Conventionally, the data modeler should fully normalize, then denormalize for performance. What to do when there is no a-priori data model, no normalization, just piles of JSON?
  • #5: Toy example that is in fact generalizable C single characters C but could be ANY type of data
  • #6: Zipfian, Pareto, don¨t care what kind of power-law it is In general collections of data have common varying elements, repetitive elements and rare elements.
  • #10: This effects a data compression scheme, removed tags, repeating values, and mostly-null columns...