The document proposes JECB, a new approach to partitioning OLTP databases across multiple nodes. JECB uses join paths between tables defined by foreign key relationships to determine how to partition tables. It analyzes transaction SQL code to identify optimal join trees for partitioning. JECB partitions individual transaction types separately then combines results. Experiments show JECB partitions TPC-C and TPC-E databases with fewer distributed transactions than prior approaches, demonstrating it handles complex workloads better than previous statistics-based or schema-based partitioning methods.
1 of 18
More Related Content
Jecb sigmod2014
1. JECB: a Join-Extension,
Code-Based Approach to
OLTP Data Partitioning
Khai Q. Tran -- Oracle Labs1
Jeffrey F. Naughton UW Madison
Bruhathi Sundarmurthy -- UW Madison
Dimitris Tsirogiannis -- Cloudera2
1
1: Work done at UW Madison
2: Work done at the Microsoft Gray Systems Lab
2. Motivation
Modern OLTP applications often require
high transaction throughput
Scaling out OLTP transactions in
distributed systems is challenging
Communication protocols
(2PC, Paxos) are expensive
A smart way to partition the
database to avoid distributed
transactions is crucial
2
3. The problem
Given a database, a transaction
workload, and k processing elements,
horizontally split the database into k
partitions such that the number of
distributed transactions is minimized.
Tuesday, June 24, 2014 3
4. Work done in the past: parallel
database partitioning
Tuesday, June 24, 2014 4
OLTP Parallel database
Goal
Wants local
execution, avoids
splitting xacts
Wants parallelism
power by splitting
query/xact
Workload
properties
Simple queries,
touching few
records
Complex queries,
large scans
Old techniques may not be
applicable for the new problem
5. State of the art 1 - Schism
A statistic-based fine-granularity
approach:
Models training transactions as a tuple graph:
tuples nodes and transactions edges
connecting the tuples
Uses graph min-cut algorithms to partition the
graph into k partitions
Mapping tuple partition id
Uses machine learning techniques to infer
partitioning attributes from the sample
mapping
May not scale well for large databasesTuesday, June 24, 2014 5
6. State of the art 2 -
Horticulture
A schema-based approach
Uses schema to search for the best
partitioning attribute for each table
Employs an advanced search technique
(large-neighborhood search) to reduce the
search space
Uses training trace and a data skew-aware
cost function to evaluate the partitioning
quality
May not find good partitioning solutions for
complicated workloadsTuesday, June 24, 2014 6
7. Our approach - JECB
Join-extension (idea):
Explicitly uses key-foreign key relationships from
schema to partition through joins
Code-based (algorithm):
Analyze SQL code from store procedures to find the
best way joining tables together
Divide-and-conquer (optimization):
Consider each class of transactions separately and
combine results later
Tuesday, June 24, 2014 7
8. Simple example for Join
Extension
Two tables: R(X, Y) and S(Y, Z)
Query: SELECT * FROM R, S
WHERE R.Y = S.Y AND S.Z = 100
The best way to partition R ?
- Not by any columns of R
- By S.Z and using join R.Y = S.Y
Tuesday, June 24, 2014 8
9. Our data partitioning
framework
Phase 1: Pre-processing
Collecting workload trace (sample run) and
identifying read-only tables
Phase 2: Partitioning individual transaction
types
Finding the best partitioning solution for each
type of transactions using join extensions in
transaction source code
Phase 3: Merging
Combining results from all transaction types to
find the best global solution.
Tuesday, June 24, 2014 9
10. Phase 2 demo on TPC-E Customer-Position
transactions
IF (@cust_id = 0)
SELECT @cust_id = C_ID
FROM CUSTOMER
WHERE C_TAX_ID = @tax_id
SELECT CA_ID, CA_BAL,
SUM(ISNULL(HS_QTY, 0) * ISNULL(LT_PRICE, 0))
FROM CUSTOMER_ACCOUNT WITH (UPDLOCK) LEFT LOOP JOIN
HOLDING_SUMMARY ON CA_ID=HS_CA_ID LEFT LOOP JOIN LAST_TRADE ON HS_S_SYMB = LT_S_SYMB
WHERE CA_C_ID = @cust_id
GROUP BY CA_ID,CA_BAL
ORDER BY 3 ASC
SELECT TOP 30
TH_DTS,
T_QTY,
T_S_SYMB,
T_ID,
ST_NAME
FROM (SELECT TOP 10 T_ID AS ID
FROM TRADE
WHERE T_CA_ID = @acct_id
ORDER BY T_DTS DESC) AS T
JOIN TRADE ON T_ID=ID
INNER LOOP JOIN TRADE_HISTORY ON TH_T_ID=T_ID,
STATUS_TYPE
WHERE ST_ID = TH_ST_ID
ORDER BY TH_DTS DESC
Tuesday, June 24, 2014 10
CA_ID CA_C_ID
CUSTOMER_ACCOUNT
HS_S_SYMB, HS_CA_ID HS_CA_ID
HOLDING_SUMMARY
T_ID T_CA_ID
TRADE
HS_S_SYMB
LT_S_SYMB
LAST_TRADE
C_ID C_TAX_ID
CUSTOMER
TH_T_ID, TH_ST_ID TH_T_ID
TRADE_HISTORY
ST_ID
STATUS
TH_ST_ID
CA_ID CA_C_IDCA_C_ID
C_IDC_ID C_TAX_ID
CUSTOMER
Step 1: Identify accessed tables and accessed attributes from SQL codeStep 2: Connect tables with key-foreign joins to form a join graphStep 3: List all possible join trees from the join graphStep 4: Use the workload trace to select the best join tree
11. Implementation
Tuesday, June 24, 2014 11
Workload
driver
Partitioning
Algorithm
P
Simulation
framework
CostCost Model
Trace
Collector
Training
trace
Database schema
Transaction SQL code
Testing
trace
Partitioning solution
SQL Server 2012
15. What we learnt from
Experiments
Tuesday, June 24, 2014 15
Schema-based approach
Statistics-based
approach
JECB:
Divide-and-conquer and join
extensions to break complex
workloads
Restrictions: all joins are key-
foreign key joins
Workload complexity
SimpleComplicated
Databasesize
SmallLarge
?
16. Conclusion and future work
We developed a promising approach
for partitioning complicated OLTP
databases.
Future work:
Address the data skew problem
Try more complicated cost models rather
than % of distributed transactions
End-to-end performance experiments
Tuesday, June 24, 2014 16
17. Phase 2 (Background)
Join path:
Sequence of key-foreign key joins connecting tables to partitioning attributes
Allow a table to be partitioned by attribute from another table
Example:
Tuesday, June 24, 2014 17
CA_ID C_ID
1 1
2 1
6 2
9 2
T_ID T_CA_ID
1 1
2 1
3 2
4 2
5 6
6 6
7 9
8 9
TRADE
CUSTOMER_ACCOUNT SELECT *
FROM TRADE,
CUSTOMER_ACCOUNT
WHERE T_CA_ID = CA_ID AND
C_ID = @cust_id
Join path from TRADE to C_ID:
T_ID T_CA_ID CA_ID CA_C_ID
TRADE CUSTOMER_ACCOUNT
Blue: source
Red: destination
Join tree:
Combination of join paths sharing the same destination (the root)