際際滷

際際滷Share a Scribd company logo
Lock-free transactional
  support for large-scale
     storage systems
Flavio Junqueira, Benjamin Reed, Maysam Yabandeh

                Yahoo! Research
                    June 2011
Big data


   Large data sets
     Unstructured, semi-structured data
     Critical for business logic

   Examples of such data
     Web logs, server logs, social media, etc


                          June 2011              2
Big data

+160% clicks                                                      +43% clicks
vs. one-size 鍖ts all                                             vs. editor selected




                                     Eric Baldeschwieler @IBM Big Data, May 2011

                         June 2011                                            3
Big data: Hadoop




                  Eric Baldeschwieler @IBM Big Data, May 2011

      June 2011                                           4
Background

       Database generations in batches
Input                                                              e.g.,
            Input      Hours of    Output       Hours of   Output Hbase,
Input        DB       MapReduce     DB         MapReduce    DB    HDFS


       Online concurrent updates
         Input txn
                     Output       Require transactional
         Input txn    DB          support


                                   June 2011                           5
Examples

   Mutable tables

   Various indexes: Web, news, shopping, coupons

   User and content models

   Characteristics
     Concurrency

     Losing updates is undesirable

     There are concurrent reads and they must be consistent


                              June 2011                        6
Semantics

   Read only previously committed values
                                         Txn
        w(x,v)

                   r(x) = v


                    w(x,v)       Time



                      June 2011                7
Semantics

   No concurrent writes to the same row
                                                 Txn

          w(x,v)


                   w(x,v)            Time

                                  At least one must abort
                      June 2011                         8
Snapshot Isolation

   Known in the database realm
   Con鍖icting transactions
     Write to the same element (e.g., row)
     Time range between start and commit overlap

   Ef鍖cient implementation by versioning


                        June 2011                   9
Locks?

   Previous approaches: Lock data to modify
     Convoy effect                           [Percolator, OSDI10]

     Delays of several seconds
     Higher overhead on data servers

   Our approach
     Lock-free, centralized transaction manager
     Single point of failure, potential bottleneck?

                          June 2011                            10
Transaction Status Oracle
   Single process
     Processes client inquiries about transactions
     Includes a timestamp oracle
                                           Keeps state
                  TO         TSO           about committed
                                           rows




DB1                      Client                              DB2


                             June 2011                             11
Transaction Status Oracle
   Single process
     Processes client inquiries about transactions
     Includes a timestamp oracle
                                           Keeps state
                  TO         TSO           about committed
                                           rows


                    Ts


DB1                      Client                              DB2


                             June 2011                             11
Transaction Status Oracle
   Single process
     Processes client inquiries about transactions
     Includes a timestamp oracle
                                                   Keeps state
                     TO         TSO                about committed
                                                   rows


                       Ts

        w(r2, v2, Ts(txnr))                         r(r1)
DB1           ACK             Client          v1, Ts(txnw),         DB2
                                            Ts(txnw) < Ts(txnr)

                                June 2011                                  11
Transaction Status Oracle
   Single process
     Processes client inquiries about transactions
     Includes a timestamp oracle
                                                   Keeps state
                     TO         TSO                about committed
                                                   rows


                       Ts                     Tc(txnw) < Ts(txnr)?

        w(r2, v2, Ts(txnr))                         r(r1)
DB1           ACK             Client          v1, Ts(txnw),         DB2
                                            Ts(txnw) < Ts(txnr)

                                June 2011                                  11
Transaction Status Oracle
   Single process
     Processes client inquiries about transactions
     Includes a timestamp oracle
                                                   Keeps state
                     TO         TSO                about committed
                                                   rows


                       Ts      Commit r2      Tc(txnw) < Ts(txnr)?

        w(r2, v2, Ts(txnr))                         r(r1)
DB1           ACK             Client          v1, Ts(txnw),         DB2
                                            Ts(txnw) < Ts(txnr)

                                June 2011                                  11
Transaction Status Oracle
   Single process
     Processes client inquiries about transactions
     Includes a timestamp oracle
                                                   Keeps state
                     TO         TSO                about committed
                                                   rows


                       Ts      Commit r2      Tc(txnw) < Ts(txnr)?

        w(r2, v2, Ts(txnr))                         r(r1)
DB1           ACK             Client          v1, Ts(txnw),         DB2
                                            Ts(txnw) < Ts(txnr)
         Cleanup(r2, txnr)

                                June 2011                                  11
ReTSO: Design choices

   TSO
     Keeps state of modi鍖ed rows

   In-memory state
     Highest commit timestamp of all garbage-collected rows

   Auto-GC Hash map
     Lazy garbage-collection
     Upon a hit

                           June 2011                           12
ReTSO: Increasing dependability


          Remote write-ahead log

             Inquiries                            Backup
                           ReTSO                  ReTSO
                                                         Warm or cold

                         Updates

    Writes to WAL
are synchronous but do        WAL   e.g., NFS, BookKeeper
  not block other txns
                                                  [http://zookeeper.apache.org/bookkeeper]

                                      June 2011                                          13
Preliminary results

   Coded in Java
     Except for hash map (C++ with JNI interface)

   Uses BookKeeper for WAL
   10 identical servers
     2.13 Dual Core Intel Xeon
     4GB of RAM
     1 Gigabit interfaces

                             June 2011               14
Preliminary results

   Average throughput observed
     3 clients, 1,000 concurrent transactions
     81k TPS

   Average latency
     1 client, 1 txn
     0.87 ms (with WAL)
     0.17 ms (without WAL)

                         June 2011               15
Preliminary results

   Increasing the load of the system
     1 to 16 clients                    18
                                                      ReTSO
                                         16      WAL-disabled

     Max is 72k TPS                     14
                                         12


                         Latency in ms
                                         10
                                          8
                                          6
                                          4
                                          2
                                         0
                                         20000      40000       60000   80000   100000   120000
                                                            Throughput in TPS




                        June 2011                                                                 16
Whats baking?


   Integration
     HBase
     Query engine

   Real workloads



                     June 2011   17
Summary

   Transaction management for large-scale data
    repositories
   Lock-based vs. Lock-free
     ReTSO is lock-free and dependable
     Reduced load on storage nodes
     Low latency despite faults

   Performance suf鍖cient for realistic applications

                           June 2011                   18

More Related Content

Retso hotdep-2011

  • 1. Lock-free transactional support for large-scale storage systems Flavio Junqueira, Benjamin Reed, Maysam Yabandeh Yahoo! Research June 2011
  • 2. Big data Large data sets Unstructured, semi-structured data Critical for business logic Examples of such data Web logs, server logs, social media, etc June 2011 2
  • 3. Big data +160% clicks +43% clicks vs. one-size 鍖ts all vs. editor selected Eric Baldeschwieler @IBM Big Data, May 2011 June 2011 3
  • 4. Big data: Hadoop Eric Baldeschwieler @IBM Big Data, May 2011 June 2011 4
  • 5. Background Database generations in batches Input e.g., Input Hours of Output Hours of Output Hbase, Input DB MapReduce DB MapReduce DB HDFS Online concurrent updates Input txn Output Require transactional Input txn DB support June 2011 5
  • 6. Examples Mutable tables Various indexes: Web, news, shopping, coupons User and content models Characteristics Concurrency Losing updates is undesirable There are concurrent reads and they must be consistent June 2011 6
  • 7. Semantics Read only previously committed values Txn w(x,v) r(x) = v w(x,v) Time June 2011 7
  • 8. Semantics No concurrent writes to the same row Txn w(x,v) w(x,v) Time At least one must abort June 2011 8
  • 9. Snapshot Isolation Known in the database realm Con鍖icting transactions Write to the same element (e.g., row) Time range between start and commit overlap Ef鍖cient implementation by versioning June 2011 9
  • 10. Locks? Previous approaches: Lock data to modify Convoy effect [Percolator, OSDI10] Delays of several seconds Higher overhead on data servers Our approach Lock-free, centralized transaction manager Single point of failure, potential bottleneck? June 2011 10
  • 11. Transaction Status Oracle Single process Processes client inquiries about transactions Includes a timestamp oracle Keeps state TO TSO about committed rows DB1 Client DB2 June 2011 11
  • 12. Transaction Status Oracle Single process Processes client inquiries about transactions Includes a timestamp oracle Keeps state TO TSO about committed rows Ts DB1 Client DB2 June 2011 11
  • 13. Transaction Status Oracle Single process Processes client inquiries about transactions Includes a timestamp oracle Keeps state TO TSO about committed rows Ts w(r2, v2, Ts(txnr)) r(r1) DB1 ACK Client v1, Ts(txnw), DB2 Ts(txnw) < Ts(txnr) June 2011 11
  • 14. Transaction Status Oracle Single process Processes client inquiries about transactions Includes a timestamp oracle Keeps state TO TSO about committed rows Ts Tc(txnw) < Ts(txnr)? w(r2, v2, Ts(txnr)) r(r1) DB1 ACK Client v1, Ts(txnw), DB2 Ts(txnw) < Ts(txnr) June 2011 11
  • 15. Transaction Status Oracle Single process Processes client inquiries about transactions Includes a timestamp oracle Keeps state TO TSO about committed rows Ts Commit r2 Tc(txnw) < Ts(txnr)? w(r2, v2, Ts(txnr)) r(r1) DB1 ACK Client v1, Ts(txnw), DB2 Ts(txnw) < Ts(txnr) June 2011 11
  • 16. Transaction Status Oracle Single process Processes client inquiries about transactions Includes a timestamp oracle Keeps state TO TSO about committed rows Ts Commit r2 Tc(txnw) < Ts(txnr)? w(r2, v2, Ts(txnr)) r(r1) DB1 ACK Client v1, Ts(txnw), DB2 Ts(txnw) < Ts(txnr) Cleanup(r2, txnr) June 2011 11
  • 17. ReTSO: Design choices TSO Keeps state of modi鍖ed rows In-memory state Highest commit timestamp of all garbage-collected rows Auto-GC Hash map Lazy garbage-collection Upon a hit June 2011 12
  • 18. ReTSO: Increasing dependability Remote write-ahead log Inquiries Backup ReTSO ReTSO Warm or cold Updates Writes to WAL are synchronous but do WAL e.g., NFS, BookKeeper not block other txns [http://zookeeper.apache.org/bookkeeper] June 2011 13
  • 19. Preliminary results Coded in Java Except for hash map (C++ with JNI interface) Uses BookKeeper for WAL 10 identical servers 2.13 Dual Core Intel Xeon 4GB of RAM 1 Gigabit interfaces June 2011 14
  • 20. Preliminary results Average throughput observed 3 clients, 1,000 concurrent transactions 81k TPS Average latency 1 client, 1 txn 0.87 ms (with WAL) 0.17 ms (without WAL) June 2011 15
  • 21. Preliminary results Increasing the load of the system 1 to 16 clients 18 ReTSO 16 WAL-disabled Max is 72k TPS 14 12 Latency in ms 10 8 6 4 2 0 20000 40000 60000 80000 100000 120000 Throughput in TPS June 2011 16
  • 22. Whats baking? Integration HBase Query engine Real workloads June 2011 17
  • 23. Summary Transaction management for large-scale data repositories Lock-based vs. Lock-free ReTSO is lock-free and dependable Reduced load on storage nodes Low latency despite faults Performance suf鍖cient for realistic applications June 2011 18