際際滷

際際滷Share a Scribd company logo
Beyond True Time
Authors:
M. Demirbas and S. Kulkarni
Presenter: Vahid Mirjalili
http://vahidmirjalili.com
Outline
 Spanner True Time (TT)
 Clock uncertainty
 Issues:
 Special hardware responsible for maintaining tightly
synchronized clock
 Transaction delays
 Proposed Augmented-Time (AT)
Spanner & True Time
(in Review)
 Spanner read and writes:
 T1, T2 two transactions
 If ph.T1<ph.T2, ts.T1 < ts.T2
 TT API:
 TT.now()  TT interval [earliest, latest]
 Error bound: 竜=(latest  earliest)/2
 Error bound < 6ms (using special purpose time
master machines)
Implementation
A zone masterEach zone
(a unit of admin.
deployment)
100-1000 span servers
The zone master assigns
data to span servers
Span servers serve
data to clients
 Each span-server is responsible to 100-1000
tablets (a bag of mapping)
 (key:string, timestamp:int64)  string
Paxos state machines
 Implemented on top of each tablet to support
replication
 Leader replica  lock table (ranges of keys to lock
states)
 Transaction manager  to support distribued
transactions
 A Transaction involves
Only 1 Paxos group:
Bypass transaction manager
More than 1 Paxos group:
Perform 2 phase commit
Types of Transactions
 Read-write
 Read-only
 Snapshot reads
Transaction Types
 Read-write:
 Use a 2-phase locking and 2-phase commit
 Client issues a read to the leader of the appropriate group
 Acquires read locks and reads the most recent data
 2-phase commit
 Leader assign transaction timestamp when all the locks are
acquired (but not released yet) si=TT.now().latest
 Commit-wait for the uncertainty in TT:
 The leader blocks clients from seeing data committed by Ti until si <
TT.now().earliest
 Read-only:
 Assign a timestmp s_read (s_read = TT.now.latest)
 Execuate a snapshot read at s_read without locking
Causality in Distributed Systems
 Approaches for ordering events:
1. Ignore wall clocks (assuming infinite uncertainty), and track
events using logical clocks
vector clocks: e  f iff vc.e < vc.f
2. Use a dedicated time-synchronization machine, order
events by wall-clock time
 Pros and Cons:
 Vector clocks: Size of VC can be too large
 Spanner TrueTime:
 No communication required
 Commit-wait for uncertainty
Snapshot read:
Reading a consistent-cut in the past,
very difficult to achieve without TT
Augmented Time (AT)
 at.j is a vector:
 at.j[k] knowledge that j has about wallclock of k
 Same update rule as VC
 Update at.j[j] by the wallclock of j
 Synchronization assumption: at.j[k] cannot be less than at.j[j]-竜
 No need to keep track of at.j[k] if no communication between j & k
within the last 竜 time
 Limiting the size of at.j to only those that communicated with j in
the last 竜 time
竜 small
TT
竜 large
VC
竜
竜 small
TT
竜 small
TT
Augmented Time (AT)
 Integrating AT into Spanner's tablets:
 Backward compatible
 (key:string, timestamp:TTstamp)  'string , at.j'
Transactions with AT
 Read/Write Transactions:
 Similar to TT execution, except that leader's
timestamp is AT (a vector) instead of a single TT
 No delay
 Snapshot Reads:
 Similar to TT
 Two cases for reading x with latest updated time tx
1. Latest update to x is older than 竜  trivial case
2. x has another update within tx-竜
Snapshot Read:
handling overlapping uncertainty intervals
 Issue due to no commit-wait
 Another update at t'x:
 order events at tx and t'x according to AT
 Identify the latest version of x
 This ordering is not possible with single TT
Discussion
 AT is wait-free  higher throughout
 Hidden backchannel dependencies
 Can be resolved by client-notification-wait  not
reducing write throughput
 No need for dedicated GPS and atomic clocks
 use NTP with moderate uncertainty
Take Home Message
 TT:
 Demands highly synchronized systems
(with GPS and atomic clocks)
 Provides snapshot reads in the past
 Has to wait for the uncertainty interval
 If keeping vector clocks, size of vectors will be too large
 AT:
 Size of vector is reasonably small
 No need to wait for the uncertainty
 Higher write throughout
Questions?

More Related Content

Beyond TrueTime

  • 1. Beyond True Time Authors: M. Demirbas and S. Kulkarni Presenter: Vahid Mirjalili http://vahidmirjalili.com
  • 2. Outline Spanner True Time (TT) Clock uncertainty Issues: Special hardware responsible for maintaining tightly synchronized clock Transaction delays Proposed Augmented-Time (AT)
  • 3. Spanner & True Time (in Review) Spanner read and writes: T1, T2 two transactions If ph.T1<ph.T2, ts.T1 < ts.T2 TT API: TT.now() TT interval [earliest, latest] Error bound: 竜=(latest earliest)/2 Error bound < 6ms (using special purpose time master machines)
  • 4. Implementation A zone masterEach zone (a unit of admin. deployment) 100-1000 span servers The zone master assigns data to span servers Span servers serve data to clients Each span-server is responsible to 100-1000 tablets (a bag of mapping) (key:string, timestamp:int64) string
  • 5. Paxos state machines Implemented on top of each tablet to support replication Leader replica lock table (ranges of keys to lock states) Transaction manager to support distribued transactions A Transaction involves Only 1 Paxos group: Bypass transaction manager More than 1 Paxos group: Perform 2 phase commit
  • 6. Types of Transactions Read-write Read-only Snapshot reads
  • 7. Transaction Types Read-write: Use a 2-phase locking and 2-phase commit Client issues a read to the leader of the appropriate group Acquires read locks and reads the most recent data 2-phase commit Leader assign transaction timestamp when all the locks are acquired (but not released yet) si=TT.now().latest Commit-wait for the uncertainty in TT: The leader blocks clients from seeing data committed by Ti until si < TT.now().earliest Read-only: Assign a timestmp s_read (s_read = TT.now.latest) Execuate a snapshot read at s_read without locking
  • 8. Causality in Distributed Systems Approaches for ordering events: 1. Ignore wall clocks (assuming infinite uncertainty), and track events using logical clocks vector clocks: e f iff vc.e < vc.f 2. Use a dedicated time-synchronization machine, order events by wall-clock time Pros and Cons: Vector clocks: Size of VC can be too large Spanner TrueTime: No communication required Commit-wait for uncertainty Snapshot read: Reading a consistent-cut in the past, very difficult to achieve without TT
  • 9. Augmented Time (AT) at.j is a vector: at.j[k] knowledge that j has about wallclock of k Same update rule as VC Update at.j[j] by the wallclock of j Synchronization assumption: at.j[k] cannot be less than at.j[j]-竜 No need to keep track of at.j[k] if no communication between j & k within the last 竜 time Limiting the size of at.j to only those that communicated with j in the last 竜 time 竜 small TT 竜 large VC 竜 竜 small TT 竜 small TT
  • 10. Augmented Time (AT) Integrating AT into Spanner's tablets: Backward compatible (key:string, timestamp:TTstamp) 'string , at.j'
  • 11. Transactions with AT Read/Write Transactions: Similar to TT execution, except that leader's timestamp is AT (a vector) instead of a single TT No delay Snapshot Reads: Similar to TT Two cases for reading x with latest updated time tx 1. Latest update to x is older than 竜 trivial case 2. x has another update within tx-竜
  • 12. Snapshot Read: handling overlapping uncertainty intervals Issue due to no commit-wait Another update at t'x: order events at tx and t'x according to AT Identify the latest version of x This ordering is not possible with single TT
  • 13. Discussion AT is wait-free higher throughout Hidden backchannel dependencies Can be resolved by client-notification-wait not reducing write throughput No need for dedicated GPS and atomic clocks use NTP with moderate uncertainty
  • 14. Take Home Message TT: Demands highly synchronized systems (with GPS and atomic clocks) Provides snapshot reads in the past Has to wait for the uncertainty interval If keeping vector clocks, size of vectors will be too large AT: Size of vector is reasonably small No need to wait for the uncertainty Higher write throughout