This document proposes an Augmented Time (AT) approach as an alternative to Spanner's True Time (TT) for ordering events in a distributed system. AT uses vector clocks where each node tracks the estimated clock of other nodes it has communicated with recently. This allows ordering events without specialized clock synchronization hardware. Transactions can be executed without delay and snapshot reads in the past are supported by analyzing the vector clocks to identify the latest version. The AT approach is wait-free, has higher throughput than TT, and does not require dedicated GPS/atomic clock hardware, instead using moderate uncertainty from NTP.
2. Outline
Spanner True Time (TT)
Clock uncertainty
Issues:
Special hardware responsible for maintaining tightly
synchronized clock
Transaction delays
Proposed Augmented-Time (AT)
3. Spanner & True Time
(in Review)
Spanner read and writes:
T1, T2 two transactions
If ph.T1<ph.T2, ts.T1 < ts.T2
TT API:
TT.now() TT interval [earliest, latest]
Error bound: 竜=(latest earliest)/2
Error bound < 6ms (using special purpose time
master machines)
4. Implementation
A zone masterEach zone
(a unit of admin.
deployment)
100-1000 span servers
The zone master assigns
data to span servers
Span servers serve
data to clients
Each span-server is responsible to 100-1000
tablets (a bag of mapping)
(key:string, timestamp:int64) string
5. Paxos state machines
Implemented on top of each tablet to support
replication
Leader replica lock table (ranges of keys to lock
states)
Transaction manager to support distribued
transactions
A Transaction involves
Only 1 Paxos group:
Bypass transaction manager
More than 1 Paxos group:
Perform 2 phase commit
7. Transaction Types
Read-write:
Use a 2-phase locking and 2-phase commit
Client issues a read to the leader of the appropriate group
Acquires read locks and reads the most recent data
2-phase commit
Leader assign transaction timestamp when all the locks are
acquired (but not released yet) si=TT.now().latest
Commit-wait for the uncertainty in TT:
The leader blocks clients from seeing data committed by Ti until si <
TT.now().earliest
Read-only:
Assign a timestmp s_read (s_read = TT.now.latest)
Execuate a snapshot read at s_read without locking
8. Causality in Distributed Systems
Approaches for ordering events:
1. Ignore wall clocks (assuming infinite uncertainty), and track
events using logical clocks
vector clocks: e f iff vc.e < vc.f
2. Use a dedicated time-synchronization machine, order
events by wall-clock time
Pros and Cons:
Vector clocks: Size of VC can be too large
Spanner TrueTime:
No communication required
Commit-wait for uncertainty
Snapshot read:
Reading a consistent-cut in the past,
very difficult to achieve without TT
9. Augmented Time (AT)
at.j is a vector:
at.j[k] knowledge that j has about wallclock of k
Same update rule as VC
Update at.j[j] by the wallclock of j
Synchronization assumption: at.j[k] cannot be less than at.j[j]-竜
No need to keep track of at.j[k] if no communication between j & k
within the last 竜 time
Limiting the size of at.j to only those that communicated with j in
the last 竜 time
竜 small
TT
竜 large
VC
竜
竜 small
TT
竜 small
TT
10. Augmented Time (AT)
Integrating AT into Spanner's tablets:
Backward compatible
(key:string, timestamp:TTstamp) 'string , at.j'
11. Transactions with AT
Read/Write Transactions:
Similar to TT execution, except that leader's
timestamp is AT (a vector) instead of a single TT
No delay
Snapshot Reads:
Similar to TT
Two cases for reading x with latest updated time tx
1. Latest update to x is older than 竜 trivial case
2. x has another update within tx-竜
12. Snapshot Read:
handling overlapping uncertainty intervals
Issue due to no commit-wait
Another update at t'x:
order events at tx and t'x according to AT
Identify the latest version of x
This ordering is not possible with single TT
13. Discussion
AT is wait-free higher throughout
Hidden backchannel dependencies
Can be resolved by client-notification-wait not
reducing write throughput
No need for dedicated GPS and atomic clocks
use NTP with moderate uncertainty
14. Take Home Message
TT:
Demands highly synchronized systems
(with GPS and atomic clocks)
Provides snapshot reads in the past
Has to wait for the uncertainty interval
If keeping vector clocks, size of vectors will be too large
AT:
Size of vector is reasonably small
No need to wait for the uncertainty
Higher write throughout