Active-Active Multi-Region Architecture
Considerations
David Rostcheck
Active-Active Architecture Overview
the most sophisticated cloud operations pattern
Pros:
- Provides DR and HA both at once
- Fast recovery from problems
- Highly efficient use of spend
Active-Active Architecture - Cons
- Requires focused approach and deep thought
- If implemented poorly, can decrease reliability
- Usually requires application changes
Main challenges in Active-Active
- Failover: how do you decide a region is degraded?
- Recovery: how do you decide a region is healthy?
- Data replication between regions
- Avoiding coupling between regions
Data replication
Data Replication
Do you absolutely need to do it?
Think this over carefully
for each data element.
Can you live with eventual consistency between regions?
(HINT: the answer had better be yes,
unless you are prepared to live with very slow transactions.
Not negotiable; it's physics.)
Data replication: special relativity
Regions are separated by distances significant enough that
speed-of-light delay becomes relevant and begins to affect your
application.
Believe it or not, this becomes a Special Relativity problem.
What? Like Albert Einstein's 1905 Special Relativity?
Yes.
Frames of Reference
In relativity, observers live in frames of reference.
Special relativity deals with the
special case where two frames
are not accelerating relative to
one another.
General relativity deals with the
more general case where they
are.
Distance matters
If the observers are close together, we can ignore relativity.
If they are separated far enough that the speed of light becomes
significant for measurements, we can't anymore.
80 ms
Involuntary regional time travel
Region A <--80 ms--> Region B
To ask a question and get a response (e.g. "what's the current
batch number?") takes 160 ms, and the answer is 80 ms old.
Region A always sees a view
of Region B that is 80 ms in the
past (and vice-versa).
Synchronous Replication (is slooooowwww)
If you're starting to think "This will be a problem for my
application," you're right.
How do you get what a value is now?
You can't. Since you can only see into the other region's
past, the best you can do is send it a message to freeze any
local updates until you tell it to resume and send you the
current value, then tell it when you're done.
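The "freeze, read, resume" protocol above can be sketched as follows. This is a minimal illustration, not a real API; the class and method names are hypothetical, and the latency accounting simply tallies 80 ms per one-way message to show why the protocol is so slow.

```python
# Sketch of the "freeze, read, resume" cross-region read protocol.
INTER_REGION_MS = 80  # assumed one-way latency between the two regions

class RemoteRegion:
    """Stand-in for the far region's coordination endpoint (hypothetical)."""
    def __init__(self, value):
        self._value = value
        self.frozen = False

    def freeze_updates(self):
        self.frozen = True

    def current_value(self):
        return self._value

    def resume_updates(self):
        self.frozen = False

def read_current_value(remote: RemoteRegion):
    """Freeze the remote region, read its value, resume, and tally the latency."""
    elapsed_ms = 0.0
    remote.freeze_updates()
    elapsed_ms += 2 * INTER_REGION_MS   # freeze request + acknowledgement
    value = remote.current_value()
    elapsed_ms += 2 * INTER_REGION_MS   # read request + response
    remote.resume_updates()
    elapsed_ms += INTER_REGION_MS       # one-way resume notification
    return value, elapsed_ms

value, cost_ms = read_current_value(RemoteRegion(value=42))
print(value, cost_ms)  # 42 400.0 -- and updates were blocked the whole time
```

Five one-way hops at 80 ms each is 400 ms per read, during which the remote region cannot accept local updates. That blocking is what makes synchronous cross-region coordination snowball.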
Hold up
If synchronous replication across regions is
starting to sound like a bad idea
that can significantly delay transactions in both regions and
could rapidly snowball into a mess,
you're getting the idea.
The best solution is to not do it
Break the dependency
We need to go back to the data and re-organize to eliminate
the need to coordinate between regions
Let's explore how:
Example - Credit card processing application
Single region
us-east-1
API Gateway ECS Amazon DynamoDB
Payment
Processor
Store network
Client
Stores send in credit
card transactions.
Container tasks
communicate with an
external payment
processor and
coordinate the
current batch ID for a
store via a
DynamoDB table
Example - Credit card processing application
Multi-region active-active
us-east-1
API Gateway ECS Amazon DynamoDB
Payment
Processor
Stores distribute
transactions between
regions.
DynamoDB global
tables can sync the
data between the
regions, but only
eventually, due to the
speed-of-light delay;
no longer sufficient for
coordinating the batch ID
across all the ECS
tasks
us-west-2
API Gateway ECS Amazon DynamoDB
Store network
Client
80 ms
delay
Solution #1: separate payment processors
us-east-1
API Gateway ECS Amazon DynamoDB
Payment
Processor #1
Regions talk to
separate instances of
the payment
processor (or
different payment
processors
entirely); the need to
coordinate across
regions is eliminated
us-west-2
API Gateway ECS Amazon DynamoDB
Store network
Client Payment
Processor #2
Solution #2: separate batch sequences
Multi-region active-active
us-east-1
API Gateway ECS Amazon DynamoDB
Payment
Processor
East and West
regions use even and
odd batch
sequences, so they
can never conflict.
A store can have
multiple batches
open at one time,
one in East and one
in West
us-west-2
API Gateway ECS Amazon DynamoDB
Store network
Client
Batch 2, 4, 6
Batch 1, 3, 5
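The even/odd scheme generalizes to interleaved sequences: each region keeps its own counter and adds a fixed per-region offset, so the regions can never mint the same batch ID without ever coordinating. A minimal sketch (the class and region-offset table are illustrative, not the deck's actual implementation; in practice the counter would be an atomic counter in the region's own DynamoDB table):

```python
# Sketch: region-disjoint batch IDs via interleaved sequences.
# East mints 1, 3, 5, ...; West mints 2, 4, 6, ... No cross-region
# coordination is needed because the sequences cannot collide.
REGION_OFFSETS = {"us-east-1": 1, "us-west-2": 2}  # hypothetical mapping
NUM_REGIONS = 2

class BatchIdAllocator:
    def __init__(self, region: str):
        self.offset = REGION_OFFSETS[region]
        self.counter = 0  # in practice: a per-region atomic counter

    def next_batch_id(self) -> int:
        batch_id = self.counter * NUM_REGIONS + self.offset
        self.counter += 1
        return batch_id

east = BatchIdAllocator("us-east-1")
west = BatchIdAllocator("us-west-2")
print([east.next_batch_id() for _ in range(3)])  # [1, 3, 5]
print([west.next_batch_id() for _ in range(3)])  # [2, 4, 6]
```

The trade-off, as the slide notes, is that a store can have two batches open at once, one per region, which downstream reconciliation must tolerate.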
Solution #3: geographic preference
us-east-1
API Gateway ECS Amazon DynamoDB
Payment
Processor
Store network
Client
Stores are assigned to
groups by
geography. They normally
send to their primary
region, but can
switch to the secondary
region if needed.
Pros: low store-to-
region latency.
Cons: load
distribution is less
even; edge cases
on failover/failback
us-west-2
API Gateway ECS Amazon DynamoDB
Client
West stores
East stores
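The geographic-preference routing can be sketched as an ordered preference list per store group: try the primary region, fall back to the secondary if it is unhealthy. The group names and region lists below are illustrative assumptions, not part of the original design:

```python
# Sketch: geographic preference with failover to a secondary region.
REGION_PREFS = {
    "east": ["us-east-1", "us-west-2"],  # primary first, then secondary
    "west": ["us-west-2", "us-east-1"],
}

def pick_region(store_group: str, healthy_regions: set) -> str:
    """Return the first healthy region in the store group's preference list."""
    for region in REGION_PREFS[store_group]:
        if region in healthy_regions:
            return region
    raise RuntimeError("no healthy region available")

print(pick_region("east", {"us-east-1", "us-west-2"}))  # us-east-1
print(pick_region("east", {"us-west-2"}))  # primary down: falls back to us-west-2
```

The failover/failback edge cases the slide mentions live in how `healthy_regions` is computed and how quickly stores are moved back, which is exactly the recovery problem discussed later.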
Coupling
Coupling
Regions are intended to be isolated
Dependencies between regions (coupling) cause fragility
Avoiding coupling requires a change of mindset
Submit changes to multiple regions independently?
Is consistency required?
Should users be bound to a primary and secondary region?
Coupling: example problem
Separating the regions introduces an issue: coordinating the
batch ID for a store.
So we introduce a new service to allocate batch IDs.
But both regions need to use it.
Now they are coupled. We have introduced a single point of
failure that can break both regions.
Coupling: example problem
us-east-1
API Gateway ECS Amazon DynamoDB
Payment
Processor #1
If the batch ID
service in us-east-1
fails, both regions will
be unable to process
transactions.
Coupling has
introduced fragility.
us-west-2
API Gateway ECS
Store network
Client Payment
Processor #2
ID service
Struggling with the batch ID service
We can use a second batch ID service in the West region to
break the coupling
But remember: it can't sync to its copy in East any faster than
DynamoDB could (relativity again).
All we did was move the problem.
We can make it active-passive
Batch ID service in each region, active/passive
us-east-1
API Gateway ECS Amazon DynamoDB
Payment
Processor #1
Active-passive
pattern restores
resiliency, but adds
more complexity:
more things to break
us-west-2
API Gateway ECS
Store network
Client Payment
Processor #2
ID service
(primary)
ID service
(secondary)
The best solution is to remove the need for strong
consistency across regions
Live with it?
It's also possible to just accept the transaction delay of
strong consistency
(minimum 2x inter-region latency)
Be careful when introducing new regions as the delay may
become unacceptable
us-east-1 -> us-east-2: 15ms
us-east-1 -> ap-southeast-1 (Singapore): 200ms
Reliability principles
Effect of adding components
Overall reliability
is the product of the
reliabilities of all
components along the
transaction path
(including
network links)
us-east-1
API Gateway ECS Amazon DynamoDB
Payment
Processor
Store network
Client
1
2
3
4
5
6
7
Reliability = R1 * R2 * R3 * ... * R7
= .999 * .998 * .9999 * ...
Longer paths are less reliable
Example: if all components are 99.99% reliable:
# Components Reliability
7 99.93%
10 99.9%
20 99.8%
100 99%
So adding components to the transaction
path decreases reliability
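The series-reliability table above is easy to reproduce: reliability of a path of n components at reliability r is just r raised to the n. A quick sketch:

```python
# Series reliability: components on the transaction path multiply,
# so every component added reduces overall reliability.
def series_reliability(r: float, n: int) -> float:
    """Reliability of n components in series, each with reliability r."""
    return r ** n

for n in (7, 10, 20, 100):
    print(f"{n:>3} components at 99.99% each -> {series_reliability(0.9999, n):.2%}")
```

Note how the decay compounds: at 100 components even four-nines parts only deliver about 99% end to end.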
Parallel paths increase reliability
All parallel paths have to fail
for the overall element to fail,
so that element's reliability is:
Reliability (for element) = 1 - Probability that paths 1 and 2 both fail
= 1 - (1 - R1) * (1 - R2)
Ex: for 2 nodes 99% reliable each:
Element reliability = 1 - (.01)*(.01) = 99.99%
4 nodes with 99% reliability each gives 8 9s of reliability (99.999999%)
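The parallel-path formula generalizes to any number of paths: the element fails only if every path fails, so multiply the failure probabilities and subtract from 1. A short sketch:

```python
# Parallel reliability: the element fails only if ALL paths fail.
def parallel_reliability(path_reliabilities: list) -> float:
    """1 - product of each path's failure probability."""
    p_all_fail = 1.0
    for r in path_reliabilities:
        p_all_fail *= (1.0 - r)
    return 1.0 - p_all_fail

print(f"{parallel_reliability([0.99, 0.99]):.4%}")   # two 99% paths
print(f"{parallel_reliability([0.99] * 4):.8%}")     # four 99% paths: 8 nines
```

This is why adding parallel paths is the one form of added complexity that buys reliability rather than spending it.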
Heterogeneous parallel paths
For survivability, you can introduce multiple disparate
technologies as parallel paths
Ex. move data through both Kinesis and SQS
But doing this increases complexity and possibly cost
So to increase reliability
Limit path length (shorter is better)
Increase parallel paths
Watch out for overall complexity
(more lines of code
or operating cases == more things to fail)
Simple == strong
Failover and recovery
How do you know when to fail over?
It's not obvious.
Most regional failures are partial failures,
where most of the services are working but a few critical
ones are impaired.
Failure modes
may not be simple
(such as transactions that work but are very slow, triggering
app retry storms)
Manual failover
You always want to build and document a manual failover
mechanism
You will need it for DR testing, if nothing else
(Even in an HA setup you should still regularly confirm you
can disable a region completely and switch back)
Auto-failover by business metric
A best-practice strategy is to define a business metric for
success and measure it (ex. via a CloudWatch metric), then
fail over when it decreases.
Ex. Fuel sales/minute
This lets you fail over when something is broken
even if you don't know what
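A business-metric trigger can be sketched as: keep a rolling baseline of the metric (e.g. fuel sales/minute) and fail over when the current value collapses below a fraction of it. In practice this would be a CloudWatch metric and alarm; the class below is a hypothetical stand-in, and the window and threshold values are illustrative assumptions:

```python
# Sketch: fail over when the business success metric drops far below
# its recent baseline, regardless of which component actually broke.
from collections import deque

class BusinessMetricMonitor:
    def __init__(self, baseline_window: int = 60, drop_threshold: float = 0.5):
        self.samples = deque(maxlen=baseline_window)  # recent per-minute values
        self.drop_threshold = drop_threshold          # e.g. alarm below 50% of baseline

    def record(self, value: float) -> None:
        self.samples.append(value)

    def should_fail_over(self, current: float) -> bool:
        if not self.samples:
            return False  # no baseline yet
        baseline = sum(self.samples) / len(self.samples)
        return current < baseline * self.drop_threshold

monitor = BusinessMetricMonitor()
for v in [100, 105, 98, 102]:   # normal sales/minute
    monitor.record(v)
print(monitor.should_fail_over(95))  # False: within normal range
print(monitor.should_fail_over(20))  # True: the metric collapsed
```

The point of measuring the business outcome rather than component health is exactly what the slide says: it catches partial, weird failure modes that no per-service health check anticipated.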
How do you recover back to the failed region?
You need to test if it is healthy before sending traffic back, but:
Remember: it doesn't have any live transactions in it any more
2 solutions:
- Try sending some customer traffic in and see what
happens
- Send in synthetic transactions
Synthetic transactions
Test transactions carry fake data and a flag indicating
they are tests.
The app must filter them from certain steps and from reporting
us-east-1
API Gateway ECS Amazon DynamoDB
Payment
Processor
Store network
Client
Allow safely testing health in production
Useful for canaries, manual health tests
Skip sending test
transactions to processor
IsTest = 1
Filter test transactions
from reconciliation reports
(requires app changes)
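The app changes amount to branching on the flag at the two sensitive points: never send synthetic traffic to the payment processor, and filter it out of reconciliation. A minimal sketch, assuming an `IsTest` field as in the diagram (the handler and list names are hypothetical):

```python
# Sketch: synthetic transactions exercise the real path end to end,
# but are skipped at the external processor and in reports.
PROCESSED = []  # stand-in for the external payment processor

def send_to_processor(txn: dict) -> None:
    PROCESSED.append(txn)

def handle_transaction(txn: dict) -> str:
    is_test = txn.get("IsTest", 0) == 1
    # validation and the DynamoDB write would happen here for both kinds,
    # so synthetic traffic genuinely tests the region's health
    if not is_test:
        send_to_processor(txn)  # never charge a fake card
    return "ok (synthetic)" if is_test else "ok"

def reconciliation_rows(txns: list) -> list:
    """Filter synthetic transactions out of reconciliation reports."""
    return [t for t in txns if t.get("IsTest", 0) != 1]

real = {"store": 7, "amount": 19.99}
fake = {"store": 7, "amount": 0.01, "IsTest": 1}
print(handle_transaction(real))  # ok
print(handle_transaction(fake))  # ok (synthetic)
print(len(PROCESSED))            # only the real transaction reached the processor
```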
Manual recovery
As with failover, in active-active configurations
you always want a manual recovery option to force traffic
back to a failed region,
even if you later automate recovery
Since regional impairment is relatively rare and recovery is not as
time-critical as failover, you may stay with only manual recovery,
but be sure you have a way of assessing the region's real
health before failing back.
In summation
Dont forget:
Not everything needs to be active-active
Always consider:
- Recovery Point Objective (RPO): how much data can you
lose during failover? 5 seconds? 5 minutes? 5 hours?
- Recovery Time Objective (RTO): how long will it take to
recover to that point?
- Real availability needs
Everything has a cost.
Not everything justifies the cost or effort of reliability
(ex. internal back-office workloads might be less important than
customer-facing workloads).
It's easy to do active-active poorly
and hard to do it right
So take it seriously
A simple traditional DR strategy such as backup/restore or pilot light
may be better than a weak active-active implementation
that introduces complex issues during a crisis
Most reliability happens within the region
- AZs provide sufficient reliability for most workloads
- Multi-region active-active defends against managed
service failures
And even there you can usually survive if critical paths have a second
heterogeneous technology channel (multi-path)
You still have to test.
Force a failover and fail back to make sure things work
Remember to think through:
- Failover: how do you decide a region is degraded?
- Recovery: how do you decide a region is healthy?
- Data replication between regions
- Avoiding coupling between regions
Complexity is the enemy
The simpler, the better
Keep transaction paths short
Parallel paths increase reliability
And remember
we are here to help
Resources:
- "ARC 319: How to Design a Multi-Region Active-Active
Architecture" session at re:Invent 2017
- AWS Blog post: "Architecting Multi-Region SaaS Solutions
on AWS"
- AWS Solutions Active-Passive model
- This Is My Architecture session: "SimilarWeb: Route 53
Calculated Health Checks for an Active/Active Multi-
Region Architecture"
Appendix
Is the speed of light really a factor between regions?
It's a great question.
Earth circumference = 40,000 km = 4 x 10^7 m
Speed of light in vacuum = 3 x 10^8 m/s
Theoretical best time for a light pulse to circle the Earth
= (4 x 10^7 m) / (3 x 10^8 m/s) ≈ 133 ms
We have to send it through glass (index of refraction ~1.5),
increasing the time by 1.5x to 200 ms (light moves slower through glass)
So the theoretical best time to go halfway around the world through fiber
is 100 ms
In practice, electro-optical switching (the signal must be amplified
periodically) and zig-zag routing add up to roughly 4x that:
NYC -> LA = 4,000 km -> 20 ms ideal through fiber, ~80 ms in practice
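The back-of-envelope numbers above can be computed directly. The 4x "practical overhead" factor is the rough empirical fudge from the text, not a physical constant:

```python
# Fiber propagation delay: distance / (c / refractive index),
# optionally scaled by a rough real-world overhead factor.
C_VACUUM = 3e8           # m/s, speed of light in vacuum
FIBER_INDEX = 1.5        # light travels ~1.5x slower in glass
PRACTICAL_OVERHEAD = 4   # switching + zig-zag routing (rough factor)

def fiber_delay_ms(distance_km: float, practical: bool = False) -> float:
    seconds = (distance_km * 1000) / (C_VACUUM / FIBER_INDEX)
    return seconds * 1000 * (PRACTICAL_OVERHEAD if practical else 1)

print(round(fiber_delay_ms(20_000)))       # halfway around Earth: 100 ms
print(round(fiber_delay_ms(4_000)))        # NYC -> LA ideal: 20 ms
print(round(fiber_delay_ms(4_000, True)))  # NYC -> LA in practice: 80 ms
```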
Well... can we get better?
If you get rid of the fiber and go at the speed of light in vacuum,
you reclaim the 1.5x factor.
But then you need something low in the atmosphere to
relay the signal to get around the curve of the planet.
Hmm... Stay tuned.
(But learn to architect for speed-of-light delay in multi-region
architectures anyway.)