2. Active-Active Architecture Overview
The most sophisticated cloud operations pattern
Pros:
- Provides both DR and HA at once
- Fast recovery from problems
- Highly efficient use of spend
3. Active-Active Architecture - Cons
- Requires a focused approach and deep thought
- If implemented poorly, can decrease reliability
- Usually requires application changes
4. Main challenges in Active-Active
- Failover: How do you decide a region is degraded?
- Recovery: How do you decide a region is healthy?
- Data replication between regions
- Avoiding coupling between regions
6. Data Replication
Do you absolutely need to do it?
Think this over carefully for each data element.
Can you live with eventual consistency between regions?
(HINT: the answer had better be yes)
Unless you are prepared to live with very slow transactions.
Not negotiable. It's physics.
7. Data replication: Special relativity
Regions are separated by distances large enough that the
speed-of-light delay affects your application.
Believe it or not, this becomes a Special Relativity problem.
What? Like Albert Einstein's 1905 Special Relativity?
Yes.
8. Frames of Reference
In relativity, observers live in frames of reference.
Special relativity deals with the special case where two frames
are not accelerating relative to one another.
General relativity deals with the more general case where they
are.
9. Distance matters
If the observers are close together, we can ignore relativity.
If they are separated far enough that the speed of light becomes
significant for measurements, we can't anymore.
[Diagram: two observers separated by an 80 ms delay]
10. Involuntary regional time travel
[Diagram: Region A and Region B, with an 80 ms delay in each direction]
To ask a question and get a response (ex. "what's the current
batch number?") takes 160 ms, and the answer is 80 ms old.
Region A always sees a view of Region B that is 80 ms in the
past (and vice versa).
11. Synchronous Replication (is slooooowwww)
If you're starting to think "This will be a problem for my
application," you're right.
How do you get what a value is now?
You can't. Since you can only see into the other region's
past, the best you can do is send it a message to freeze any
local updates until you tell it to resume and send you the
current value, then tell it when you're done.
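The timing behind the last two slides can be worked through in a small sketch. It assumes the 80 ms one-way delay from the diagram. The function names and the `work_ms` parameter are hypothetical, chosen only for illustration.

```python
ONE_WAY_MS = 80  # one-way inter-region delay used on the slides


def query_remote(one_way_ms: float) -> tuple[float, float]:
    """Return (round-trip time, age of the answer on arrival) in ms."""
    round_trip = 2 * one_way_ms  # request out + response back
    staleness = one_way_ms       # the answer aged while travelling back
    return round_trip, staleness


def freeze_read_resume(one_way_ms: float, work_ms: float) -> dict[str, float]:
    """Timeline for: freeze the remote region, read its value, work, resume it."""
    value_arrives = 2 * one_way_ms  # freeze request out + current value back
    resume_arrives = value_arrives + work_ms + one_way_ms
    return {
        "requester_wait_ms": value_arrives,
        # The remote region is frozen from the moment the freeze request
        # arrives (one_way_ms) until the resume message arrives.
        "remote_frozen_ms": resume_arrives - one_way_ms,
    }


if __name__ == "__main__":
    print(query_remote(ONE_WAY_MS))
    print(freeze_read_resume(ONE_WAY_MS, work_ms=5))
```

Even with almost no local work, every such transaction stalls both regions for roughly two one-way delays.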
12. Hold up
If synchronous replication across regions is
starting to sound like a bad idea
that can significantly delay transactions in both regions and
could rapidly snowball into a mess,
you're getting the idea.
The best solution is to not do it.
13. Break the dependency
We need to go back to the data and re-organize to eliminate
the need to coordinate between regions
Let's explore how:
14. Example - Credit card processing application
Single region
[Diagram: us-east-1 with API Gateway, ECS, and Amazon DynamoDB; an external Payment Processor; a Store network client]
Stores send in credit card transactions.
Container tasks communicate with an external payment processor and
coordinate the current batch ID for a store via a DynamoDB table.
15. Example - Credit card processing application
Multi-region active-active
[Diagram: us-east-1 and us-west-2, each with API Gateway, ECS, and Amazon DynamoDB; one Payment Processor; a Store network client; 80 ms delay between regions]
Stores distribute transactions between regions.
DynamoDB global tables can sync the data between the regions, but only
eventually, due to the speed-of-light delay. That is no longer sufficient
for coordinating the batch ID across all the ECS tasks.
16. Solution #1: separate payment processors
[Diagram: us-east-1 and us-west-2, each with API Gateway, ECS, and Amazon DynamoDB; Payment Processor #1 and Payment Processor #2; a Store network client]
Regions talk to separate instances of the payment processor (or
different payment processors completely), so the need to
coordinate across regions is eliminated.
17. Solution #2: separate batch sequences
Multi-region active-active
[Diagram: us-east-1 and us-west-2, each with API Gateway, ECS, and Amazon DynamoDB; one Payment Processor; a Store network client; batch labels "Batch 2, 4, 6" and "Batch 1, 3, 5"]
East and West regions use even and odd batch sequences, so they can
never conflict.
A store can have multiple batches open at one time, one in East and one
in West.
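A minimal sketch of the even/odd idea, assuming East takes the even IDs and West the odd ones (the slide does not say which region gets which parity). The class name and the in-memory dictionary are hypothetical. In the architecture shown, the last-used ID would presumably live in each region's own DynamoDB table.

```python
from dataclasses import dataclass, field

# Assumption for illustration: East allocates even IDs, West allocates odd IDs.
REGION_PARITY = {"us-east-1": 0, "us-west-2": 1}


@dataclass
class RegionalBatchAllocator:
    """Hands out batch IDs per store using only region-local state."""

    region: str
    _last: dict[str, int] = field(default_factory=dict)

    def next_batch_id(self, store_id: str) -> int:
        parity = REGION_PARITY[self.region]
        last = self._last.get(store_id)
        if last is None:
            # First ID with this region's parity (the even region starts at 2).
            nxt = 2 if parity == 0 else 1
        else:
            # Stepping by 2 keeps the parity, so regions never collide.
            nxt = last + 2
        self._last[store_id] = nxt
        return nxt


if __name__ == "__main__":
    east = RegionalBatchAllocator("us-east-1")
    west = RegionalBatchAllocator("us-west-2")
    print([east.next_batch_id("store-42") for _ in range(3)])
    print([west.next_batch_id("store-42") for _ in range(3)])
```

Neither allocator ever needs to read the other region's state, which is the point of the pattern.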
18. Solution #3: geographic preference
[Diagram: us-east-1 and us-west-2, each with API Gateway, ECS, and Amazon DynamoDB; one Payment Processor; Store network clients labeled "East stores" and "West stores"]
Stores are assigned to groups by geography. They normally send to their
primary region, but can switch to the secondary region if needed.
Pros: low store-to-region latency
Cons: load distribution not as perfect, edge cases on failover/back
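The routing rule for geographic preference could look roughly like the sketch below. The group names, the `REGION_PREFERENCE` table, and the `pick_region` function are hypothetical. How the set of healthy regions is determined is left out, since that is the failover question covered later.

```python
# Hypothetical mapping of store groups to (primary, secondary) regions.
REGION_PREFERENCE = {
    "east": ("us-east-1", "us-west-2"),
    "west": ("us-west-2", "us-east-1"),
}


def pick_region(store_group: str, healthy: set[str]) -> str:
    """Return the group's primary region if healthy, otherwise its secondary."""
    primary, secondary = REGION_PREFERENCE[store_group]
    if primary in healthy:
        return primary
    if secondary in healthy:
        return secondary
    raise RuntimeError(f"no healthy region for store group {store_group!r}")


if __name__ == "__main__":
    print(pick_region("east", {"us-east-1", "us-west-2"}))  # normal operation
    print(pick_region("east", {"us-west-2"}))               # East impaired
```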
20. Coupling
Regions are intended to be isolated
Dependencies between regions (coupling) cause fragility
Avoiding coupling requires a change of mindset
Submit changes to multiple regions independently?
Is consistency required?
Should users be bound to a primary and secondary region?
21. Coupling: example problem
Separating regions introduces an issue: coordinating the
batch ID for a store.
So we introduce a new service to allocate batch IDs.
But both regions need to use it.
Now they are coupled. We have introduced a single point of
failure that can break both regions.
22. Coupling: example problem
[Diagram: us-east-1 with API Gateway, ECS, Amazon DynamoDB, and the ID service; us-west-2 with API Gateway and ECS; Payment Processor #1 and Payment Processor #2; a Store network client]
If the batch ID service in us-east-1 fails, both regions will
be unable to process.
Coupling has introduced fragility.
23. Struggling with the batch ID service
We can use a second batch ID service in the West region to
break the coupling.
But remember, it can't sync to its copy in East any faster than
DynamoDB could (relativity again).
All we did was move the problem.
We can make it active-passive.
24. Batch ID service in each region, active/passive
[Diagram: us-east-1 with API Gateway, ECS, and Amazon DynamoDB; us-west-2 with API Gateway and ECS; Payment Processor #1 and Payment Processor #2; a Store network client; ID service (primary) and ID service (secondary)]
The active-passive pattern restores resiliency but adds
more complexity: more things to break.
25. The best solution is to remove the need for strong
consistency across regions
26. Live with it?
It's also possible to just accept the transaction delay of
strong consistency
(minimum 2x inter-region latency).
Be careful when introducing new regions, as the delay may
become unacceptable:
us-east-1 -> us-east-2: 15 ms
us-east-1 -> ap-southeast-1 (Singapore): 200 ms
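A rough way to check a candidate region against a latency budget, assuming the two figures above are the inter-region latencies that get doubled. The 100 ms budget and the function names are hypothetical.

```python
# Inter-region latencies quoted on the slide, in ms (measured from us-east-1).
LATENCY_FROM_US_EAST_1_MS = {"us-east-2": 15, "ap-southeast-1": 200}


def min_sync_delay_ms(latency_ms: float) -> float:
    """Lower bound on the added delay of a strongly consistent transaction."""
    return 2 * latency_ms


def regions_within_budget(budget_ms: float) -> list[str]:
    """Regions whose minimum synchronous delay fits the given budget."""
    return [
        region
        for region, latency in LATENCY_FROM_US_EAST_1_MS.items()
        if min_sync_delay_ms(latency) <= budget_ms
    ]


if __name__ == "__main__":
    for region, latency in LATENCY_FROM_US_EAST_1_MS.items():
        print(region, min_sync_delay_ms(latency), "ms minimum")
    print(regions_within_budget(budget_ms=100))  # hypothetical 100 ms budget
```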
28. Effect of adding components
Overall reliability is the reliability of all components along the
transaction path (including network links), multiplied.
[Diagram: us-east-1 with API Gateway, ECS, and Amazon DynamoDB; Payment Processor; Store network client; components and links numbered 1 through 7]
Reliability = R1 * R2 * R3 * ... * R7
            = .999 * .998 * .9999 * ...
29. Longer paths are less reliable
Example: if all components are 99.99% reliable:
# Components    Reliability
7               99.93%
10              99.9%
20              99.8%
100             99%
So adding components to the transaction
path decreases reliability.
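The multiplication rule is easy to check numerically. This is a small sketch. The function name `series_reliability` is made up for illustration, and it assumes component failures are independent.

```python
import math


def series_reliability(reliabilities) -> float:
    """Reliability of a path where every component must work (independent failures)."""
    return math.prod(reliabilities)


if __name__ == "__main__":
    for n in (7, 10, 20, 100):
        print(f"{n:>3} components: {series_reliability([0.9999] * n):.2%}")
```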
30. Parallel paths increase reliability
All parallel paths have to fail for the overall element to fail,
so that element's reliability is:
Reliability (for element) = 1 - Probability 1 and 2 will both fail
                          = 1 - (1 - R1) * (1 - R2)
[Diagram: Client connected to an element made of parallel nodes 1 and 2]
Ex: for 2 nodes 99% reliable each:
Element reliability = 1 - (.01) * (.01) = 99.99%
4 nodes with 99% reliability each gives 8 9s reliability (99.999999%)
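The parallel case can be computed the same way. Again a sketch with a made-up function name, assuming the parallel nodes fail independently of one another.

```python
import math


def parallel_reliability(reliabilities) -> float:
    """Reliability of an element that fails only if every parallel node fails."""
    all_fail = math.prod(1 - r for r in reliabilities)
    return 1 - all_fail


if __name__ == "__main__":
    print(f"2 nodes at 99%: {parallel_reliability([0.99] * 2):.4%}")
    print(f"4 nodes at 99%: {parallel_reliability([0.99] * 4):.6%}")
```

The independence assumption matters: nodes that share a dependency (same AZ, same managed service) fail together, and the real figure will be lower.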
31. Heterogeneous parallel paths
For survivability, you can introduce multiple disparate
technologies as parallel paths
Ex. move data through both Kinesis and SQS
But doing this increases complexity and possibly cost
32. So to increase reliability
Limit path length (shorter is better)
Increase parallel paths
Watch out for overall complexity
(more lines of code
or operating cases == more things to fail)
Simple == strong
34. How do you know when to fail over?
It's not obvious.
Most regional failures are partial failures,
where most of the services are working but a few critical
ones are impaired.
Failure modes may not be simple
(such as transactions that work but are very slow, triggering
app retry storms).
35. Manual failover
You always want to build and document a manual failover
mechanism
You will need it for DR testing, if nothing else
(Even in an HA setup you should still regularly confirm you
can disable a region completely and switch back)
36. Auto-failover by business metric
A best-practice strategy is to define a business metric for
success and measure it (ex. via a CloudWatch metric), then
fail over when it decreases.
Ex. Fuel sales/minute
This lets you fail over when something is broken,
even if you don't know what.
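One possible shape for the decision rule, as a sketch only. The baseline, the 50% threshold, and the requirement of three consecutive low samples are all assumptions added here to avoid reacting to a single noisy minute. The slide only says to fail over when the metric decreases. In practice the samples would come from something like the CloudWatch metric mentioned above.

```python
def should_fail_over(
    recent_rates: list[float],
    baseline: float,
    min_fraction: float = 0.5,  # assumed threshold: below 50% of baseline
    consecutive: int = 3,       # assumed: require 3 low samples in a row
) -> bool:
    """True when the last `consecutive` samples are all below the threshold."""
    if len(recent_rates) < consecutive:
        return False
    threshold = baseline * min_fraction
    return all(rate < threshold for rate in recent_rates[-consecutive:])


if __name__ == "__main__":
    # Hypothetical fuel sales per minute for one region.
    print(should_fail_over([100, 40, 30, 20], baseline=100))  # sustained drop
    print(should_fail_over([100, 40, 90, 20], baseline=100))  # noisy, recovers
```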
37. How do you recover back to the failed region?
You need to test if it is healthy before sending traffic back, but:
Remember, it doesn't have any live transactions in it anymore
2 solutions:
- Try sending some customer traffic in and see what
happens
- Send in synthetic transactions
38. Synthetic transactions
Test transactions carry fake data and a flag indicating
they are tests (ex. IsTest = 1).
The app must filter them from certain steps and reporting (requires app changes):
- Skip sending test transactions to the processor
- Filter test transactions from reconciliation reports
[Diagram: us-east-1 with API Gateway, ECS, and Amazon DynamoDB; Payment Processor; Store network client]
They allow safely testing health in production.
Useful for canaries and manual health tests.
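The two filtering points could be sketched as follows. The `Transaction` type, its field names, and the canned response string are hypothetical. Only the idea of an is-test flag comes from the slide, and the list `processor_calls` stands in for the real call to the payment processor.

```python
from dataclasses import dataclass


@dataclass
class Transaction:
    store_id: str
    amount_cents: int
    is_test: bool = False  # corresponds to the IsTest flag on the slide


def send_to_processor(txn: Transaction, processor_calls: list) -> str:
    """Skip the external processor for synthetic transactions."""
    if txn.is_test:
        return "approved-synthetic"  # canned reply; nothing leaves the app
    processor_calls.append(txn)      # stand-in for the real processor call
    return "sent"


def reconciliation_rows(txns: list[Transaction]) -> list[Transaction]:
    """Exclude synthetic transactions from reconciliation reports."""
    return [t for t in txns if not t.is_test]


if __name__ == "__main__":
    calls: list[Transaction] = []
    real = Transaction("store-1", 1250)
    canary = Transaction("store-1", 1, is_test=True)
    print(send_to_processor(real, calls), send_to_processor(canary, calls))
    print(len(reconciliation_rows([real, canary])))
```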
39. Manual recovery
As with failover, in active-active configurations
you always want a manual recovery option to force traffic
back to a failed region,
even if you later automate recovery
Since regional impairment is relatively rare and recovery is not as
time-critical as failover, you may stay with only manual recovery,
but be sure you have a way of assessing the region's real
health before failing back
41. Don't forget:
Not everything needs to be active-active
Always consider the
Recovery Point Objective (RPO): how much data can you
lose during failover? 5 seconds? 5 minutes? 5 hours?
Recovery Time Objective (RTO): how long will it take to
recover to that point?
Real availability needs
Everything has a cost
Not everything justifies the cost or effort of reliability
(Ex. internal back-office workloads might be less important than
customer-facing workloads)
42. It's easy to do active-active poorly
and hard to do it right
So take it seriously
A simple traditional DR strategy such as backup/restore or pilot light
may be better than a weak active-active implementation
that introduces complex issues during a crisis
43. Most reliability happens within the region
AZs provide sufficient reliability for most workloads
Multi-region active-active defends against managed
service failures
And even there you can usually survive if critical paths have a second
heterogeneous technology channel (multi-path)
44. You still have to test.
Force a failover and fail back to make sure things work
45. Remember to think through:
- Failover: How do you decide a region is degraded?
- Recovery: How do you decide a region is healthy?
- Data replication between regions
- Avoiding coupling between regions
46. Complexity is the enemy
The simpler, the better
Keep transaction paths short
Parallel paths increase reliability
48. Resources:
- ARC 319: "How to Design a Multi-Region Active-Active Architecture" session at re:Invent 2017
- AWS Blog post: "Architecting Multi-Region SaaS solutions on AWS"
- AWS Solutions: Active-Passive model
- This Is My Architecture session: "SimilarWeb: Route 53 Calculated Health Checks for an Active/Active Multi-Region Architecture"
50. Is the speed of light really a factor between regions?
It's a great question.
Earth circumference = 40,000 km = 4 x 10^7 m
Speed of light in vacuum = 3 x 10^8 m/s
Theoretical best time for a light pulse to circle Earth
= 4 / 30 sec = 133 ms
We have to send it through glass (index of refraction 1.5),
increasing time x 1.5 = 200ms (light moves slower through glass)
Theoretical best time to go halfway around the world through fiber
is 100ms
In practice, electro-optical switching (the signal needs periodic amplification)
and zig-zag routing add up to 4x:
NYC -> LA = 4,000 km -> 20ms ideal, ~80ms in practice
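The arithmetic on this slide can be reproduced in a few lines. The constants are the rounded values used above. The function names are made up for illustration, and the 4x practical factor is the slide's rule of thumb, not a measured value.

```python
C_VACUUM_M_PER_S = 3e8  # speed of light in vacuum (rounded, as on the slide)
FIBER_INDEX = 1.5       # index of refraction of glass
PRACTICAL_FACTOR = 4    # rule-of-thumb overhead for switching and routing


def fiber_one_way_ms(distance_km: float) -> float:
    """Ideal one-way time through a straight run of fiber, in ms."""
    speed_in_fiber = C_VACUUM_M_PER_S / FIBER_INDEX
    return distance_km * 1000 / speed_in_fiber * 1000


def practical_one_way_ms(distance_km: float) -> float:
    """Ideal fiber time scaled by the rule-of-thumb overhead factor."""
    return fiber_one_way_ms(distance_km) * PRACTICAL_FACTOR


if __name__ == "__main__":
    print(f"Around the Earth (40,000 km): {fiber_one_way_ms(40_000):.0f} ms")
    print(f"Halfway around (20,000 km):   {fiber_one_way_ms(20_000):.0f} ms")
    print(f"NYC -> LA ideal (4,000 km):   {fiber_one_way_ms(4_000):.0f} ms")
    print(f"NYC -> LA in practice:        {practical_one_way_ms(4_000):.0f} ms")
```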
51. Well, can we get better?
If you get rid of the fiber and go speed-of-light-in-vacuum,
you reclaim the 1.5 factor.
But then you need something low in the atmosphere to
relay the signal to get around the curve of the planet.
Hmm... Stay tuned.
(but learn to architect for speed-of-light delay in multi-region
architectures anyway)