Active-Active Multi-Region Architecture
Considerations
David Rostcheck
Active-Active Architecture Overview
the most sophisticated cloud operations pattern
Pros:
- Provides DR and HA both at once
- Fast recovery from problems
- Highly efficient use of spend
Active-Active Architecture - Cons
- Requires focused approach and deep thought
- If implemented poorly, can decrease reliability
- Usually requires application changes
Main challenges in Active-Active
- Failover: how do you decide a region is degraded?
- Recovery: how do you decide a region is healthy?
- Data replication between regions
- Avoiding coupling between regions
Data replication
Data Replication
Do you absolutely need to do it?
Think this over carefully
for each data element.
Can you live with eventual consistency between regions?
(HINT: the answer had better be yes,
unless you are prepared to live with very slow transactions.
Not negotiable; it's physics.)
Data replication: special relativity
Regions are separated by distances significant enough that
speed-of-light delay becomes relevant and begins to affect your
application.
Believe it or not, this becomes a Special Relativity problem.
What? Like Albert Einstein's 1905 Special Relativity?
Yes.
Frames of Reference
In relativity, observers live in frames of reference.
Special relativity deals with the
special case where two frames
are not accelerating relative to
one another.
General relativity deals with the
more general case where they
are.
Distance matters
If the observers are close together, we can ignore relativity.
If they are separated far enough that the speed of light becomes
significant for measurements, we can't anymore.
80 ms
Involuntary regional time travel
Region A <--80 ms--> Region B
To ask a question and get a response (e.g. "what's the current
batch number?") takes 160 ms, and the answer is 80 ms old.
Region A always sees a view
of Region B that is 80 ms in the
past (and vice-versa).
Synchronous Replication (is slooooowwww)
If you're starting to think "This will be a problem for my
application," you're right.
How do you get what a value is now?
You can't. Since you can only see into the other region's
past, the best you can do is send it a message to freeze any
local updates until you tell it to resume and send you the
current value, then tell it when you're done.
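The "freeze, read, resume" protocol above can be sketched as follows. This is a minimal illustration, not a real API; the class and method names are hypothetical, and the latency accounting simply tallies 80 ms per one-way message to show why the protocol is so slow.

```python
# Sketch of the "freeze, read, resume" cross-region read protocol.
INTER_REGION_MS = 80  # assumed one-way latency between the two regions

class RemoteRegion:
    """Stand-in for the far region's coordination endpoint (hypothetical)."""
    def __init__(self, value):
        self._value = value
        self.frozen = False

    def freeze_updates(self):
        self.frozen = True

    def current_value(self):
        return self._value

    def resume_updates(self):
        self.frozen = False

def read_current_value(remote: RemoteRegion):
    """Freeze the remote region, read its value, resume, and tally the latency."""
    elapsed_ms = 0.0
    remote.freeze_updates()
    elapsed_ms += 2 * INTER_REGION_MS   # freeze request + acknowledgement
    value = remote.current_value()
    elapsed_ms += 2 * INTER_REGION_MS   # read request + response
    remote.resume_updates()
    elapsed_ms += INTER_REGION_MS       # one-way resume notification
    return value, elapsed_ms

value, cost_ms = read_current_value(RemoteRegion(value=42))
print(value, cost_ms)  # 42 400.0 -- and updates were blocked the whole time
```

Five one-way hops at 80 ms each is 400 ms per read, during which the remote region cannot accept local updates. That blocking is what makes synchronous cross-region coordination snowball.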
Hold up
If synchronous replication across regions is
starting to sound like a bad idea
that can significantly delay transactions in both regions and
could rapidly snowball into a mess,
you're getting the idea.
The best solution is to not do it
Break the dependency
We need to go back to the data and re-organize to eliminate
the need to coordinate between regions
Let's explore how:
Example - Credit card processing application
Single region
us-east-1
API Gateway ECS Amazon DynamoDB
Payment
Processor
Store network
Client
Stores send in credit
card transactions.
Container tasks
communicate with an
external payment
processor and
coordinate the
current batch ID for a
store via a
DynamoDB table
Example - Credit card processing application
Multi-region active-active
us-east-1
API Gateway ECS Amazon DynamoDB
Payment
Processor
Stores distribute
transactions between
regions.
DynamoDB global
tables can sync the
data between the
regions, but only
eventually, due to the
speed-of-light delay;
no longer sufficient for
coordinating the batch ID
across all the ECS
tasks
us-west-2
API Gateway ECS Amazon DynamoDB
Store network
Client
80 ms
delay
Solution #1: separate payment processors
us-east-1
API Gateway ECS Amazon DynamoDB
Payment
Processor #1
Regions talk to
separate instances of
the payment
processor (or
different payment
processors
entirely); the need to
coordinate across
regions is eliminated
us-west-2
API Gateway ECS Amazon DynamoDB
Store network
Client Payment
Processor #2
Solution #2: separate batch sequences
Multi-region active-active
us-east-1
API Gateway ECS Amazon DynamoDB
Payment
Processor
East and West
regions use even and
odd batch
sequences, so they
can never conflict.
A store can have
multiple batches
open at one time,
one in East and one
in West
us-west-2
API Gateway ECS Amazon DynamoDB
Store network
Client
Batch 2, 4, 6
Batch 1, 3, 5
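The even/odd scheme generalizes to interleaved sequences: each region keeps its own counter and adds a fixed per-region offset, so the regions can never mint the same batch ID without ever coordinating. A minimal sketch (the class and region-offset table are illustrative, not the deck's actual implementation; in practice the counter would be an atomic counter in the region's own DynamoDB table):

```python
# Sketch: region-disjoint batch IDs via interleaved sequences.
# East mints 1, 3, 5, ...; West mints 2, 4, 6, ... No cross-region
# coordination is needed because the sequences cannot collide.
REGION_OFFSETS = {"us-east-1": 1, "us-west-2": 2}  # hypothetical mapping
NUM_REGIONS = 2

class BatchIdAllocator:
    def __init__(self, region: str):
        self.offset = REGION_OFFSETS[region]
        self.counter = 0  # in practice: a per-region atomic counter

    def next_batch_id(self) -> int:
        batch_id = self.counter * NUM_REGIONS + self.offset
        self.counter += 1
        return batch_id

east = BatchIdAllocator("us-east-1")
west = BatchIdAllocator("us-west-2")
print([east.next_batch_id() for _ in range(3)])  # [1, 3, 5]
print([west.next_batch_id() for _ in range(3)])  # [2, 4, 6]
```

The trade-off, as the slide notes, is that a store can have two batches open at once, one per region, which downstream reconciliation must tolerate.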
Solution #3: geographic preference
us-east-1
API Gateway ECS Amazon DynamoDB
Payment
Processor
Store network
Client
Stores are assigned to
groups by
geography. They normally
send to their primary
region, but can
switch to the secondary
region if needed.
Pros: low store-to-
region latency.
Cons: load
distribution is less
even; edge cases
on failover/failback
us-west-2
API Gateway ECS Amazon DynamoDB
Client
West stores
East stores
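The geographic-preference routing can be sketched as an ordered preference list per store group: try the primary region, fall back to the secondary if it is unhealthy. The group names and region lists below are illustrative assumptions, not part of the original design:

```python
# Sketch: geographic preference with failover to a secondary region.
REGION_PREFS = {
    "east": ["us-east-1", "us-west-2"],  # primary first, then secondary
    "west": ["us-west-2", "us-east-1"],
}

def pick_region(store_group: str, healthy_regions: set) -> str:
    """Return the first healthy region in the store group's preference list."""
    for region in REGION_PREFS[store_group]:
        if region in healthy_regions:
            return region
    raise RuntimeError("no healthy region available")

print(pick_region("east", {"us-east-1", "us-west-2"}))  # us-east-1
print(pick_region("east", {"us-west-2"}))  # primary down: falls back to us-west-2
```

The failover/failback edge cases the slide mentions live in how `healthy_regions` is computed and how quickly stores are moved back, which is exactly the recovery problem discussed later.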
Coupling
Coupling
Regions are intended to be isolated
Dependencies between regions (coupling) cause fragility
Avoiding coupling requires a change of mindset
Submit changes to multiple regions independently?
Is consistency required?
Should users be bound to a primary and secondary region?
Coupling: example problem
Separating the regions introduces an issue: coordinating the
batch ID for a store.
So we introduce a new service to allocate batch IDs.
But both regions need to use it.
Now they are coupled. We have introduced a single point of
failure that can break both regions.
Coupling: example problem
us-east-1
API Gateway ECS Amazon DynamoDB
Payment
Processor #1
If the batch ID
service in us-east-1
fails, both regions will
be unable to process
transactions.
Coupling has
introduced fragility.
us-west-2
API Gateway ECS
Store network
Client Payment
Processor #2
ID service
Struggling with the batch ID service
We can use a second batch ID service in the West region to
break the coupling
But remember: it can't sync to its copy in East any faster than
DynamoDB could (relativity again).
All we did was move the problem.
We can make it active-passive
Batch ID service in each region, active/passive
us-east-1
API Gateway ECS Amazon DynamoDB
Payment
Processor #1
Active-passive
pattern restores
resiliency, but adds
more complexity:
more things to break
us-west-2
API Gateway ECS
Store network
Client Payment
Processor #2
ID service
(primary)
ID service
(secondary)
The best solution is to remove the need for strong
consistency across regions
Live with it?
It's also possible to just accept the transaction delay of
strong consistency
(minimum 2x inter-region latency)
Be careful when introducing new regions as the delay may
become unacceptable
us-east-1 -> us-east-2: 15ms
us-east-1 -> ap-southeast-1 (Singapore): 200ms
Reliability principles
Effect of adding components
Overall reliability
is the product of the
reliabilities of all
components along the
transaction path
(including
network links)
us-east-1
API Gateway ECS Amazon DynamoDB
Payment
Processor
Store network
Client
1
2
3
4
5
6
7
Reliability = R1 * R2 * R3 * ... * R7
= .999 * .998 * .9999 * ...
Longer paths are less reliable
Example: if all components are 99.99% reliable:
# Components Reliability
7 99.93%
10 99.9%
20 99.8%
100 99%
So adding components to the transaction
path decreases reliability
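The series-reliability table above is easy to reproduce: reliability of a path of n components at reliability r is just r raised to the n. A quick sketch:

```python
# Series reliability: components on the transaction path multiply,
# so every component added reduces overall reliability.
def series_reliability(r: float, n: int) -> float:
    """Reliability of n components in series, each with reliability r."""
    return r ** n

for n in (7, 10, 20, 100):
    print(f"{n:>3} components at 99.99% each -> {series_reliability(0.9999, n):.2%}")
```

Note how the decay compounds: at 100 components even four-nines parts only deliver about 99% end to end.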
Parallel paths increase reliability
All parallel paths have to fail
for the overall element to fail,
so that element's reliability is:
Reliability (for element) = 1 - Probability that paths 1 and 2 both fail
= 1 - (1 - R1) * (1 - R2)
Ex: for 2 nodes 99% reliable each:
Element reliability = 1 - (.01)*(.01) = 99.99%
4 nodes with 99% reliability each gives 8 9s of reliability (99.999999%)
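The parallel-path formula generalizes to any number of paths: the element fails only if every path fails, so multiply the failure probabilities and subtract from 1. A short sketch:

```python
# Parallel reliability: the element fails only if ALL paths fail.
def parallel_reliability(path_reliabilities: list) -> float:
    """1 - product of each path's failure probability."""
    p_all_fail = 1.0
    for r in path_reliabilities:
        p_all_fail *= (1.0 - r)
    return 1.0 - p_all_fail

print(f"{parallel_reliability([0.99, 0.99]):.4%}")   # two 99% paths
print(f"{parallel_reliability([0.99] * 4):.8%}")     # four 99% paths: 8 nines
```

This is why adding parallel paths is the one form of added complexity that buys reliability rather than spending it.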
Heterogeneous parallel paths
For survivability, you can introduce multiple disparate
technologies as parallel paths
Ex. move data through both Kinesis and SQS
But doing this increases complexity and possibly cost
So to increase reliability
Limit path length (shorter is better)
Increase parallel paths
Watch out for overall complexity
(more lines of code
or operating cases == more things to fail)
Simple == strong
Failover and recovery
How do you know when to fail over?
It's not obvious.
Most regional failures are partial failures,
where most of the services are working but a few critical
ones are impaired.
Failure modes
may not be simple
(such as transactions that work but are very slow, triggering
app retry storms)
Manual failover
You always want to build and document a manual failover
mechanism
You will need it for DR testing, if nothing else
(Even in an HA setup you should still regularly confirm you
can disable a region completely and switch back)
Auto-failover by business metric
A best-practice strategy is to define a business metric for
success and measure it (ex. via a CloudWatch metric), then
fail over when it decreases.
Ex. Fuel sales/minute
This lets you fail over when something is broken
even if you don't know what
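A business-metric trigger can be sketched as: keep a rolling baseline of the metric (e.g. fuel sales/minute) and fail over when the current value collapses below a fraction of it. In practice this would be a CloudWatch metric and alarm; the class below is a hypothetical stand-in, and the window and threshold values are illustrative assumptions:

```python
# Sketch: fail over when the business success metric drops far below
# its recent baseline, regardless of which component actually broke.
from collections import deque

class BusinessMetricMonitor:
    def __init__(self, baseline_window: int = 60, drop_threshold: float = 0.5):
        self.samples = deque(maxlen=baseline_window)  # recent per-minute values
        self.drop_threshold = drop_threshold          # e.g. alarm below 50% of baseline

    def record(self, value: float) -> None:
        self.samples.append(value)

    def should_fail_over(self, current: float) -> bool:
        if not self.samples:
            return False  # no baseline yet
        baseline = sum(self.samples) / len(self.samples)
        return current < baseline * self.drop_threshold

monitor = BusinessMetricMonitor()
for v in [100, 105, 98, 102]:   # normal sales/minute
    monitor.record(v)
print(monitor.should_fail_over(95))  # False: within normal range
print(monitor.should_fail_over(20))  # True: the metric collapsed
```

The point of measuring the business outcome rather than component health is exactly what the slide says: it catches partial, weird failure modes that no per-service health check anticipated.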
How do you recover back to the failed region?
You need to test if it is healthy before sending traffic back, but:
Remember: it doesn't have any live transactions in it any more
2 solutions:
- Try sending some customer traffic in and see what
happens
- Send in synthetic transactions
Synthetic transactions
Test transactions carry fake data and a flag indicating
they are tests.
The app must filter them from certain steps and from reporting
us-east-1
API Gateway ECS Amazon DynamoDB
Payment
Processor
Store network
Client
Allow safely testing health in production
Useful for canaries, manual health tests
Skip sending test
transactions to processor
IsTest = 1
Filter test transactions
from reconciliation reports
(requires app changes)
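The app changes amount to branching on the flag at the two sensitive points: never send synthetic traffic to the payment processor, and filter it out of reconciliation. A minimal sketch, assuming an `IsTest` field as in the diagram (the handler and list names are hypothetical):

```python
# Sketch: synthetic transactions exercise the real path end to end,
# but are skipped at the external processor and in reports.
PROCESSED = []  # stand-in for the external payment processor

def send_to_processor(txn: dict) -> None:
    PROCESSED.append(txn)

def handle_transaction(txn: dict) -> str:
    is_test = txn.get("IsTest", 0) == 1
    # validation and the DynamoDB write would happen here for both kinds,
    # so synthetic traffic genuinely tests the region's health
    if not is_test:
        send_to_processor(txn)  # never charge a fake card
    return "ok (synthetic)" if is_test else "ok"

def reconciliation_rows(txns: list) -> list:
    """Filter synthetic transactions out of reconciliation reports."""
    return [t for t in txns if t.get("IsTest", 0) != 1]

real = {"store": 7, "amount": 19.99}
fake = {"store": 7, "amount": 0.01, "IsTest": 1}
print(handle_transaction(real))  # ok
print(handle_transaction(fake))  # ok (synthetic)
print(len(PROCESSED))            # only the real transaction reached the processor
```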
Manual recovery
As with failover, in active-active configurations
you always want a manual recovery option to force traffic
back to a failed region,
even if you later automate recovery
Since regional impairment is relatively rare and recovery is not as
time-critical as failover, you may stay with only manual recovery,
but be sure you have a way of assessing the region's real
health before failing back.
In summation
Dont forget:
Not everything needs to be active-active
Always consider:
- Recovery Point Objective (RPO): how much data can you
lose during failover? 5 seconds? 5 minutes? 5 hours?
- Recovery Time Objective (RTO): how long will it take to
recover to that point?
- Real availability needs
Everything has a cost.
Not everything justifies the cost or effort of reliability
(ex. internal back-office workloads might be less important than
customer-facing workloads).
It's easy to do active-active poorly
and hard to do it right
So take it seriously
A simple traditional DR strategy such as backup/restore or pilot light
may be better than a weak active-active implementation
that introduces complex issues during a crisis
Most reliability happens within the region
- AZs provide sufficient reliability for most workloads
- Multi-region active-active defends against managed
service failures
And even there you can usually survive if critical paths have a second
heterogeneous technology channel (multi-path)
You still have to test.
Force a failover and fail back to make sure things work
Remember to think through:
- Failover: how do you decide a region is degraded?
- Recovery: how do you decide a region is healthy?
- Data replication between regions
- Avoiding coupling between regions
Complexity is the enemy
The simpler, the better
Keep transaction paths short
Parallel paths increase reliability
And remember
we are here to help
Resources:
- "ARC 319: How to Design a Multi-Region Active-Active
Architecture" session at re:Invent 2017
- AWS Blog post: "Architecting Multi-Region SaaS Solutions
on AWS"
- AWS Solutions Active-Passive model
- This Is My Architecture session: "SimilarWeb: Route 53
Calculated Health Checks for an Active/Active Multi-
Region Architecture"
Appendix
Is the speed of light really a factor between regions?
It's a great question.
Earth circumference = 40,000 km = 4 x 10^7 m
Speed of light in vacuum = 3 x 10^8 m/s
Theoretical best time for a light pulse to circle the Earth
= (4 x 10^7 m) / (3 x 10^8 m/s) ≈ 133 ms
We have to send it through glass (index of refraction ~1.5),
increasing the time by 1.5x to 200 ms (light moves slower through glass)
So the theoretical best time to go halfway around the world through fiber
is 100 ms
In practice, electro-optical switching (the signal must be amplified
periodically) and zig-zag routing add up to roughly 4x that:
NYC -> LA = 4,000 km -> 20 ms ideal through fiber, ~80 ms in practice
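The back-of-envelope numbers above can be computed directly. The 4x "practical overhead" factor is the rough empirical fudge from the text, not a physical constant:

```python
# Fiber propagation delay: distance / (c / refractive index),
# optionally scaled by a rough real-world overhead factor.
C_VACUUM = 3e8           # m/s, speed of light in vacuum
FIBER_INDEX = 1.5        # light travels ~1.5x slower in glass
PRACTICAL_OVERHEAD = 4   # switching + zig-zag routing (rough factor)

def fiber_delay_ms(distance_km: float, practical: bool = False) -> float:
    seconds = (distance_km * 1000) / (C_VACUUM / FIBER_INDEX)
    return seconds * 1000 * (PRACTICAL_OVERHEAD if practical else 1)

print(round(fiber_delay_ms(20_000)))       # halfway around Earth: 100 ms
print(round(fiber_delay_ms(4_000)))        # NYC -> LA ideal: 20 ms
print(round(fiber_delay_ms(4_000, True)))  # NYC -> LA in practice: 80 ms
```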
Well... can we get better?
If you get rid of the fiber and go at the speed of light in vacuum,
you reclaim the 1.5x factor.
But then you need something low in the atmosphere to
relay the signal to get around the curve of the planet.
Hmm... Stay tuned.
(But learn to architect for speed-of-light delay in multi-region
architectures anyway.)