RADOS LEVEL REPLICATION
Xuehan Xu
ISSUES
• The Overall Architecture
• Main cluster implementation
• Backup cluster implementation
The Overall Architecture
• Implementation principles:
  - Modify the current system as little as possible and reuse its existing components as much as possible.
  - Keep the performance impact on other system components as small as possible.
Main Cluster Components
• Classification of ops:
  - Ops issued by clients (librados ops);
  - Ops issued by other OSDs (repops).
  Only ops issued by clients need to be replicated.
• Main difficulties:
  - Handling the various situations in which replication cannot proceed.
    Solutions: see later sections.
  - Preserving op order during replication when the osdmap changes.
    Solution:
    - Treat replication journal (FileJournal) entries for ops that need replication as removable only once the corresponding replication has finished.
    - During the recovery/backfill phase, recover an object only after all original ops targeting it have been replicated.
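
The journal rule above can be sketched as a journal that trims from the front and stops at the first entry whose replication is still pending. This is a minimal illustration, not FileJournal code: the names JournalEntry, Journal, removable, and trim are hypothetical, and the "applied" flag is an assumed stand-in for the journal's normal removal condition.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class JournalEntry:
    seq: int
    need_replicate: bool      # op came from a client and must be replicated
    applied: bool = False     # op has been applied to the local store (assumed condition)
    replicated: bool = False  # the backup cluster has committed the op


class Journal:
    """Hypothetical in-memory stand-in for a replication-aware journal."""

    def __init__(self):
        self.entries = deque()

    def append(self, entry):
        self.entries.append(entry)

    def removable(self, entry):
        # An entry that needs replication stays until its replication finished.
        if not entry.applied:
            return False
        return entry.replicated or not entry.need_replicate

    def trim(self):
        """Drop removable entries from the front, preserving op order."""
        removed = []
        while self.entries and self.removable(self.entries[0]):
            removed.append(self.entries.popleft().seq)
        return removed
```

Because trimming stops at the first pending entry, later entries are retained even if they are individually removable, which keeps the original op order available for replication.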
Main Cluster Components
• REPLICATION SUSPEND
  - If enough OSDs in a pg's acting set have a full replication ops cache, replication of ops targeting that pg should be suspended, and the monitors should also suspend sending out snapshot notifications.
  - If the backup cluster is "replication full", suspend the replication.
• SNAPSHOT NOTIFICATION SUSPEND
  - When the REPLICATION SUSPEND condition is met, snapshot notification should also be suspended.
  - When the system clocks of all OSDs in a pg's acting set are out of sync, snapshot notification should be suspended.
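
The two suspend conditions can be written as simple predicates. This is a sketch under stated assumptions: the slides do not say how many full OSDs count as "enough", so full_osd_limit is a hypothetical parameter, and both function names are made up for illustration.

```python
def replication_suspended(cache_full_by_osd, full_osd_limit, backup_replication_full):
    """REPLICATION SUSPEND for one pg.

    cache_full_by_osd maps each OSD in the acting set to True if its
    replication ops cache is full. full_osd_limit is an assumed threshold.
    """
    full = sum(1 for is_full in cache_full_by_osd.values() if is_full)
    return backup_replication_full or full >= full_osd_limit


def snapshot_notification_suspended(replication_is_suspended, clock_in_sync_by_osd):
    """SNAPSHOT NOTIFICATION SUSPEND for one pg."""
    # Suspended when every OSD in the acting set reports an out-of-sync clock.
    all_out_of_sync = bool(clock_in_sync_by_osd) and not any(clock_in_sync_by_osd.values())
    return replication_is_suspended or all_out_of_sync
```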
Main Cluster Components
• Replication Ops Cache: reuse FileJournal (for now, only FileStore has been considered).
• New journal header fields: op_replication_head and op_replication_tail.
  - The distance op_replication_tail - op_replication_head is bounded by journal_replication_threshold.
  - If journal_head hits op_replication_tail, move it to op_replication_head + 1.
  - Only journal entries outside [journal_head, journal_tail] ∪ [op_replication_head, op_replication_tail] can be removed.
• New journal entry field: need_replicate, indicating that this op needs to be replicated.
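
The removal rule above amounts to an interval check. The sketch below assumes linear, non-wrapping journal offsets and closed intervals whose first bound is not larger than the second; a real ring-buffer journal would also have to handle wrap-around. The function names are hypothetical.

```python
def in_range(pos, start, end):
    """True if pos lies in the closed interval [start, end]."""
    return start <= pos <= end


def can_remove(pos, journal_head, journal_tail,
               op_replication_head, op_replication_tail):
    """An entry is removable only if it lies outside both protected ranges."""
    return not (
        in_range(pos, journal_head, journal_tail)
        or in_range(pos, op_replication_head, op_replication_tail)
    )
```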
Main Cluster Components
• New flags in Object Info:
  - need_replicate: set/unset by clients; indicates whether ops targeting this object need to be replicated.
  - need_full_replication: set/unset by the OSD; indicates whether any ops targeting this object were not replicated because of the REPLICATION SUSPEND condition.
• New module: ReplicationWorker
  - Conceptually comprises a cache that stores ops to be replicated and a worker thread that continuously replicates the ops targeting pgs for which the current OSD is the acting primary.
  - ReplicationWorker removes ops once they are replicated (a commit message is received from the OSDs in the backup cluster, or a "can remove" message is received from their acting primaries) and moves the journal's op_replication_tail pointer forward (op_replication_head is moved forward by the OSD's work thread).
  - When an op is replicated, ReplicationWorker should also send a "can remove" message to the other OSDs of the op's pg.
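
The bookkeeping described for ReplicationWorker can be modelled in a few lines. This is a single-threaded sketch, not the proposed module: the method names, the outbox list standing in for the messenger, and the use of sequence numbers as journal positions are all assumptions made for illustration.

```python
class ReplicationWorker:
    """Hypothetical model of the cache and pointer handling."""

    def __init__(self, peers):
        self.peers = list(peers)      # other OSDs in the pg's acting set
        self.pending = {}             # seq -> op awaiting replication
        self.op_replication_head = 0  # next position; advanced when an op is cached
        self.op_replication_tail = 0  # oldest position still awaiting replication
        self.outbox = []              # (peer, message, seq) tuples that would be sent

    def enqueue(self, op):
        # Stands in for the OSD work thread moving op_replication_head forward.
        seq = self.op_replication_head
        self.pending[seq] = op
        self.op_replication_head += 1
        return seq

    def _advance_tail(self):
        # The tail only moves over positions that are no longer pending.
        while (self.op_replication_tail < self.op_replication_head
               and self.op_replication_tail not in self.pending):
            self.op_replication_tail += 1

    def on_backup_commit(self, seq):
        """Commit received from the backup cluster: drop the op, notify peers."""
        if seq not in self.pending:
            return
        del self.pending[seq]
        for peer in self.peers:
            self.outbox.append((peer, "can_remove", seq))
        self._advance_tail()

    def on_can_remove(self, seq):
        """'can remove' received from an acting primary: drop the op locally."""
        self.pending.pop(seq, None)
        self._advance_tail()
```

Keeping the tail behind the oldest pending op means that an out-of-order commit frees the cached op immediately but does not release journal space until all earlier ops are done.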
Main Cluster Components
• Replication Robustness
  - New configuration item replication_min_size: the minimum number of OSDs that must hold a replica of a librados op while its replication is not yet complete.
  - In the reply to a repop, replica OSDs should include a flag indicating whether that repop moved op_replication_tail forward.
  - If fewer than replication_min_size OSDs have moved their op_replication_tail forward, the pg should be marked "replication full", and no subsequent librados ops can be replicated until that mark is cleared.
  - When an OSD's replication ops cache has enough free space again, it should send a message to every acting primary with which it shares a pg. Once enough OSDs are capable of caching librados ops for replication, the pg's "replication full" mark is cleared.
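
The "replication full" mark can be sketched as a small per-pg state machine. The class and method names are hypothetical, and reading "enough OSDs" as "at least replication_min_size" when clearing the mark is an assumption; the slides do not state the exact threshold.

```python
class PGReplicationState:
    """Hypothetical per-pg view kept by the acting primary."""

    def __init__(self, replication_min_size):
        self.replication_min_size = replication_min_size
        self.replication_full = False

    def on_repop_replies(self, tail_moved_flags):
        """tail_moved_flags: one bool per OSD, True if it cached the op.

        Returns True if subsequent librados ops may still be replicated.
        """
        cached = sum(1 for moved in tail_moved_flags if moved)
        if cached < self.replication_min_size:
            self.replication_full = True
        return not self.replication_full

    def on_cache_space_available(self, osds_with_space):
        """Called when OSDs report free cache space again."""
        if osds_with_space >= self.replication_min_size:
            self.replication_full = False
```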
Main Cluster Components
• Replication Robustness
  - For better performance, the acting primary does not need to wait for the repop reply: all target objects are marked need_full_replication, and the mark is cleared on all of them once the librados op is replicated.
  - When replication of a librados op is done, the OSD that executed the replication sends a "replication succeeded" message for that op to the other OSDs in the acting set.
  - Replication is performed by the first OSD in the acting set (which is an ordered list) that has enough Replication Ops Cache. The acting primary makes this decision.
• New field in PG Info:
  - need_full_replication_set: a bloom filter that filters out objects that are definitely not marked need_full_replication.
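
Two pieces of this slide lend themselves to a short sketch: choosing the replicating OSD and the bloom filter semantics. Both are illustrations under assumptions. pick_replicator, its arguments, and the BloomFilter class with its sizes and SHA-256 based hashing are hypothetical and say nothing about how Ceph would implement them.

```python
import hashlib


def pick_replicator(acting_set, free_cache_by_osd, required):
    """Return the first OSD in the ordered acting set with enough free cache."""
    for osd in acting_set:
        if free_cache_by_osd.get(osd, 0) >= required:
            return osd
    return None


class BloomFilter:
    """Minimal bloom filter: no false negatives, possible false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, key):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        # False means the object is definitely not marked.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))
```

A plain bloom filter cannot remove a single key, so clearing the need_full_replication mark for one object would require rebuilding the filter or using a counting variant.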
Main Cluster Components
• Peering/Recovery/Backfill procedure
  - The only addition to this procedure is that an object's recovery source should replicate all pending librados ops targeting that object before pushing it.
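
The recovery rule can be sketched as a guard around the push. The function name and the replicate and push callbacks are hypothetical placeholders for whatever the recovery source would actually call.

```python
def push_object(object_id, pending_ops, replicate, push):
    """Replicate every pending op for object_id, in order, then push it.

    pending_ops is a list of (object_id, op) pairs in original op order;
    ops for other objects are left in place.
    """
    remaining = []
    for op_object_id, op in pending_ops:
        if op_object_id == object_id:
            replicate(op)
        else:
            remaining.append((op_object_id, op))
    pending_ops[:] = remaining
    push(object_id)
```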
Backup Cluster Implementation
• Main difficulty:
  - Ensuring efficient caching of replication ops and efficient recovery when the osdmap changes.
• Solution:
  - Add a processing step for replication ops between op journaling and op applying: raw ops storing.
  - In the raw ops storing phase, a three-tuple <timestamp, object_id, op> is stored in RocksDB.
  - Op applying happens when the backup cluster's monitor receives a snapshot notification indicating that replication ops earlier than a certain time point can be merged into the backing store.
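
The store-then-apply flow can be sketched with an in-memory sorted list standing in for RocksDB. RawOpStore and its methods are hypothetical, and the internal sequence number is an assumption added only so that ops with equal timestamps keep their arrival order.

```python
import bisect
import itertools


class RawOpStore:
    """Hypothetical stand-in for the raw ops storing phase."""

    def __init__(self):
        self._ops = []                 # sorted (timestamp, seq, object_id, op)
        self._seq = itertools.count()  # tie-breaker for equal timestamps

    def store(self, timestamp, object_id, op):
        bisect.insort(self._ops, (timestamp, next(self._seq), object_id, op))

    def take_earlier_than(self, cutoff):
        """Remove and return all ops with timestamp < cutoff, oldest first.

        Called when a snapshot notification for time point `cutoff` arrives.
        """
        idx = bisect.bisect_left(self._ops, (cutoff,))
        ready, self._ops = self._ops[:idx], self._ops[idx:]
        return [(ts, oid, op) for ts, _seq, oid, op in ready]
```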
Backup Cluster Implementation
• Solution:
  - The raw ops storing phase is added because recovery/backfill must recover both the unmerged ops and the target object; storing the three-tuple <timestamp, object_id, op> in RocksDB makes queries such as "which ops need to be recovered for a given object" more efficient.
  - NOTE: only replication ops need the raw ops storing phase.
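
One way such a per-object query could be made cheap is a key layout that groups an object's ops together in key order. The slides do not specify a key format, so the encoding below (object id, a NUL separator, then a big-endian timestamp) is purely an assumption, and a plain dict with sorted keys stands in for an ordered key-value store. Object ids are assumed not to contain NUL bytes.

```python
import struct


def make_key(object_id, timestamp_ns):
    """Hypothetical key: object id, NUL separator, 8-byte big-endian timestamp."""
    return object_id.encode() + b"\x00" + struct.pack(">Q", timestamp_ns)


def ops_for_object(store, object_id):
    """Return the ops stored for one object, oldest first."""
    prefix = object_id.encode() + b"\x00"
    return [store[key] for key in sorted(store) if key.startswith(prefix)]
```

With byte-wise ordered keys, all ops of one object are adjacent and already sorted by time, so a store with prefix iteration could answer the query with a single range scan instead of the full sort used here.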
Backup Cluster Implementation
• In-memory cache (FileStore::op_wq)
  - To make op applying more efficient, a new thread should keep feeding FileStore::op_wq with the replication ops stored in RocksDB during the raw ops storing phase.
  - FileStore::op_tp should also distinguish replication ops from other ops, since replication ops can only be applied once the corresponding snapshot notification has been received.
  - Replication ops targeting the same object should be merged as far as possible before being applied to the object. This makes the applying phase more efficient and, in turn, the whole replication more efficient, since the applying phase is likely to be the final bottleneck of the whole replication mechanism.
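
Merging before applying can be illustrated with a deliberately simplified op model. Here each op is assumed to be a key/value update on an object, so merging reduces to "the latest timestamp wins per key"; real RADOS ops (writes at offsets, truncates, deletes) would need type-specific merge rules. The function name and tuple layout are hypothetical.

```python
from collections import defaultdict


def merge_ops(ops):
    """Collapse (timestamp, object_id, key, value) updates per object.

    Returns {object_id: {key: value}} holding only the final value of each
    key, so each object is touched once in the applying phase.
    """
    merged = defaultdict(dict)
    for _ts, object_id, key, value in sorted(ops, key=lambda op: op[0]):
        merged[object_id][key] = value  # later timestamps overwrite earlier ones
    return dict(merged)
```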
Other Considerations
• Clock Synchronization
  - Planning to use Chrony, since it is the default clock synchronization tool in recent CentOS releases, and CentOS is a widely used Linux distribution.
  - Adding extra methods that calculate hard error bounds.
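
As a starting point for such bounds, the chrony documentation for "chronyc tracking" gives an absolute bound on clock error as |system time offset| + root dispersion + half the root delay. The sketch below computes that bound and uses it for a conservative ordering test; the function names are hypothetical, the ordering test is an assumption about how the bound might be used for snapshot time points, and all values are in seconds.

```python
def clock_error_bound(system_time_offset, root_delay, root_dispersion):
    """Bound on clock error, following the formula in chrony's tracking docs."""
    return abs(system_time_offset) + root_dispersion + 0.5 * root_delay


def definitely_before(t1, bound1, t2, bound2):
    """True only if t1 precedes t2 even in the worst case for both clocks."""
    return t1 + bound1 < t2 - bound2
```

Two timestamps whose error intervals overlap cannot be ordered this way, which is one reason a snapshot time point might need to wait until the bound has passed.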
Feb 7th CDM


Editor's Notes

  • #11: Like the pool's existing min_size, there should be a configuration item, replication_min_size, giving the minimum number of OSDs that must hold a replica of a librados op while its replication is not yet complete.