smallfiles

Issue
 10-20 million objects per device
 50 million inodes per device
 36 devices per server
 64 GB of RAM
 1 inode is 1KB in RAM
 Would need 1.75TB of RAM to cache all inodes
 75% cache miss on inodes
 Up to 50% of IO spent fetching inodes from the device
 (replicator/reconstructor constantly scan the devices...)
Solution
 Get rid of inodes
 Haystack-like solution
 Objects in volumes (a.k.a. big files, 5GB or 10GB)
 K/V store to map object to (volume id, position)
 K/V is a gRPC service
 Backed by LevelDB (for now...)
 Need to avoid compaction issues
 fallocate(PUNCH_HOLE)
 Smart selection of volumes
Benefits
 42 bytes per object in K/V
 Compared to 1KB for an XFS inode
 Fits in memory (20GB vs 1.75TB)
 Should easily go down to 30 bytes per object
 Listdir happens in K/V (so in memory)
 Space efficient: objects are packed, not block-aligned (!)
 Flat namespace for objects
 No part/sfx/ohash
 Increasing part power is just a ring thing
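
The memory figures on the Issue and Benefits slides can be rechecked with a little arithmetic. This is only a rough sketch: it reads "1KB" as 1024 bytes and treats the 10-20 million objects per device as the number of K/V entries, both of which are assumptions rather than something the slides state. The results land in the same range as the slides' 1.75TB and 20GB.

```python
# Inputs taken from the slides; unit interpretation is an assumption.
DEVICES_PER_SERVER = 36
INODES_PER_DEVICE = 50_000_000
INODE_BYTES = 1024                # "1 inode is 1KB in RAM", read as 1 KiB
KV_ENTRY_BYTES = 42               # current K/V cost per object
OBJECTS_PER_DEVICE = (10_000_000, 20_000_000)

GIB = 1024 ** 3
TIB = 1024 ** 4


def inode_cache_bytes():
    """RAM needed to keep every inode of one server cached."""
    return DEVICES_PER_SERVER * INODES_PER_DEVICE * INODE_BYTES


def kv_index_bytes(objects_per_device, entry_bytes=KV_ENTRY_BYTES):
    """RAM needed to hold one K/V entry per object for one server."""
    return DEVICES_PER_SERVER * objects_per_device * entry_bytes


if __name__ == "__main__":
    print(f"inode cache: {inode_cache_bytes() / TIB:.2f} TiB")
    for count in OBJECTS_PER_DEVICE:
        size = kv_index_bytes(count) / GIB
        print(f"K/V index at {count:,} objects/device: {size:.1f} GiB")
```

Passing `entry_bytes=30` to `kv_index_bytes` gives the footprint for the 30-byte target mentioned above.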
Adding an object
1. Select a volume
2. Append object data
   1. Object header (magic string, ohash, size, ...)
   2. Object metadata
   3. Object data
3. fdatasync() volume
4. Insert new entry in K/V (no transaction)
 <o><policy><ohash><filename> => <volume id><offset>
=> If crash, the volume acts as a journal to replay
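
The write path above could be sketched roughly as follows. The header layout (`MAGIC`, the `HEADER` struct), the function names, and the plain dict standing in for the K/V service are all hypothetical; the slides only say the header carries a magic string, the ohash and a size. `replay` shows why no K/V transaction is needed: the volume alone is enough to rebuild the mapping. It assumes a volume without punched holes.

```python
import os
import struct

MAGIC = b"OBJ1"                     # hypothetical magic string
HEADER = struct.Struct("<4s32sII")  # magic, ohash, metadata length, data length


def append_object(fd, volume_id, kv, key, ohash, metadata, data):
    """Append header + metadata + data, sync the volume, then index it."""
    offset = os.lseek(fd, 0, os.SEEK_END)
    header = HEADER.pack(MAGIC, ohash, len(metadata), len(data))
    os.write(fd, header + metadata + data)
    # fdatasync() is missing on some platforms; fall back to fsync().
    getattr(os, "fdatasync", os.fsync)(fd)
    # The K/V entry is written only after the data is durable.
    kv[key] = (volume_id, offset)
    return offset


def replay(fd):
    """Scan a volume from the start and return [(ohash, offset), ...]."""
    entries = []
    size = os.fstat(fd).st_size
    offset = 0
    while offset + HEADER.size <= size:
        raw = os.pread(fd, HEADER.size, offset)
        magic, ohash, meta_len, data_len = HEADER.unpack(raw)
        if magic != MAGIC:
            break  # unknown or damaged record: stop scanning
        entries.append((ohash, offset))
        offset += HEADER.size + meta_len + data_len
    return entries
```

If the process dies between the sync and the K/V insert, the object is on disk but unindexed; running `replay` over the volume finds it again.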
Removing an object
1. Select a volume
2. Insert a tombstone
3. fdatasync() volume
4. Insert tombstone in K/V
5. Run cleanup_ondisk_files()
   1. Punch a hole over the object
   2. Remove the old entry from K/V
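
A minimal sketch of the last two sub-steps, assuming 64-bit Linux and a filesystem that supports hole punching. Python has no wrapper for `fallocate()` with a mode argument, so this goes through `ctypes`; the two flag values come from `<linux/falloc.h>`. The names `punch_hole` and `cleanup_old_entry`, the dict used as K/V, and passing the record length in from the caller are illustrative assumptions.

```python
import ctypes
import ctypes.util
import os

FALLOC_FL_KEEP_SIZE = 0x01   # keep the file size unchanged
FALLOC_FL_PUNCH_HOLE = 0x02  # deallocate the byte range


def punch_hole(fd, offset, length):
    """Deallocate [offset, offset + length) in fd. Return True on success."""
    libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)
    fallocate = getattr(libc, "fallocate", None)
    if fallocate is None:
        return False  # not Linux
    fallocate.argtypes = [ctypes.c_int, ctypes.c_int,
                          ctypes.c_int64, ctypes.c_int64]
    fallocate.restype = ctypes.c_int
    mode = FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE
    return fallocate(fd, mode, offset, length) == 0


def cleanup_old_entry(fd, kv, key, length):
    """Punch the old object out of its volume, then drop its K/V entry."""
    _volume_id, offset = kv[key]
    punched = punch_hole(fd, offset, length)
    del kv[key]
    return punched
```

The punched range reads back as zeros and no longer consumes disk blocks, while the offsets of the other objects in the volume stay valid.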
Volume selection
 Avoid holes in volumes to reduce compaction
 Try to group objects by partition
 => rebalance is compaction
 Put short-lived objects in dedicated volumes
   tombstone
   x-delete-at soon
 Dedicated volumes for handoff?
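
One way to read these rules is as a function that maps each incoming object to a class of volumes. The sketch below is purely illustrative: the name `volume_class`, the returned tuples, and the one-day threshold for "x-delete-at soon" are assumptions, and the open question about handoff volumes is left out.

```python
import time

SHORT_LIFE_SECONDS = 24 * 3600  # hypothetical cut-off for "x-delete-at soon"


def volume_class(partition, is_tombstone=False, delete_at=None, now=None):
    """Return a key naming the group of volumes an object should go to."""
    now = time.time() if now is None else now
    if is_tombstone:
        # Tombstones are reclaimed quickly, so keep them apart.
        return ("short-lived", "tombstone")
    if delete_at is not None and delete_at - now <= SHORT_LIFE_SECONDS:
        # Objects about to expire would otherwise leave holes behind.
        return ("short-lived", "expiring")
    # Default: objects of one partition share volumes, so moving a
    # partition during rebalance empties whole volumes.
    return ("partition", partition)
```

Keeping short-lived data together means whole volumes become empty at once instead of every volume collecting scattered holes.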
Benchmarks
 Atom C2750 2.40GHz
 16GB RAM
 HGST HUS726040ALA610 (4TB)
 Directly connecting to object servers
Benchmarks
 Single-threaded PUT (100-byte objects)
 From 0 to 4 million objects
   XFS: 19.8/s
   Volumes: 26.2/s
 From 4 million to 8 million objects
   XFS: 17/s
   Volumes: 39.2/s (b/c of not creating more volumes?)
 What we see (need numbers!)
   XFS: memory is full; Volumes: memory is free
   Disks are busier with XFS
Benchmarks
 Single-threaded random GET
   XFS: 39/s
   Volumes: 93/s
Benchmarks
 Concurrent PUT, 20 threads for 10 minutes

          avg     50%     95%     99%     max
 XFS      641ms   67ms    3.5s    4.7s    5.9s
 Volumes  82ms    50ms    261ms   615ms   1.24s
Status
 Done
   HEAD/GET/PUT/DELETE/POST (replica)
 Todo
   REPLICATE/SSYNC
   Erasure Code
   XFS read compatibility
   Smarter volume selection
   Func tests on object servers (are there any?)
   Doc
