Ext4 write barrier
Current Ext4 Journaling
• On a file system (FS) failure, disk contents can be corrupted
• Journals keep data consistent across a failure by writing it twice
• Write to the journal -> commit -> write to the final location
• A failure therefore leaves each update either fully applied or not applied at all
• Ordered mode journals only metadata, but ensures file data is written to disk before the metadata that refers to it is committed
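The write -> commit -> write sequence can be illustrated with a deliberately tiny model. This is a sketch, not how ext4/jbd2 is implemented: `toy_disk`, `journal_write`, `journal_commit`, `checkpoint` and `recover` are hypothetical names, and a real journal handles many blocks per transaction. The key property is that recovery replays only entries whose commit record reached the disk, which is what makes each update all-or-nothing.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical toy model: one data block plus a one-entry journal.
 * Real ext4/jbd2 journals handle many blocks per transaction. */
typedef struct {
    char home[16];    /* block contents at its final on-disk location */
    char journal[16]; /* new contents staged in the journal */
    bool committed;   /* has the commit record reached the disk? */
} toy_disk;

/* Step 1: stage the new contents in the journal (not yet committed). */
static void journal_write(toy_disk *d, const char *data) {
    assert(strlen(data) < sizeof d->journal);
    strcpy(d->journal, data);
    d->committed = false;
}

/* Step 2: write the commit record; from now on the update must survive. */
static void journal_commit(toy_disk *d) {
    d->committed = true;
}

/* Step 3: checkpoint, i.e. copy the journaled contents to the home
 * location, then release the journal entry. */
static void checkpoint(toy_disk *d) {
    strcpy(d->home, d->journal);
    d->committed = false;
}

/* Crash recovery: replay only committed entries and ignore the rest, so in
 * this model the home block holds either the old or the new contents. */
static void recover(toy_disk *d) {
    if (d->committed)
        checkpoint(d);
}
```

A crash between `journal_write` and `journal_commit` leaves the old contents in place after `recover`; a crash after `journal_commit` but before `checkpoint` is repaired by replay.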
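Current Ext4 Journaling (cont.)
• Sometimes only the metadata needs to be journaled, not the data: i.e., corrupted file contents are acceptable, but a broken directory tree is not
• Ordered mode (data=ordered) is the default; it reduces the amount of double writing, but still allows file data to be corrupted
• Data mode (data=journal), which also journals file data, is very slow
• Unordered mode (data=writeback) exists, but is much more dangerous: metadata can be committed before the data it points to, so files may contain stale data after a crash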
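For reference, the three modes above correspond to ext4's `data=` mount options. The small table below records that mapping; the type and function names (`ext4_data_mode`, `mode_info`, `mode_lookup`, `mode_from_option`) are illustrative only, while the option strings are the ones ext4 accepts.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* The three ext4 data modes discussed on this slide (names are illustrative). */
typedef enum { MODE_JOURNAL, MODE_ORDERED, MODE_WRITEBACK } ext4_data_mode;

typedef struct {
    const char *mount_option;    /* option string accepted by ext4 */
    bool journals_file_data;     /* is file data written twice (journal + home)? */
    bool stale_data_after_crash; /* can files expose old data after recovery? */
} mode_info;

static const mode_info MODES[] = {
    [MODE_JOURNAL]   = { "data=journal",   true,  false }, /* "Data mode" */
    [MODE_ORDERED]   = { "data=ordered",   false, false }, /* default */
    [MODE_WRITEBACK] = { "data=writeback", false, true  }, /* "Unordered mode" */
};

static const mode_info *mode_lookup(ext4_data_mode m) {
    assert((unsigned)m <= (unsigned)MODE_WRITEBACK);
    return &MODES[m];
}

/* Reverse lookup: map an option string such as "data=ordered" to its mode.
 * Returns -1 for anything that is not one of the three data= options. */
static int mode_from_option(const char *opt) {
    for (int m = MODE_JOURNAL; m <= MODE_WRITEBACK; m++)
        if (strcmp(MODES[m].mount_option, opt) == 0)
            return m;
    return -1;
}
```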
Current Ext4 Journaling (cont.)
• The fsync system call explicitly flushes a file's in-memory data to disk through Ext4's journaling mechanism
• Write barriers then force a flush-to-disk command after the journal has been sent to the disk
• This ensures the journal is in the disk's non-volatile storage (instead of its volatile cache)
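From user space, the pattern on this slide looks roughly like the sketch below: data written with `write()` may sit in the page cache until `fsync()` is called. The barrier (the device cache flush) is issued by the kernel and is not visible in this code; `write_durably` is a hypothetical helper name.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical helper: write all of buf to path, then fsync() it.
 * Returns 0 on success, -1 on error (errno describes the failure). */
static int write_durably(const char *path, const void *buf, size_t len) {
    assert(path != NULL && (buf != NULL || len == 0));
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;
    const char *p = buf;
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue; /* interrupted by a signal: retry */
            goto fail;
        }
        p += n;
        len -= (size_t)n;
    }
    /* Until here the data may live only in the page cache. fsync() pushes it
     * and the file's metadata through ext4's journal; with barriers enabled,
     * the kernel also asks the device to flush its volatile write cache. */
    if (fsync(fd) < 0)
        goto fail;
    return close(fd);

fail:
    {
        int saved = errno; /* close() may overwrite errno */
        close(fd);
        errno = saved;
    }
    return -1;
}
```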
PROBLEM
• After an OTA (over-the-air) update, the NAND cache of the SSHD (solid-state hybrid drive) is filled with OTA data
• dex2oat performs ahead-of-time compilation of Android apps
• dex2oat calls fdatasync (similar to fsync) at regular intervals, causing disk flushes
• Since the NAND cache is full, every fsync causes all dirty data in the SSHD cache (up to 64 MB) to be flushed to the platter
• Each fsync therefore blocks I/O synchronously, preempting all other disk reads and writes
• The result is severe sluggishness in the user experience
Disabling write barriers
• Allows the disk to reorder cache-to-disk writes
• Disk reads are not blocked while writes are queued to the disk
• Risks:
  • On power failure, we can no longer guarantee the journal is consistent, as the contents of the volatile cache will be lost
  • Since only metadata is journaled, this can potentially introduce filesystem corruption
• However:
  • Filesystem metadata is written far less often than file data
  • The disk drive writes its cache to the platter on a timeout, so dirty data does not stay cached indefinitely
  • Power failures are uncommon for a set-top box device
Dex2oat fsync latency
• HDD mounted with barrier: 300 ms
• HDD mounted with nobarrier: 105 ms
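A rough way to reproduce this kind of comparison is to time each `fdatasync()` while writing in fixed intervals, approximating the periodic-sync pattern attributed to dex2oat. This is an assumed harness, not necessarily how the figures above were obtained; `write_with_periodic_sync`, `run_sync_benchmark` and all sizes are hypothetical.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>
#include <unistd.h>

/* Monotonic timestamp in nanoseconds. */
static int64_t now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (int64_t)ts.tv_sec * 1000000000 + ts.tv_nsec;
}

/* Hypothetical helper: write `total` zero bytes to fd in `chunk`-sized
 * pieces, calling fdatasync() whenever `sync_every` bytes have accumulated
 * (and once at the end). Records the slowest fdatasync() in *max_sync_ns. */
static int write_with_periodic_sync(int fd, size_t total, size_t chunk,
                                    size_t sync_every, int64_t *max_sync_ns) {
    static const char zeros[4096];
    assert(chunk > 0 && chunk <= sizeof zeros && sync_every > 0);
    size_t since_sync = 0;
    *max_sync_ns = 0;
    while (total > 0) {
        size_t n = total < chunk ? total : chunk;
        ssize_t w = write(fd, zeros, n);
        if (w < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        total -= (size_t)w;
        since_sync += (size_t)w;
        if (since_sync >= sync_every || total == 0) {
            int64_t t0 = now_ns();
            if (fdatasync(fd) < 0)
                return -1;
            int64_t dt = now_ns() - t0;
            if (dt > *max_sync_ns)
                *max_sync_ns = dt;
            since_sync = 0;
        }
    }
    return 0;
}

/* Runs the pattern above against a scratch file at `path`, then deletes it. */
static int run_sync_benchmark(const char *path, size_t total, size_t chunk,
                              size_t sync_every, int64_t *max_sync_ns) {
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;
    int rc = write_with_periodic_sync(fd, total, chunk, sync_every, max_sync_ns);
    close(fd);
    unlink(path);
    return rc;
}
```

Running it against a scratch path on the same partition, once mounted with `barrier` and once with `nobarrier`, and comparing the worst-case latency gives an analogous before/after measurement. On tmpfs or other filesystems not backed by a disk, the numbers will not reflect disk behavior.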
Androbench SQLite
[Chart: Transactions Per Second (TPS), HDD mounted with barrier vs. HDD mounted with nobarrier]
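Barrier vs. nobarrier comparisons like this depend on which setting is actually in effect. ext4 exposes it through the `barrier`/`nobarrier` (or `barrier=1`/`barrier=0`) mount options, so one way to confirm it is to parse `/proc/mounts`. The sketch assumes the kernel lists `nobarrier` or `barrier=0` there when barriers are off; `barriers_disabled` and `report_ext4_barriers` are hypothetical helper names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Returns true if a comma-separated ext4 option string leaves write barriers
 * disabled. ext4 accepts "barrier", "barrier=1", "nobarrier" and "barrier=0";
 * barriers are on by default, and if several appear the last one wins. */
static bool barriers_disabled(const char *opts) {
    assert(opts != NULL);
    bool disabled = false;
    for (const char *p = opts; *p != '\0';) {
        size_t len = strcspn(p, ",");
        if ((len == 9 && strncmp(p, "nobarrier", 9) == 0) ||
            (len == 9 && strncmp(p, "barrier=0", 9) == 0))
            disabled = true;
        else if ((len == 7 && strncmp(p, "barrier", 7) == 0) ||
                 (len == 9 && strncmp(p, "barrier=1", 9) == 0))
            disabled = false;
        p += len;
        if (*p == ',')
            p++;
    }
    return disabled;
}

/* Prints each ext4 mount in /proc/mounts and whether barriers are disabled. */
static int report_ext4_barriers(void) {
    FILE *f = fopen("/proc/mounts", "r");
    if (f == NULL)
        return -1;
    char line[2048], dev[256], dir[256], type[64], opts[1024];
    while (fgets(line, sizeof line, f) != NULL) {
        if (sscanf(line, "%255s %255s %63s %1023s", dev, dir, type, opts) == 4 &&
            strcmp(type, "ext4") == 0)
            printf("%s on %s: barriers %s\n", dev, dir,
                   barriers_disabled(opts) ? "disabled" : "enabled");
    }
    fclose(f);
    return 0;
}
```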
SQLite fsync latency
• eMMC mounted with barrier: 860 µs
• eMMC mounted with nobarrier: 361 µs
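Expressed as ratios (plain arithmetic on the reported latencies, for both this slide and the HDD dex2oat slide), nobarrier cut the reported fsync latency by roughly two thirds on the HDD and a little over half on the eMMC:

```latex
\begin{aligned}
\text{HDD (dex2oat):}\quad & \frac{300\,\text{ms}}{105\,\text{ms}} \approx 2.86\times, & \frac{300 - 105}{300} &= 65\%\\
\text{eMMC (SQLite):}\quad & \frac{860\,\mu\text{s}}{361\,\mu\text{s}} \approx 2.38\times, & \frac{860 - 361}{860} &\approx 58\%
\end{aligned}
```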
Androbench SQLite
[Chart: Transactions Per Second (TPS), eMMC mounted with barrier vs. eMMC mounted with nobarrier]
