The TileDB Array Data Storage
Manager
Authors:
Stavros Papadopoulos, Kushal Datta, Samuel Madden and
Timothy Mattson
Published in:
Proceedings of the VLDB Endowment, Vol. 10, Issue 4,
November 2016
Reviewer: Olamide Timothy Tawose
Preamble
Many scientific and engineering fields generate enormous amounts of data
through measurement, simulation, and experimentation. Examples of data
produced in such fields include astronomical images, DNA sequences, geo-
locations, social relationships, and so on.
All of these data are naturally represented as multi-dimensional arrays, which
can be either dense (when every array element has a value) or sparse (when
the majority of the array elements are empty, i.e., zero or null).
Scientific array data can be very large, containing billions of non-null values
that do not readily fit into the memory of a single or even multiple machines.
Simply storing arrays as files forces application programmers to handle many
issues, including array representation on disk (i.e., sparse vs. dense layouts),
compression, parallel access, and performance.
Alternatively, these issues can be handled by optimized, special-purpose
array data storage management systems, which perform complex analytics
on scientific data. Central to such systems are efficient data access primitives
to read and write arrays.
Objectives
The authors of this work developed a novel storage manager for
multi-dimensional arrays for the scientific data management
system called TileDB.
In contrast to existing solutions, TileDB is optimized for both
dense and sparse arrays. Its key idea is to organize array
elements into ordered collections called fragments.
TileDB is the first storage manager truly optimized for both
forms of multi-dimensional arrays (i.e., dense and sparse).
Existing Array Management Systems
 HDF5: HDF5 is a well-known array data storage manager. It is
a dense array format. HDF5 groups array elements into
regular hyperrectangles, called chunks, which are stored on
the disk in binary format in a single large file.
Shortcomings:
i) It does not efficiently capture sparse arrays.
ii) HDF5 is optimized for in-place writes of large blocks. In-place
updates result in poor performance when writing small blocks
of elements that are randomly distributed, due to the
expensive random disk accesses they incur.
Existing Array Management Systems
 Parallel HDF5: It is a parallel version of HDF5 with some
additional limitations.
Shortcomings:
i) it does not allow concurrent writes to compressed data.
ii) it does not support variable-length element values.
iii) operation atomicity requires some coding effort from the user and imposes
extra ordering semantics.
 Vertica: Alternatively, relational databases (e.g., Vertica) have
been used as the storage backend for array management
(e.g., RAM for dense and SRAM for sparse arrays), storing
non-empty elements as records and explicitly encoding the
element indices as extra table columns. However, explicitly
encoding the indices leads to poor performance in the case of
dense arrays.
Existing Array Management Systems
 SciDB: It is a popular array database, which, however, also has
significant storage management limitations.
Shortcomings:
i) It is not truly designed for sparse arrays as it relies on regular
dimensional chunking (similar to HDF5).
ii) It also requires reading an entire chunk even if a small portion
of it is requested, and updating an entire chunk even when a
single element changes, leading to poor performance in the
case of random writes.
Overview of TileDB
 Data Model: TileDB provides an array-oriented data model that is similar to that of SciDB. An
array consists of dimensions and attributes. Each dimension has a domain, and all dimension
domains collectively define the logical space of the array. A cell may either be empty (null) or
contain a tuple of attribute values. Each attribute is either a primitive type (int, float, or char), or
a fixed- or variable-sized vector of a primitive type. Arrays in TileDB can be dense or sparse. All
TileDB operations can be performed on either type of array, but the choice of dense vs. sparse
can significantly impact application performance.
 Figure below shows two example arrays, one dense and one sparse, both with two dimensions,
rows and columns. Both dimensions have domain [1,4]. The arrays have two attributes, a1 of
type int32 and a2 of type variable char. Each cell is identified by its coordinates, i.e., a (rows,
columns) pair. In the examples, empty cells are shown in white, and non-empty cells in gray. In
the dense array, every cell has an attribute tuple, e.g., cell (4,4) contains <15; pppp>, whereas
several cells in the sparse array are empty, e.g., cell (4,1). A small sketch of this dense/sparse
distinction follows the figure.
Figure:
The Logical array view
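To make the dense/sparse data model concrete, here is a minimal sketch in plain Python (illustrative data structures only, not the TileDB API); apart from cell (4,4) in the dense array and the empty cell (4,1) in the sparse array, the values are hypothetical fillers.

    # Dense array over domain [1,4] x [1,4]: every cell holds an attribute tuple
    # (a1: int32, a2: variable-length char).
    dense = {(r, c): (0, "") for r in range(1, 5) for c in range(1, 5)}
    dense[(4, 4)] = (15, "pppp")

    # Sparse array over the same domain: only non-empty cells are materialized,
    # keyed by their (rows, columns) coordinates; cell (4,1) simply has no entry.
    sparse = {
        (1, 2): (3, "bb"),   # hypothetical non-empty cell
        (3, 3): (9, "ddd"),  # hypothetical non-empty cell
    }

    def read_cell(array, coords, is_dense):
        """Return the attribute tuple at coords, or None if the cell is empty."""
        return array[coords] if is_dense else array.get(coords)

    print(read_cell(dense, (4, 4), True))    # (15, 'pppp')
    print(read_cell(sparse, (4, 1), False))  # None -> empty cell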
Overview of TileDB
 Compression: TileDB employs tile-based compression. Additionally, it
allows the user to select different compression schemes on a per-attribute
basis, as attributes are stored separately.
The API supports defining different compression schemes for each
attribute, but currently only gzip is implemented. The current version of
TileDB does not yet support other compression schemes, such as RLE and LZ, or
user-defined schemes.
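To illustrate what per-attribute, tile-based compression means, the sketch below compresses each attribute's buffer of one tile separately, using Python's zlib as a stand-in for gzip; this is a conceptual illustration, not TileDB's actual storage code.

    import struct
    import zlib

    # One tile's worth of cell values, stored attribute-by-attribute
    # (a1 is int32, a2 is variable-length char).
    a1_values = [3, 1, 15, 7]
    a2_values = ["a", "bb", "pppp", "ccc"]

    # Pack each attribute into its own byte buffer (a2 uses a simple
    # separator-based encoding purely for illustration).
    a1_buf = struct.pack("<%di" % len(a1_values), *a1_values)
    a2_buf = "\x00".join(a2_values).encode("ascii")

    # Compress each attribute buffer independently at level 6; in principle a
    # different scheme could be chosen per attribute.
    compressed = {
        "a1": zlib.compress(a1_buf, 6),
        "a2": zlib.compress(a2_buf, 6),
    }

    for name, blob in compressed.items():
        print(name, "->", len(blob), "compressed bytes")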
Overview of TileDB
 The TileDB storage manager organizes array elements into ordered
collections called fragments in order to support both dense and sparse multi-
dimensional arrays. Fragments can themselves be either dense or sparse.
A fragment is a timestamped snapshot of a batch of array updates, i.e., a collection of
array modifications carried out via write operations and made visible at a particular
time instant. For instance, the initial loading of the data into the array constitutes
the first array fragment. If at a later time a set of cells is modified, then these cells
constitute the second array fragment, and so on. In that sense, an array is
comprised of a collection of array fragments, each of which can be regarded as a
separate array, whose collective logical overlap composes the current logical view
of the array.
Dense fragments are used only with dense arrays, but sparse fragments may be
applied to both dense and sparse arrays.
 The concept of fragments dramatically improves write performance as writes are
performed in batches, and each batch (which may consist of many small, random
writes) is written to a separate fragment sequentially.
 The organization into fragments turns random writes into sequential writes, which is central to
TileDB's fast update performance (a minimal sketch follows below).
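The following sketch captures the fragment idea in plain Python (not TileDB's on-disk format): each write batch, including the initial load, becomes its own timestamped fragment that is appended sequentially rather than applied in place.

    import time

    fragments = []  # ordered from oldest to newest

    def write_batch(updates):
        """Persist one batch of updates (coords -> attribute tuple) as a new fragment."""
        fragments.append({"timestamp": time.time(), "cells": dict(updates)})

    # Initial load of the array = first fragment.
    write_batch({(r, c): (0, "") for r in range(1, 5) for c in range(1, 5)})
    # A later batch of small, random updates = second fragment.
    write_batch({(2, 3): (42, "xyz"), (4, 1): (7, "q")})

    print(len(fragments), "fragments written sequentially")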
Fragment Examples
Figure below shows an example of an array with three fragments; the first two are dense and the
third is sparse. Observe that the second fragment is dense within a hyper-rectangular subarray,
whereas any cell outside this subarray is empty. The figure also illustrates the collective logical view
of the array; the cells of the most recent fragments overwrite those of the older ones.
Fragments are the key concept that enables TileDB to perform rapid writes.
Overview of TileDB
 The TileDB storage manager implements an efficient read algorithm that finds
the most recent fragment that updated each returned element, in situations where
multiple fragments may cover the same logical region of an array and some
fragments may be sparse.
This read algorithm allows TileDB to efficiently access all fragments, skipping data that does
not qualify for the result. The algorithm is slightly different for dense and sparse
arrays.
 Also, the TileDB storage manager employs a consolidation algorithm that merges
multiple fragments into a single one, to tackle the read performance degradation that occurs
when the number of fragments grows (see the sketch after this list).
 TileDB supports parallelization via multi-threading and multi-processing,
offering thread-/process-safety and atomicity via lightweight locking. TileDB
delivers better performance than existing state-of-the-art systems, thanks to its
efficient storage manager.
 Moreover, it guarantees operation atomicity without imposing any additional
ordering constraints (in contrast to parallel implementations of HDF5, like
PHDF5).
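The sketch below (again plain Python, not TileDB's actual algorithms) illustrates the two mechanisms above: a read that resolves each cell to the most recent fragment that wrote it, and a consolidation step that merges all fragments into one.

    # Fragments are ordered from oldest to newest; the newest value wins.
    fragments = [
        {"cells": {(1, 1): (3, "a"), (2, 3): (5, "bb")}},    # older fragment
        {"cells": {(2, 3): (42, "xyz"), (4, 1): (7, "q")}},  # newer fragment
    ]

    def read(coords):
        """Return the value from the most recent fragment containing coords."""
        for fragment in reversed(fragments):  # newest first
            if coords in fragment["cells"]:
                return fragment["cells"][coords]
        return None  # empty cell in the collective logical view

    def consolidate(fragment_list):
        """Merge fragments into one, keeping the most recent value per cell."""
        merged = {}
        for fragment in fragment_list:        # oldest to newest
            merged.update(fragment["cells"])  # newer values overwrite older ones
        return [{"cells": merged}]

    print(read((2, 3)))                 # (42, 'xyz') -- the newer fragment wins
    fragments = consolidate(fragments)
    print(len(fragments), "fragment after consolidation")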
Experiments
Competitors: TileDB was compared against three competitors:
HDF5 (for dense arrays), SciDB (for dense and sparse
arrays), and Vertica (for dense and sparse arrays). For all parallel
experiments, they used the parallel version of HDF5, namely
PHDF5.
System configuration: All experiments were performed on an Intel x86_64
platform with a 2.3 GHz 36-core CPU and 128 GB of RAM, running CentOS 6. They
used a 4 TB, 7200 rpm Western Digital HDD and a 480 GB Intel SSD, neither RAID-
ed and both formatted with the ext4 file system. Because the HDD is I/O bound even with a
single thread, they ran the serial experiments on the HDD. Because the SSD is much
faster, they ran the parallel experiments on the SSD.
For TileDB, SciDB and HDF5, they experimented both with and without gzip
compression, using compression level 6. Note: PHDF5 does not support parallel load or
updates on compressed datasets. In addition, Vertica cannot be used without
compression; the default compression method is RLE, which is currently not supported
by TileDB. Therefore, they used Vertica with gzip (compression level 6).
Experiments
Also, Vertica by default uses all the available cores in the system. In order to vary the
number of threads used in the parallel experiments, they simply shut down a subset of
the cores at the OS level.
Datasets: For dense arrays, they constructed synthetic 2D arrays of variable sizes,
with a single int32 attribute.
For the case of sparse arrays, they used datasets retrieved from the AIS database,
which was collected by the National Oceanic and Atmospheric Administration by
tracking vessels in U.S. and international waters. They extracted the attributes X
(longitude), Y (latitude), SOG, COG, Heading, ROT, Status, VoyageID, and
MMSI. They used X and Y as the dimensions (i.e., they created a 2D array), and the
remaining fields as the attributes. SciDB does not support real-valued dimensions, so
they converted X and Y into int64.
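The slides only state that X and Y were converted to int64; one plausible way to do this, shown purely as an assumed illustration (the exact method used in the paper is not given here), is to scale and round the coordinates to a fixed precision.

    SCALE = 10**6  # hypothetical precision: one integer step per 1e-6 degrees

    def to_int64_dim(degrees: float) -> int:
        """Map a real-valued coordinate (longitude/latitude) to an integer index."""
        return int(round(degrees * SCALE))

    # Example AIS-style point: X and Y become the 2D dimensions; SOG, COG,
    # Heading, ROT, Status, VoyageID and MMSI would remain attributes.
    x, y = -70.123456, 42.654321
    print(to_int64_dim(x), to_int64_dim(y))  # -70123456 42654321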
Dense Arrays: TileDB vs HDF5, SciDB and Vertica
a. Load: Figure below shows the time to load a dense array into TileDB, HDF5 and
SciDB, as well as their gzip-compressed versions, denoted TileDB+Z, HDF5+Z and
SciDB+Z, respectively.
In this experiment, only one CPU core was activated and each system ran on a single
instance. They generated 2D synthetic arrays with sizes 4 GB, 8 GB, and 16 GB,
and dimension domains 50,000 × 20,000, 50,000 × 40,000, and 100,000 × 40,000,
respectively.
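As a quick check that these domains match the stated sizes (assuming a single 4-byte int32 attribute per cell and decimal gigabytes):

    # Each cell stores one 4-byte int32 value; 1 GB = 1e9 bytes.
    for rows, cols in [(50_000, 20_000), (50_000, 40_000), (100_000, 40_000)]:
        print(f"{rows:,} x {cols:,} -> {rows * cols * 4 / 1e9:.0f} GB")
    # 50,000 x 20,000  ->  4 GB
    # 50,000 x 40,000  ->  8 GB
    # 100,000 x 40,000 -> 16 GB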
TileDB matches the performance of HDF5 (it takes only 60s to load 4GB), and it is
consistently more than an order of magnitude faster than SciDB. A dollar ($)
symbol indicates that the system did not manage to terminate within the shown
time.
However, the performance of TileDB+Z, HDF5+Z and SciDB+Z deteriorates, mainly due to the
CPU cost of gzip compression.
Figure: Load performance of dense
array
Dense Arrays: TileDB vs HDF5, SciDB and Vertica
a. Load: Figure below assesses the cost of loading data in parallel, for the 4GB array.
In this experiment, as many CPU cores were activated as the number of tested instances.
They excluded HDF5+Z, since PHDF5 does not allow parallel writes with
compression.
TileDB and HDF5 are unaffected by the level of parallelism, as they are I/O bound.
Moreover, each process writes large data chunks at a time, monopolizing the
internal parallelism of the SSD, and thus no extra savings are noticed with parallel
processes.
SciDB seems to scale because it is CPU bound. Moreover, the compressed versions
of all systems scale nicely, since they are CPU bound as well due to the CPU cost
of gzip compression.
TileDB and HDF5 match in performance, while TileDB and TileDB+Z remain substantially
faster than SciDB and SciDB+Z.
Figure: Load performance of dense
array
Dense Arrays: TileDB vs HDF5, SciDB and Vertica
b. Update: Figure below shows the time to perform random element updates to the 4
GB array, as a function of the number of updated elements.
They excluded HDF5+Z as it does not support updates on compressed data.
TileDB is up to more than 2 orders of magnitude faster than HDF5, with 100K updates
running in less than 1s in TileDB vs. more than 100s in HDF5, and more than 4 orders
of magnitude faster than SciDB. This is due to the sequential, fragment-based writes of
TileDB, as opposed to the in-place updates in HDF5 and the chunk-based updates in
SciDB. The performance gap increases with the number of updated elements.
Figure: Random update
performance of dense arrays
Dense Arrays: TileDB vs HDF5, SciDB and Vertica
b. Update: Figure below illustrates the update cost on SSDs when varying the number
of instances and fixing the number of element updates to 100K.
They split the updates equally into separate binary files, one for each instance.
TileDB and HDF5 are unaffected by the number of instances and, thus, all the previous
observations hold. SciDB performance improves with the number of instances.
However, it is still up to more than 3 orders of magnitude slower than TileDB.
Figure: Random update
performance of dense arrays
Sparse Arrays: TileDB vs SciDB and Vertica
The next focus is on sparse arrays, comparing TileDB with Vertica+Z (gzip-compressed
and following SRAM) and SciDB on the AIS dataset. HDF5 is not optimized for sparse
arrays, so it was omitted from these experiments.
a. Load: Figure below shows the load performance of the three systems. They created
three datasets of sizes 6 GB, 12 GB, and 24 GB by loading from the AIS dataset
(originally in CSV format), after de-duplicating the coordinates.
TileDB is once again more than an order of magnitude faster than SciDB. Moreover,
TileDB+Z is always slightly faster than Vertica+Z.
Figure: Load performance of sparse
arrays
Sparse Arrays: TileDB vs SciDB and Vertica
a. Load: Figure below plots the time to load the 6 GB dataset in parallel, versus the
number of instances. Note that Vertica by default uses all the cores in the system.
Thus, when they varied the number of instances, they activated only as many CPU
cores as the number of tested instances.
TileDB, TileDB+Z and Vertica+Z scale nicely with the number of instances. The
performance differences among the systems are similar to what was explained above
for the serial case.
Figure: Load performance of sparse
arrays
Conclusion: Key Takeaways
The evaluation reveals the following main results:
(i) TileDB is several orders of magnitude faster than HDF5 in the case of random
element updates for dense arrays, and at least as efficient in other settings, offering up
to 2x better performance on compressed data;
(ii) TileDB outperforms SciDB in all settings for both dense and sparse arrays, generally
by several orders of magnitude;
(iii) TileDB is 2x-40x faster than Vertica in the dense case. It is also at least as fast as
Vertica in the sparse case, featuring up to more than 2x faster parallel reads;
(iv) the performance of the read algorithm in TileDB is robust up to a large number of
fragments, whereas the consolidation mechanism requires only marginally more time than
the initial loading;
(v) TileDB exhibits excellent scalability as the dataset size and the level of parallelism
increase.
Future Works
The dense-array experiments use only synthetic 2D data with
integer attributes. A more convincing evaluation would include
higher-dimensional, real-world data. Also, TileDB should be
run in configurations that are more directly comparable to its
competitors.
In addition, not all compression schemes are currently
supported by the TileDB API. The current implementation
only provides gzip, while other schemes such as
Run-Length Encoding (RLE) and Lempel-Ziv (LZ) are
not supported.
THANK YOU FOR LISTENING