This document discusses using PyTables to analyze large datasets. PyTables is built on HDF5 and uses NumPy to provide an object-oriented interface for efficiently browsing, processing, and querying very large amounts of data. It addresses the CPU-starvation problem with techniques such as caching and compression, and with high-performance libraries such as Numexpr and Blosc, to minimize data-transfer times. PyTables allows fast querying of data through flexible iterators and indexing, making it easy to extract the important information from large datasets.
2. Personal Profile:
Ali Hallaji
Parallel Processing and Large Data Analysis
Senior Python Developer at innfinision Cloud Solutions
Ali.Hallaji@innfinision.net
Innfinision.net
3. innfinision Cloud Solutions:
Providing Cloud, Virtualization and Data Center Solutions
Developing Software for Cloud Environments
Providing Services to Telecom, Education, Broadcasting & Health Fields
Supporting OpenStack Foundation as the First Iranian Company
First Supporter of IRAN OpenStack Community
4. Large Data Analysis with PyTables (innfinision.net)
Outline
What is PyTables?
Numexpr & NumPy
Compressing Data
What is HDF5?
Querying your data in many different ways, fast
Design goals
Agenda:
6. Outline
The Starving CPU Problem
Getting the Most Out of Computers
Caches and Data Locality
Techniques for Fighting Data Starvation
High Performance Libraries
Why Should You Use Them?
In-Core High Performance Libraries
Out-of-Core High Performance Libraries
7. Getting the Most Out of Computers
8. Getting the Most Out of Computers
Computers nowadays are very powerful:
Extremely fast CPUs (multicores)
Large amounts of RAM
Huge disk capacities
But they are facing a pervasive problem:
An ever-increasing mismatch between CPU, memory and disk speeds (the so-called Starving CPU problem).
This introduces tremendous difficulties in getting the most out of computers.
9. CPU vs Memory cycle Trend
Cycle time is the time, usually measured in nanoseconds, from the start of one random-access memory (RAM) access to the time when the next access can be started.
History
In the 1970s and 1980s the memory subsystem was able to
deliver all the data that processors required in time.
In the good old days, the processor was the key bottleneck.
But in the 1990s things started to change...
10. CPU vs Memory cycle Trend
[Chart: CPU vs memory cycle-time trend]
11. The CPU Starvation Problem
Known facts (in 2010):
Memory latency is much higher (around 250x) than processor cycle times, and it has been an essential bottleneck for the past twenty years.
Memory throughput is improving at a better rate than memory latency, but it is also much slower than processors (about 25x).
The result is that CPUs in our current computers are suffering from
a serious data starvation problem: they could consume (much!)
more data than the system can possibly deliver.
12. What Is the Industry Doing to Alleviate CPU Starvation?
They are improving memory throughput: cheap to implement
(more data is transmitted on each clock cycle).
They are adding big caches in the CPU dies.
13. Why Is a Cache Useful?
Caches are closer to the processor (normally in the same die),
so both the latency and throughput are improved.
However: the faster they run the smaller they must be.
They are effective mainly in a couple of scenarios:
Temporal locality: when the dataset is reused.
Spatial locality: when the dataset is accessed sequentially.
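The effect of spatial locality can be sketched with a small NumPy example (an illustration written for this document, not from the original slides): traversing a C-ordered 2-D array row by row touches memory sequentially, so every cache line fetched from RAM is fully used, while column-wise traversal jumps a whole row length between consecutive elements and wastes most of each cache line.

```python
# Spatial-locality sketch: row-major vs column-major traversal of a
# C-ordered NumPy array. Both orders compute the same total; only the
# memory-access pattern (and hence cache efficiency) differs.
import numpy as np

a = np.arange(1_000_000, dtype=np.float64).reshape(1000, 1000)

# Sequential (cache-friendly) access: rows of a C-ordered array are
# contiguous in memory.
row_sum = sum(row.sum() for row in a)

# Strided (cache-unfriendly) access: each column is spread across memory
# with a stride of 1000 elements.
col_sum = sum(a[:, j].sum() for j in range(a.shape[1]))

assert row_sum == col_sum
```

Timing the two loops (e.g. with `timeit`) typically shows the row-wise traversal running noticeably faster, which is exactly the cache effect described above.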
16. Why High Performance Libraries?
High-performance libraries are written by people who know the different optimization techniques very well.
You may be tempted to create original algorithms that can be faster than these, but in general it is very difficult to beat them.
In some cases, it may take some time to get used to them, but
the effort pays off in the long run.
17. Some In-Core High Performance Libraries
ATLAS / MKL (Intel's Math Kernel Library): use memory-efficient algorithms as well as SIMD and multi-core implementations of linear algebra operations.
VML (Intel's Vector Math Library): uses SIMD and multi-core to compute basic math functions (sin, cos, exp, log...) on vectors.
Numexpr: performs potentially complex operations with NumPy arrays without the overhead of temporaries. Can make use of multiple cores.
Blosc: a multi-threaded compressor that can transmit data from caches to memory, and back, at speeds that can exceed those of a plain OS memcpy().
18. What is PyTables?
19. PyTables
PyTables is a package for managing hierarchical datasets, designed to cope efficiently and easily with extremely large amounts of data. You can download PyTables and use it for free. You can access documentation, some examples of use and presentations in the HowToUse section.
PyTables is built on top of the HDF5 library, using the Python language and the NumPy package. It features an object-oriented interface that, combined with C extensions for the performance-critical parts of the code (generated using Cython), makes it a fast yet extremely easy to use tool for interactively browsing, processing and searching very large amounts of data. One important feature of PyTables is that it optimizes memory and disk resources so that data takes much less space (especially if on-the-fly compression is used) than other solutions, such as relational or object-oriented databases.
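As a minimal sketch of the workflow described above (assuming PyTables 3.x is installed as the `tables` package; the file name and column layout are made up for illustration): define a table description, write some rows to an HDF5 file, and read them back.

```python
# Minimal PyTables sketch: create an HDF5 file, define a table,
# append rows, and read the data back.
import tables

class Particle(tables.IsDescription):
    name = tables.StringCol(16)    # fixed-size string column
    energy = tables.Float64Col()   # double-precision column

with tables.open_file("demo.h5", mode="w", title="Demo file") as h5:
    table = h5.create_table("/", "particles", Particle, "Particle data")
    row = table.row
    for i in range(10):
        row["name"] = f"p{i}"
        row["energy"] = float(i) ** 2
        row.append()               # buffered: rows are written in chunks
    table.flush()

with tables.open_file("demo.h5", mode="r") as h5:
    energies = [r["energy"] for r in h5.root.particles]
```

The buffered `row.append()` / `table.flush()` pattern is what lets PyTables write data in cache-friendly chunks rather than one record at a time.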
21. Numexpr: Dealing with Complex Expressions
Uses a specialized virtual machine for evaluating expressions.
It accelerates computations by using blocking and by avoiding temporaries.
Multi-threaded: can use several cores automatically.
It has support for Intel's VML (Vector Math Library), so you can accelerate the evaluation of transcendental functions (sin, cos, atanh, sqrt...) too.
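A small sketch of how Numexpr is used (assuming the `numexpr` package is installed; the arrays and expression are illustrative): the whole expression is handed over as a string, compiled for Numexpr's block-based virtual machine, and evaluated in cache-sized chunks across cores, without materializing intermediate arrays.

```python
# Numexpr sketch: evaluate a compound expression without the temporary
# arrays that plain NumPy would allocate for each sub-expression.
import numexpr as ne
import numpy as np

a = np.linspace(0.0, 1.0, 1_000_000)
b = np.linspace(1.0, 2.0, 1_000_000)

# Plain NumPy allocates temporaries for 2*a, b**10 and 4*b.
numpy_result = 2 * a + b ** 10 - 4 * b

# Numexpr compiles the string once and streams the data in blocks.
ne_result = ne.evaluate("2*a + b**10 - 4*b")

assert np.allclose(numpy_result, ne_result)
```

Because each block fits in cache, the data is reused while it is still close to the CPU, which is exactly the temporal-locality effect discussed earlier.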
22. NumPy: A Powerful Data Container for Python
NumPy provides a very powerful, object-oriented, multidimensional data container:
array[index]: retrieves a portion of a data container
(array1**3 / array2) - sin(array3): evaluates potentially complex expressions
numpy.dot(array1, array2): access to optimized BLAS (*GEMM) functions
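The three bullets above can be made concrete with a tiny example (the array values are made up for illustration):

```python
# NumPy as a data container: slicing, expression evaluation, and
# access to optimized BLAS routines.
import numpy as np

array1 = np.array([1.0, 2.0, 3.0])
array2 = np.array([2.0, 2.0, 2.0])
array3 = np.array([0.0, 0.0, 0.0])

# array[index]: retrieve a portion of the container
part = array1[1:]                       # the last two elements

# Evaluate a potentially complex expression element-wise
expr = (array1 ** 3 / array2) - np.sin(array3)

# numpy.dot: optimized BLAS (*GEMM for 2-D operands)
prod = np.dot(array1, array2)           # 1*2 + 2*2 + 3*2 = 12.0
```

Note that the middle expression allocates a temporary array for every intermediate result, which is precisely the overhead Numexpr avoids.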
27. Why Compression
Lets you store more data using the same space
Uses more CPU, but CPU time is cheap compared with disk access
Different compressors for different uses:
Bzip2, zlib, lzo, Blosc
29. Why Compression
Less data needs to be transmitted to the CPU
Transmission + decompression faster than direct transfer?
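A sketch of how compression is enabled in PyTables (assuming PyTables 3.x with its bundled Blosc compressor; the file name and data are illustrative): a `Filters` object attached at creation time makes PyTables compress chunks transparently on write and decompress them on read.

```python
# On-the-fly compression sketch: store a compressible array with Blosc
# and verify that it round-trips unchanged.
import numpy as np
import tables

data = np.zeros(1_000_000, dtype=np.float64)   # highly compressible

filters = tables.Filters(complevel=5, complib="blosc")
with tables.open_file("compressed.h5", mode="w") as h5:
    h5.create_carray("/", "data", obj=data, filters=filters)

with tables.open_file("compressed.h5", mode="r") as h5:
    restored = h5.root.data[:]   # decompressed transparently on read

assert np.array_equal(data, restored)
```

Swapping `complib` for `"zlib"`, `"lzo"` or `"bzip2"` selects one of the other compressors listed above; Blosc is the one tuned for speed over ratio.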
31. What is HDF5?
32. HDF5
HDF5 is a data model, library, and file format for storing and managing data. It
supports an unlimited variety of datatypes, and is designed for flexible and
efficient I/O and for high volume and complex data. HDF5 is portable and is
extensible, allowing applications to evolve in their use of HDF5. The HDF5
Technology suite includes tools and applications for managing, manipulating,
viewing, and analyzing data in the HDF5 format.
33. The HDF5 technology suite includes:
A versatile data model that can represent very complex data objects and a wide
variety of metadata.
A completely portable file format with no limit on the number or size of data
objects in the collection.
A software library that runs on a range of computational platforms, from laptops to
massively parallel systems, and implements a high-level API with C, C++,
Fortran 90, and Java interfaces.
A rich set of integrated performance features that allow for access time and
storage space optimizations.
Tools and applications for managing, manipulating, viewing, and analyzing the data
in the collection.
34. Data structures
High level of flexibility for structuring your data:
Datatypes: scalars (numerical & strings), records, enumerated, time...
Tables support multidimensional cells and nested records
Multidimensional arrays
Variable-length arrays
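The flexibility listed above can be sketched with a nested table description (assuming PyTables 3.x; the column names and file are made up for illustration): a single table cell can hold a multidimensional value, and records can nest inside records.

```python
# Nested table description sketch: time datatype, a multidimensional
# cell, and a nested record, all in one table row.
import tables

class Reading(tables.IsDescription):
    timestamp = tables.Time64Col()            # time datatype
    samples = tables.Float64Col(shape=(4,))   # multidimensional cell

    class sensor(tables.IsDescription):       # nested record
        sid = tables.Int32Col()
        label = tables.StringCol(8)

with tables.open_file("nested.h5", mode="w") as h5:
    t = h5.create_table("/", "readings", Reading)
    row = t.row
    row["timestamp"] = 0.0
    row["samples"] = [1.0, 2.0, 3.0, 4.0]
    row["sensor/sid"] = 7          # slash notation addresses nested fields
    row["sensor/label"] = "s7"
    row.append()
    t.flush()
    sid = t[0]["sensor"]["sid"]    # read a nested field back
```

The slash notation (`"sensor/sid"`) is how PyTables addresses fields inside nested records when writing rows.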
37. Querying your data in many different ways, fast
38. PyTables Query
One characteristic that sets PyTables apart from similar tools is its capability to perform extremely fast queries on your tables, in order to facilitate as much as possible your main goal: getting important information *out* of your datasets.
PyTables achieves this via a very flexible and efficient query iterator, Table.where(). This, in combination with OPSI, the powerful indexing engine that comes with PyTables, and the efficiency of underlying tools like NumPy, HDF5, Numexpr and Blosc, makes PyTables one of the fastest and most powerful query engines available.
39. Different query modes
Regular query:
[ r[c1] for r in table if r[c2] > 2.1 and r[c3] == True ]
In-kernel query:
[ r[c1] for r in table.where('(c2 > 2.1) & (c3 == True)') ]
Indexed query:
table.cols.c2.create_index()
table.cols.c3.create_index()
[ r[c1] for r in table.where('(c2 > 2.1) & (c3 == True)') ]
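The three query modes sketched above can be put into one runnable example (assuming PyTables 3.x, where the indexing call is spelled `create_index()`; the table layout and file name are made up for illustration). The regular query decompresses and inspects every row in Python; the in-kernel query hands the condition string to Numexpr inside the PyTables kernel; the indexed query uses OPSI indexes to skip chunks that cannot match.

```python
# The three PyTables query modes on the same small table.
import tables

class RowDesc(tables.IsDescription):
    c1 = tables.Int32Col()
    c2 = tables.Float64Col()
    c3 = tables.BoolCol()

with tables.open_file("query.h5", mode="w") as h5:
    table = h5.create_table("/", "t", RowDesc)
    row = table.row
    for i in range(100):
        row["c1"] = i
        row["c2"] = i / 10.0
        row["c3"] = (i % 2 == 0)
        row.append()
    table.flush()

    # Regular query: plain Python iteration over every row.
    regular = [r["c1"] for r in table if r["c2"] > 2.1 and r["c3"]]

    # In-kernel query: the condition is compiled by Numexpr and
    # evaluated on whole chunks inside the PyTables kernel.
    inkernel = [r["c1"] for r in table.where("(c2 > 2.1) & (c3 == True)")]

    # Indexed query: same call, but OPSI indexes on c2 and c3 let
    # PyTables avoid scanning non-matching chunks.
    table.cols.c2.create_index()
    table.cols.c3.create_index()
    indexed = [r["c1"] for r in table.where("(c2 > 2.1) & (c3 == True)")]

assert regular == inkernel == indexed
```

All three return the same rows; the difference is purely in how much data has to travel to the CPU and how much work Python itself does per row.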
40. This presentation has been collected from several other presentations (PyTables presentations). For more presentations, refer to this link: http://pytables.org/moin/HowToUse#Presentations.