The document discusses the origins and history of large-scale array-oriented computing with Python. It describes how NumPy emerged from earlier Python scientific computing packages like Numeric and Numarray. NumPy provides fast operations on multi-dimensional arrays through vectorization and has become a fundamental tool for scientific computing in Python. The document also outlines ideas for future improvements to NumPy and scientific Python packages through projects like Blaze which aim to provide out-of-core and distributed computing capabilities.
Large-scale Array-oriented Computing with Python
1. Large-scale array-oriented
computing with Python
Travis E. Oliphant
PyCon Taiwan, June 9, 2012
Friday, June 8, 12
3. My Roots
Images from BYU Mers Lab
4. Science led to Python
$\rho_0 (2\pi f)^2 \, U_i(a, f) = [C_{ijkl}(a, f) \, U_{k,l}(a, f)]_{,j}$
Raja Muthupillai
Richard Ehman
1997
Armando Manduca
9. Brief History
Person                                      Package                   Year
Jim Fulton                                  Matrix Object in Python   1994
Jim Hugunin                                 Numeric                   1995
Perry Greenfield, Rick White, Todd Miller   Numarray                  2001
Travis Oliphant                             NumPy                     2005
10. 1999 : Early SciPy emerges
Discussions on the matrix-sig from 1997 to 1999 called for a complete data analysis
environment: Paul Barrett, Joe Harrington, Perry Greenfield, Paul Dubois, Konrad Hinsen,
and others. Activity in 1998 led to increased interest in 1999.
In response, on 15 Jan 1999, I posted to matrix-sig a list of routines I felt needed to be
present and began wrapping / writing in earnest. On 6 April 1999, I announced I would
be creating this uber-package, which eventually became SciPy.
Package / activity                          Date
Gaussian quadrature                         5 Jan 1999
cephes 1.0                                  30 Jan 1999
sigtools 0.40                               23 Feb 1999
Numeric docs                                March 1999
cephes 1.1                                  9 Mar 1999
multipack 0.3                               13 Apr 1999
Helper routines                             14 Apr 1999
multipack 0.6 (leastsq, ode, fsolve, quad)  29 Apr 1999
sparse plan described                       30 May 1999
multipack 0.7                               14 Jun 1999
SparsePy 0.1                                5 Nov 1999
cephes 1.2 (vectorize)                      29 Dec 1999
Plotting??: Gist, XPLOT, DISLIN, Gnuplot
Helping with f2py
11. SciPy 2001
Founded in 2001 with Travis Vaught
Travis Oliphant: optimize, sparse, interpolate, integrate, special, signal, stats, fftpack, misc
Eric Jones: weave, cluster, GA*
Pearu Peterson: linalg, interpolate, f2py
12. Community effort
• Chuck Harris
• Pauli Virtanen
• David Cournapeau
• Stefan van der Walt
• Dag Sverre Seljebotn
• Robert Kern
• Warren Weckesser
• Ralf Gommers
• Mark Wiebe
• Nathaniel Smith
13. Why Python for Technical Computing
• Syntax (it gets out of your way)
• Over-loadable operators
• Complex numbers built-in early
• Just enough language support for arrays
• Occasional programmers can grok it
• Supports multiple programming styles
• Expert programmers can also use it effectively
• Has a simple, extensible implementation
• General-purpose language --- can build a system
• Critical mass
14. What is wrong with Python?
• Packaging is still not solved well (distribute, pip, and distutils2 don't cut it)
• Missing anonymous blocks
• The CPython run-time is aged and needs an overhaul (GIL, global variables, lack of dynamic compilation support)
• No approach to language extension except for import hooks (lightweight DSL need)
• The distraction of multiple run-times...
• Array-oriented and NumPy not really understood by most Python devs.
15. Putting Science back in Comp Sci
• Much of the software stack is for systems programming --- C++, Java, .NET, ObjC, web
  - Complex numbers?
  - Vectorized primitives?
• Array-oriented programming has been supplanted by Object-oriented programming
• Software stack for scientists is not as helpful as it should be
• Fortran is still where many scientists end up
20. NumPy: an Array-Oriented Extension
• Data: the array object
  - slicing and shaping
  - data-type map to Bytes
• Fast Math:
  - vectorization
  - broadcasting
  - aggregations
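A minimal sketch of the features named on this slide, using present-day NumPy. The array contents and variable names are illustrative and not taken from the talk.

```python
import numpy as np

# The array object: a 3x4 block of 64-bit integers
a = np.arange(12, dtype=np.int64).reshape(3, 4)

# Slicing and shaping return views on the same memory, not copies
col = a[:, 1]             # second column
reshaped = a.reshape(4, 3)  # same 12 values, new shape

# The data-type maps each element to a fixed number of bytes
itemsize = a.dtype.itemsize  # 8 bytes per int64

# Vectorization: one expression applied to every element
squared = a * a

# Broadcasting: a length-4 row is stretched across all 3 rows
shifted = a + np.array([10, 20, 30, 40])

# Aggregation: reduce along an axis
col_sums = a.sum(axis=0)
```

Because `col` and `reshaped` are views, writing to either one changes `a` as well.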
22. Zen of NumPy
• strided is better than scattered
• contiguous is better than strided
• descriptive is better than imperative
• array-oriented is better than object-oriented
• broadcasting is a great idea
• vectorized is better than an explicit loop
• unless it's too complicated --- then use Cython/Numba
• think in higher dimensions
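One way to read "vectorized is better than an explicit loop" and "think in higher dimensions" is the sketch below, which computes all pairwise distances twice. The function names and sample points are made up for illustration.

```python
import numpy as np

def pairwise_loop(points):
    """Explicit double loop: distance between every pair of points."""
    n = len(points)
    out = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            out[i, j] = np.sqrt(((points[i] - points[j]) ** 2).sum())
    return out

def pairwise_broadcast(points):
    """Same result, obtained by adding a dimension and broadcasting."""
    # (n, 1, d) - (1, n, d) broadcasts to an (n, n, d) array of differences
    diff = points[:, np.newaxis, :] - points[np.newaxis, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

points = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
d = pairwise_broadcast(points)
```

The broadcast version trades memory for speed: it builds an (n, n, d) intermediate, which is the kind of temporary the Zen's "unless it's too complicated" caveat leaves room for.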
24. Conway's Game of Life
• A dead cell with exactly 3 live neighbors will come to life
• A live cell with 2 or 3 neighbors will survive
• With too few or too many neighbors, the cell dies
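The three rules can be written as a short array-oriented update. This is a sketch and not the code shown in the talk; the wrap-around board and the name `life_step` are assumptions made here.

```python
import numpy as np

def life_step(grid):
    """Advance a 2-D array of 0s and 1s by one generation.

    Assumes a wrap-around (toroidal) board, a choice made for this sketch.
    """
    # Sum the eight shifted copies of the grid to count live neighbors
    neighbors = sum(
        np.roll(np.roll(grid, dy, axis=0), dx, axis=1)
        for dy in (-1, 0, 1)
        for dx in (-1, 0, 1)
        if (dy, dx) != (0, 0)
    )
    born = (grid == 0) & (neighbors == 3)
    survives = (grid == 1) & ((neighbors == 2) | (neighbors == 3))
    return (born | survives).astype(grid.dtype)

# A "blinker": three cells in a row that flip between horizontal and vertical
grid = np.zeros((5, 5), dtype=np.uint8)
grid[2, 1:4] = 1
nxt = life_step(grid)
```

No cell is visited in a Python-level loop over the board; the only loop is over the eight neighbor offsets.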
26. APL: the first array-oriented language
• Appeared in 1964
• Originated by Ken Iverson
• Direct descendants (J, K, Matlab) are still used heavily and people pay a lot of money for them
• NumPy is a descendant
(Diagram: lineage from APL through J, K, Matlab, and Numeric to NumPy.)
27. Conway's Game of Life
(Side-by-side code listings: APL and NumPy versions, each with an initialization and an update step.)
28. Demo
Python Version
Array-oriented NumPy Version
31. Benefits of Array-oriented
• Many technical problems are naturally array-oriented (easy to vectorize)
• Algorithms can be expressed at a high level
• These algorithms can be parallelized more simply (quite often much information is lost in the translation to typical compiled languages)
• Array-oriented algorithms map to modern hardware caches and pipelines.
32. We need more focus on
compiled array-oriented
languages with fast compilers!
33. What is good about NumPy?
• Array-oriented
• Extensive Dtype System (including structures)
• C-API
• Simple to understand data-structure
• Memory mapping
• Syntax support from Python
• Large community of users
• Broadcasting
• Easy to interface C/C++/Fortran code
34. What is wrong with NumPy?
• Dtype system is difficult to extend
• Immediate mode creates huge temporaries (spawning Numexpr)
• Almost an in-memory database comparable to SQLite (missing indexes)
• Integration with sparse arrays
• Lots of un-optimized parts
• Minimal support for multi-core / GPU
• Code-base is organic and hard to extend
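The "huge temporaries" point can be seen in a small sketch. It uses only the `out=` argument of NumPy ufuncs and does not use Numexpr; the array size and names are arbitrary.

```python
import numpy as np

n = 100_000
rng = np.random.default_rng(0)
a = rng.random(n)
b = rng.random(n)
c = rng.random(n)

# Immediate mode: a * b is evaluated to a full intermediate array
# before c is added to it.
result = a * b + c

# Reusing one output buffer avoids allocating that intermediate
out = np.empty_like(a)
np.multiply(a, b, out=out)   # out = a * b
np.add(out, c, out=out)      # out = out + c, in place
```

The second form gives the same values but is harder to read, which is the trade-off that motivated expression compilers such as Numexpr.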
35. Improvements needed
• NDArray improvements
  - Indexes (esp. for Structured arrays)
  - SQL front-end
  - Multi-level, hierarchical labels
  - selection via mappings (labeled arrays)
  - Memory spaces (array made up of regions)
  - Distributed arrays (global array)
  - Compressed arrays
  - Standard distributed persistence
  - fancy indexing as view and optimizations
  - streaming arrays
36. Improvements needed
• Dtype improvements
  - Enumerated types (including dynamic enumeration)
  - Derived fields
  - Specification as a class (or JSON)
  - Pointer dtype (i.e. C++ object, or varchar)
  - Finishing datetime
  - Missing data with bit-patterns
  - Parameterized field names
37. Example of Object-defined Dtype
@np.dtype
class Stock(np.DType):
    symbol = np.Str(4)
    open = np.Int(2)
    close = np.Int(2)
    high = np.Int(2)
    low = np.Int(2)
    @np.Int(2)
    def mid(self):
        return (self.high + self.low) / 2.0
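The class syntax on this slide is a proposal, not an existing NumPy API. For comparison, here is a sketch of roughly the same record using NumPy's existing structured dtypes. Reading `np.Str(4)` as a 4-byte string and `np.Int(2)` as a 2-byte integer is an assumption, and the sample values are made up.

```python
import numpy as np

# Structured dtype with the same fields as the proposed Stock class.
# "S4" is a 4-byte string and "i2" a 2-byte integer (assumed mapping).
stock = np.dtype([
    ("symbol", "S4"),
    ("open", "i2"),
    ("close", "i2"),
    ("high", "i2"),
    ("low", "i2"),
])

# Illustrative records only; not real market data
quotes = np.array(
    [(b"AAAA", 10, 12, 14, 8), (b"BBBB", 22, 25, 30, 20)],
    dtype=stock,
)

# There are no derived fields, so "mid" is computed outside the dtype
mid = (quotes["high"] + quotes["low"]) / 2.0
```

The derived `mid` field is the part the proposal adds: with structured dtypes it has to live in separate code rather than in the type definition.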
38. Improvements needed
• Ufunc improvements
  - Generalized ufuncs support more than just contiguous arrays
  - Specification of ufuncs in Python
  - Move most dtype array functions to ufuncs
  - Unify error-handling for all computations
  - Allow lazy-evaluation and remote computation --- streaming and generator data
  - Structured and string dtype ufuncs
  - Multi-core and GPU optimized ufuncs
  - Group-by reduction
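As a sketch of what "group-by reduction" means, the snippet below emulates it with calls that exist in NumPy today (`np.unique` and the ufunc method `np.add.at`). It is a workaround, not the built-in feature the slide asks for, and the data is illustrative.

```python
import numpy as np

keys = np.array(["a", "b", "a", "c", "b", "a"])
values = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])

# Map each key to a small integer group id
labels, group_ids = np.unique(keys, return_inverse=True)

# Unbuffered ufunc reduction: accumulate each value into its group's slot
sums = np.zeros(len(labels))
np.add.at(sums, group_ids, values)
```

Here `labels` holds the distinct keys in sorted order and `sums` holds one total per key.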
39. More Improvements needed
• Miscellaneous improvements
  - ABI-management
  - Eventual Move to library (NDLib)?
  - Integration with LLVM
  - Sparse dimensions
  - Remote computation
  - Fast I/O for CSV and Excel
  - Out-of-core calculations
  - Delayed-mode execution
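A hand-rolled version of an out-of-core calculation, using the memory mapping NumPy already has (`np.memmap`). The file name, sizes, and chunking scheme are assumptions for this sketch; the slide asks for this to happen without such manual bookkeeping.

```python
import os
import tempfile

import numpy as np

# Write a file of float64 values to stand in for data too large for memory
path = os.path.join(tempfile.mkdtemp(), "data.bin")
np.arange(1_000_000, dtype=np.float64).tofile(path)

# Map the file instead of loading it; pages are read on demand
data = np.memmap(path, dtype=np.float64, mode="r")

# Reduce one chunk at a time so that roughly chunk_size elements
# are being worked on at any moment
chunk_size = 100_000
total = 0.0
for start in range(0, data.shape[0], chunk_size):
    total += data[start:start + chunk_size].sum()
```

Every reduction has to be rewritten by hand in this chunked style, which is the gap the out-of-core item points at.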
40. New Project
NumPy
Blaze
Next Generation NumPy
Out-of-core
Distributed Tables
41. Blaze Main Features
• New ndarray with multiple memory segments
• Distributed ndtable which can span the world
• Fast, out-of-core algorithms for all functions
• Delayed-mode execution: expressions build up graph which gets executed where the data is
• Built-in Indexes (beyond searchsorted)
• Built-in labels (data-array)
• Sparse dimensions (defined by attributes or elements of another dimension)
• Direct adapters to all data (move code to data)
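Delayed-mode execution can be illustrated with a toy expression graph. The `Lazy` class and `leaf` helper below are hypothetical names invented for this sketch and are not Blaze's API; the sketch evaluates locally and leaves out scheduling near the data.

```python
import numpy as np

class Lazy:
    """Hypothetical node in an expression graph (not Blaze's API)."""

    def __init__(self, op, *args):
        self.op = op        # "leaf", "add", or "mul"
        self.args = args    # child nodes, or the wrapped array for a leaf

    def __add__(self, other):
        return Lazy("add", self, other)

    def __mul__(self, other):
        return Lazy("mul", self, other)

    def evaluate(self):
        """Walk the graph and only now touch the data."""
        if self.op == "leaf":
            return self.args[0]
        left, right = (arg.evaluate() for arg in self.args)
        return left + right if self.op == "add" else left * right

def leaf(array):
    """Wrap concrete data as a graph leaf."""
    return Lazy("leaf", np.asarray(array))

x = leaf([1.0, 2.0, 3.0])
y = leaf([10.0, 20.0, 30.0])

expr = x * y + x          # builds a graph; nothing is computed yet
result = expr.evaluate()  # computation happens here
```

Because `expr` is data rather than a result, a scheduler could in principle inspect it, optimize it, or ship it to wherever the arrays live before calling `evaluate`.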
46. Data URLs
• Variables in script are global addresses (DATA URLs). All the world's data you can see via the web can be used as part of an algorithm by referencing it as a part of an array.
• Dynamically interpret bytes as data-type
• Scheduler will push code based on data-type to the data instead of pulling data to the code.
48. NDArray
• Local ndarray (NumPy++)
• Multiple byte-buffers (streaming or random access)
• Variable-length arrays
• All kinds of data-types (everything...)
• Multiple patterns of memory access possible (Z-order, Fortran-order, C-order)
• Sparse dimensions
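Two of the memory access patterns named here, C-order and Fortran-order, already exist in NumPy and can be compared through strides. Z-order is not something NumPy arrays offer, so it is left out of this sketch.

```python
import numpy as np

# The same 2x3 values laid out two ways
c_order = np.arange(6, dtype=np.int32).reshape(2, 3)   # row-major (C)
f_order = np.asfortranarray(c_order)                   # column-major (Fortran)

# Strides give the byte step along each axis
c_strides = c_order.strides   # (12, 4): rows are 3 * 4 bytes apart
f_strides = f_order.strides   # (4, 8): columns are 2 * 4 bytes apart
```

Both arrays compare equal element by element; only the walk through memory differs, which is what determines cache behavior for row-wise versus column-wise loops.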
49. GFunc
• Generalized Function
• All NumPy functions
  - element-by-element
  - linear algebra
  - manipulation
  - Fourier Transform
• Iteration and Dispatch to low-level kernels
• Kernels can be written in anything that builds a C-like interface
50. Early Timeline
Date            Milestone
July 2012       Pre-alpha release
December 2012   Early Beta Release
June 2013       Version 1.0
51. PyData
All computing modules known to work with
Blaze will be placed under PyData umbrella of
projects over the coming years.
53. NumPy Users
• Want to be able to write Python to get fast code that works on arrays and scalars
• Need access to a boat-load of C-extensions (NumPy is just the beginning)
PyPy doesn't cut it for us!
54. (Diagram: dynamic compilation turns a Python function into a function pointer that plugs into the NumPy runtime as ufuncs, generalized ufuncs, window kernel funcs, function-based indexing, memory filters, I/O filters, reduction filters, and computed columns.)
55. SciPy needs a Python compiler
optimize integrate
special ode
writing more of SciPy at high-level
56. Numba -- a Python compiler
• Replays byte-code on a stack with simple type-inference
• Translates to LLVM (using LLVM-py)
• Uses LLVM for code-gen
• Resulting C-level function-pointer can be inserted into NumPy run-time
• Understands NumPy arrays
• Is NumPy / SciPy aware
57. NumPy + Mamba = Numba
Python Function Machine Code
LLVM-PY
LLVM 3.1
ISPC OpenCL OpenMP CUDA CLANG
Intel AMD Nvidia Apple
63. NumFOCUS
Num(Py) Foundation for Open Code for Usable Science
64. NumFOCUS
• Mission
  - To initiate and support educational programs furthering the use of open source software in science.
  - To promote the use of high-level languages and open source in science, engineering, and math research
  - To encourage reproducible scientific research
  - To provide infrastructure and support for open source projects for technical computing
65. NumFOCUS
• Activities
  - Sponsor sprints and conferences
  - Provide scholarships and grants for people using these tools
  - Pay for documentation development and basic course development
  - Fund continuous integration and build systems
  - Work with domain-specific organizations
  - Raise funds from industries using Python and NumPy
66. NumFOCUS
Core Projects
NumPy SciPy IPython Matplotlib
Other Projects (seeking more --- need representatives)
Scikits Image
67. NumFOCUS
• Directors
  - Perry Greenfield
  - John Hunter
  - Jarrod Millman
  - Travis Oliphant
  - Fernando Perez
• Members
  - Basically people who donate for now. In time, a body that elects directors.
68.
• Large-scale data analysis products
• Python and NumPy training
• NumPy support and consulting
• Rich-client or web user-interfaces
• Blaze and PyData Development