The document discusses MUDA, a language for describing SIMD operations portably across CPU architectures. MUDA aims to extract maximum floating-point performance from CPUs on large data by combining SIMD with cache-optimized computation. The status slides list backends under development, and future directions include automatic optimization of memory access and cache misses, since optimizing memory is far more important for performance than SIMDization alone.
9. Accelerated computing
• Many-core GPGPU? NO!
• CPU, not GPU: GPGPU was dead!! GPU will be dead soon!!
10. Why GPU -> GPGPU is BAD
• Larger latency: host <-> PCI-ex
• Internal architecture is a black box
  • Only the GPU maker knows it
• Larger cost of branching
• Debugger?
• Program only runs on a specific GPU maker's GPU
  • Not portable.
11. Why CPU -> Accelerated computing is GOOD
• Easy to program
• CPU maker provides good internal spec documentation
• Fast execution of branching
• gdb :-)
• Portable & versatile
17. No unified way to describe SIMD op
• SSE: _mm_add_ps()
• AltiVec: vec_add()
• SPE: spu_add()
18. CPU ISA changes frequently
• SSE2 (2000), SSE3 (2004), SSE4 (2006)
• SSE5 and coming new CPU designs(?)
• 8-element SIMD? No SIMD in the future CPU?
• Keeping up with them is hard and not productive. A waste of your time.
19. [Diagram: MUDA compiler flow]
MUDA (portable, CPU-independent description)
  -> MUDA compiler ->
SSE2 C code / SSE4 C code / VMX C code / LLVM IR (CPU- or arch-dependent code)
20. Status
• SSE2 backend: 75 %
• SSE4 backend: 0 %
• VMX backend: 20 %
• LLVM IR backend: 30 %
• SIMD math function for MUDA: 5 %
• Automatic optimizer: TODO
(= what I'm currently working on)
21. Future direction
• Cache miss analysis and memory access optimization
  • Valgrind, Cache Miss Equation (CME)
• Automatic optimization
  • Like what FFTW, ATLAS and Spiral are doing
• Automatic error measurement for floating point computation
  • Interval Arithmetic, Affine Arithmetic, Gappa
22. Performance gap
[Bar chart comparing SIMD vs. Memory (higher is better): Scalar : SIMD = 1 : 4, cache miss : cache hit = 1 : 100]
23. Performance gap
Optimizing memory access is much more important than SIMDization.
[Same bar chart: Scalar : SIMD = 1 : 4, cache miss : cache hit = 1 : 100]