MUDA
MUltiple Data Accelerator language

        Project Overview
          Feb 24, 2008
            Syoyo FUJITA
(Chart: Nikkei 225 index)
GPU slumps, CPU soars

(Chart: GPU vs. CPU peak floating-point performance, 2007 to Feb/2008)
  GeForce 9800 GX2 rumor: 1 TFlops? (3x of G80), 500 GFlops? (+50% of G80). No update!
  PS3: 179.2 GFlops (2007). Mac Pro octa: 204 GFlops (Feb/2008). +800%
(Chart: Nikkei 225 index, annotated: Subprime shock! Credit boom ends! US economy declines! Green IT!)

Future of the GPU trend
(Diagram: two roads to accelerated computing: CPU -> many-core, GPU -> GPGPU)
(Same diagram, with the GPU -> GPGPU road crossed out: NO!)

GPGPU is dead!! GPU will be dead soon!!
Why GPU -> GPGPU is BAD

• Larger latency: host <-> PCI-Express
• Internal architecture is a black box
  • Only the GPU maker knows it
• Larger cost of branching
• Debugger?
• Programs run only on a specific GPU maker's GPUs
  • Not portable
Why CPU -> Accelerated computing is GOOD

• Easy to program
• CPU makers provide good internal spec documentation
• Fast execution of branching
• gdb :-)
• Portable & versatile
(Diagram: CPU -> many-core -> accelerated computing, with MUDA sitting on this CPU path)
MUDA's goal

• Extract the CPU's maximum floating-point performance for large data
  • SIMD
  • Cache-optimized computation
MUDA example
MUDA code
vec sqrtmu(vec x)
{
    vec y0, y0x, y0xhalf;
    vec oneish = bit(0x3f800001);

    y0 = rsqrt(x);
    y0x = y0 * x;
    y0xhalf = 0.5 * y0x;

    return ((oneish - y0 * y0x) * y0xhalf + y0x);
}
x86/SSE output

__m128 sqrtmu (const __m128 * x)
{
    __m128 y0 ;
    __m128 y0x ;
    __m128 y0xhalf ;

    const __m128 t_vec4 = (__m128)_mm_set1_epi32( 1065353217) ;
    __m128 oneish = t_vec4 ;

    const __m128 t_vec6 = (*x) ;
    const __m128 t_vec5 = _mm_rsqrt_ps( t_vec6) ;
    y0 = t_vec5 ;

    const __m128 t_vec8 = y0 ;
    const __m128 t_vec9 = (*x) ;
    const __m128 t_vec7 = _mm_mul_ps( t_vec8 , t_vec9 ) ;
    y0x = t_vec7 ;

    const float t_float13 = 0.5 ;
    const float t_float12 = t_float13 ;
    const __m128 t_vec10 = _mm_set_ps1( t_float12 ) ;
    const __m128 t_vec14 = y0x ;
    const __m128 t_vec11 = _mm_mul_ps( t_vec10 , t_vec14 ) ;
    y0xhalf = t_vec11 ;

    const __m128 t_vec19 = oneish ;
    const __m128 t_vec20 = y0 ;
    const __m128 t_vec21 = y0x ;
    const __m128 t_vec15 = _mm_mul_ps( t_vec20 , t_vec21 ) ;
    const __m128 t_vec16 = _mm_sub_ps( t_vec19 , t_vec15 ) ;
    const __m128 t_vec22 = y0xhalf ;
    const __m128 t_vec17 = _mm_mul_ps( t_vec16 , t_vec22 ) ;
    const __m128 t_vec23 = y0x ;
    const __m128 t_vec18 = _mm_add_ps( t_vec17 , t_vec23 ) ;
    return t_vec18 ;
}
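For reference, the generated code is one Newton-Raphson refinement of the hardware reciprocal-square-root estimate. A scalar C sketch of the same computation follows; the name sqrtmu_scalar is illustrative (not part of MUDA), and 1.0f/sqrtf stands in for the low-precision rsqrt estimate.

#include <math.h>
#include <stdint.h>
#include <string.h>

/* Scalar sketch of the refinement above:
 *   y0     ~= 1/sqrt(x)   (low-precision estimate; _mm_rsqrt_ps in the SSE output)
 *   y0x     = y0 * x     ~= sqrt(x)
 *   result  = y0x + 0.5 * y0x * (1 - y0 * y0x)
 *           = y0x * (1.5 - 0.5 * y0 * y0x)   ... one Newton-Raphson step
 * "oneish" (bit pattern 0x3f800001, just above 1.0f) presumably nudges the
 * constant slightly to compensate for the estimate's bias. */
static float sqrtmu_scalar(float x)
{
    uint32_t bits = 0x3f800001u;          /* bit(0x3f800001) in MUDA */
    float oneish;
    memcpy(&oneish, &bits, sizeof oneish);

    float y0      = 1.0f / sqrtf(x);      /* stand-in for the rsqrt estimate */
    float y0x     = y0 * x;
    float y0xhalf = 0.5f * y0x;

    return (oneish - y0 * y0x) * y0xhalf + y0x;
}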
Why MUDA?

No unified way to describe SIMD ops

• SSE: _mm_add_ps()
• AltiVec: vec_add
• SPE: spu_add
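To make the divergence concrete, here is a sketch of one 4-wide float add written per ISA. The preprocessor guards and headers are the usual ones for each vendor's toolchain, and the MUDA form at the end is inferred from the sqrtmu example above.

/* One 4-wide float add, three ISAs: the kind of divergence MUDA hides.
 * Each branch assumes that target's compiler and headers. */
#if defined(__SSE__)
  #include <xmmintrin.h>
  __m128 add4(__m128 a, __m128 b) { return _mm_add_ps(a, b); }
#elif defined(__ALTIVEC__)
  #include <altivec.h>
  vector float add4(vector float a, vector float b) { return vec_add(a, b); }
#elif defined(__SPU__)
  #include <spu_intrinsics.h>
  vector float add4(vector float a, vector float b) { return spu_add(a, b); }
#endif

/* In MUDA the same operation is written once, portably:
 *
 *   vec add4(vec a, vec b) { return a + b; }
 */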
CPU ISA changes frequently

• SSE2 (2000), SSE3 (2004), SSE4 (2006)
• SSE5 and coming new CPU designs (?)
• 8-element SIMD? No SIMD at all in future CPUs?
• Keeping up with them is hard and not productive: a waste of your time.
(Diagram: compilation pipeline)

  MUDA source (portable, CPU-independent description)
      |
      v
  MUDA compiler
      |
      +--> SSE2 C code
      +--> SSE4 C code
      +--> VMX C code
      +--> LLVM IR
           (CPU- or architecture-dependent code)
Status

• SSE2 backend : 75 %
• SSE4 backend : 0 %
• VMX backend : 20 %
• LLVM IR backend : 30 %
• SIMD math functions for MUDA : 5 %
• Automatic optimizer : TODO (= what I'm currently working on)
Future direction

• Cache-miss analysis and memory-access optimization
  • Valgrind, Cache Miss Equations (CME)
• Automatic optimization
  • Like what FFTW, ATLAS, and Spiral do
• Automatic error measurement for floating-point computation
  • Interval Arithmetic, Affine Arithmetic, Gappa
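As a flavor of the error-measurement idea, here is a minimal interval-arithmetic sketch in C (purely illustrative, not MUDA output). It propagates an enclosure of the rsqrt estimate's documented error bound (SSE's rsqrtps guarantees relative error <= 1.5 * 2^-12) through one multiply, bounding the computed sqrt.

#include <stdio.h>
#include <math.h>

/* Track a [lo, hi] enclosure through a computation to bound the effect of
 * an approximate input. Real tools (Gappa, affine arithmetic) are far more
 * precise; this only shows the principle. */
typedef struct { double lo, hi; } interval;

static interval iv_mul(interval a, interval b)
{
    double p1 = a.lo * b.lo, p2 = a.lo * b.hi, p3 = a.hi * b.lo, p4 = a.hi * b.hi;
    interval r;
    r.lo = fmin(fmin(p1, p2), fmin(p3, p4));
    r.hi = fmax(fmax(p1, p2), fmax(p3, p4));
    return r;
}

int main(void)
{
    /* rsqrt(4.0) = 0.5 with at most ~4e-4 relative error, so enclose the
     * estimate and propagate it through y0 * x to bound sqrt(4.0). */
    interval y0 = { 0.5 * (1.0 - 4e-4), 0.5 * (1.0 + 4e-4) };
    interval x  = { 4.0, 4.0 };
    interval y0x = iv_mul(y0, x);
    printf("sqrt(4) lies in [%g, %g]\n", y0x.lo, y0x.hi);
    return 0;
}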
Performance gap

(Bar chart, two bars labeled SIMD and Memory; higher is better, scale 0-100)
  SIMD:   Scalar : SIMD          = 1 : 4
  Memory: cache miss : cache hit = 1 : 100

Optimizing memory access is much more important than SIMDization.
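A minimal C sketch of why the memory side dominates: the same summation traversed with unit stride versus a large stride. The array size is illustrative; only the access order differs.

#include <stddef.h>

#define N 4096

/* Same arithmetic, different memory-access order.
 * Row-major traversal touches consecutive addresses (mostly cache hits);
 * column-major traversal strides by N floats and misses the cache on
 * nearly every access. On typical hardware the gap from this alone
 * dwarfs the ~4x available from SIMDizing the adds. */
float sum_row_major(const float a[N][N])
{
    float s = 0.0f;
    for (size_t i = 0; i < N; ++i)
        for (size_t j = 0; j < N; ++j)
            s += a[i][j];          /* unit stride: cache friendly */
    return s;
}

float sum_col_major(const float a[N][N])
{
    float s = 0.0f;
    for (size_t j = 0; j < N; ++j)
        for (size_t i = 0; i < N; ++i)
            s += a[i][j];          /* N-float stride: cache hostile */
    return s;
}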
