
MUDA: MUltiple Data Accelerator language

        Project Overview
          Feb 24, 2008
            Syoyo FUJITA
Nikkei 225 index
GPU slumps, CPU soars

Geforce 9800 GX2 rumor:
  1 TFlops? (3x of G80)
  500 GFlops? (+50% of G80)
  No update!

[Chart, time axis 2007 to Feb/2008]
  PS3:           179.2 GFlops
  Mac Pro octa:  204 GFlops
  +800 %
[Chart: Nikkei 225 index]
  Subprime shock!
  Credit boom ends!
  US economy declines!
  Green IT!

Future of GPU trend

Accelerated computing

  CPU -> many-core
  GPU -> GPGPU        NO!

GPGPU was dead!!
GPU will be dead soon!!
Why GPU -> GPGPU is BAD
• Larger latency: host <-> PCI-ex
• Internal architecture is a black box
  • Only the GPU maker knows it
• Larger cost of branching
• Debugger?
• Programs only run on a specific GPU maker's GPU
  • Not portable.
Why CPU -> Accelerated computing is GOOD

• Easy to program
• CPU makers provide good internal spec documentation
• Fast execution of branching
• gdb :-)
• Portable & Versatile


MUDA's goal

• Extract the CPU's maximum
  floating-point performance for
  large data
  • SIMD
  • Cache optimized computation
MUDA example
MUDA code
vec sqrtmu(vec x)
{
    vec y0, y0x, y0xhalf;
    vec oneish = bit(0x3f800001);

    y0 = rsqrt(x);
    y0x = y0 * x;
    y0xhalf = 0.5 * y0x;

    return ((oneish - y0 * y0x) * y0xhalf + y0x);
}
x86/SSE output
__m128 sqrtmu (const __m128 * x)
{
    __m128 y0 ;

    __m128 y0x ;

    __m128 y0xhalf ;

    const __m128 t_vec4 = (__m128)_mm_set1_epi32( 1065353217) ;
    __m128 oneish = t_vec4 ;

    const __m128 t_vec6 = (*x) ;
    const __m128 t_vec5 = _mm_rsqrt_ps( t_vec6) ;
    y0 = t_vec5 ;

    const __m128 t_vec8 = y0 ;
    const __m128 t_vec9 = (*x) ;
    const __m128 t_vec7 = _mm_mul_ps( t_vec8 , t_vec9 ) ;
    y0x = t_vec7 ;

    const float t_float13 = 0.5 ;
    const float t_float12 = t_float13 ;
    const __m128 t_vec10 = _mm_set_ps1( t_float12 ) ;
    const __m128 t_vec14 = y0x ;
    const __m128 t_vec11 = _mm_mul_ps( t_vec10 , t_vec14 ) ;
    y0xhalf = t_vec11 ;

    const __m128 t_vec19 = oneish ;
    const __m128 t_vec20 = y0 ;
    const __m128 t_vec21 = y0x ;
    const __m128 t_vec15 = _mm_mul_ps( t_vec20 , t_vec21 ) ;
    const __m128 t_vec16 = _mm_sub_ps( t_vec19 , t_vec15 ) ;
    const __m128 t_vec22 = y0xhalf ;
    const __m128 t_vec17 = _mm_mul_ps( t_vec16 , t_vec22 ) ;
    const __m128 t_vec23 = y0x ;
    const __m128 t_vec18 = _mm_add_ps( t_vec17 , t_vec23 ) ;
    return t_vec18 ;
}
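
The arithmetic in sqrtmu reads as one Newton-Raphson style correction applied to a rough reciprocal square root: y0 approximates 1/sqrt(x), y0x = y0 * x approximates sqrt(x), and the returned value adds 0.5 * y0x * (1 - y0 * y0x) to it. The scalar C sketch below is only an illustration of that idea, not MUDA output. The names bits_to_float, rsqrt_estimate, sqrtmu_scalar and abs_error are hypothetical, and rsqrt_estimate is a deliberately crude integer bit trick standing in for the hardware estimate (it is not how _mm_rsqrt_ps works).

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Reinterpret a 32-bit pattern as a float, in the spirit of MUDA's bit(). */
float bits_to_float(uint32_t u)
{
    float f;
    memcpy(&f, &u, sizeof f);
    return f;
}

/* Hypothetical stand-in for a hardware rsqrt estimate: the classic integer
   bit trick with no refinement, accurate only to a few percent. */
float rsqrt_estimate(float x)
{
    uint32_t i;
    assert(x > 0.0f);               /* only positive, normal inputs are handled */
    memcpy(&i, &x, sizeof i);
    i = 0x5f3759dfu - (i >> 1);
    return bits_to_float(i);
}

/* Scalar transcription of sqrtmu: one correction step on top of the estimate. */
float sqrtmu_scalar(float x)
{
    const float oneish = bits_to_float(0x3f800001u); /* just above 1.0f */
    const float y0 = rsqrt_estimate(x);              /* roughly 1/sqrt(x) */
    const float y0x = y0 * x;                        /* roughly sqrt(x)   */
    const float y0xhalf = 0.5f * y0x;
    return (oneish - y0 * y0x) * y0xhalf + y0x;
}

/* Absolute difference between a computed value and a reference value. */
float abs_error(float got, float want)
{
    return got > want ? got - want : want - got;
}
```

Comparing abs_error(sqrtmu_scalar(4.0f), 2.0f) with abs_error(rsqrt_estimate(4.0f) * 4.0f, 2.0f) shows the correction step shrinking the error of the raw estimate; how much accuracy the real SSE version reaches depends on the precision of the hardware rsqrt instruction.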
No unified way to
    describe SIMD op

• SSE: _mm_add_ps()
• AltiVec: vec_add()
• SPE: spu_add()
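
To make the problem concrete, here is a minimal sketch of what hand-written portability across these three spellings tends to look like in C. The names native_vec, NATIVE_ADD, vec4 and vec4_add are hypothetical and are not part of MUDA. The preprocessor checks assume GCC-style predefined macros (__SSE__, __ALTIVEC__, __SPU__), and a scalar fallback is used when none is defined.

```c
#include <assert.h>

#if defined(__SSE__)
#include <xmmintrin.h>
typedef __m128 native_vec;
#define NATIVE_ADD(a, b) _mm_add_ps((a), (b))
#elif defined(__ALTIVEC__)
#include <altivec.h>
typedef __vector float native_vec;
#define NATIVE_ADD(a, b) vec_add((a), (b))
#elif defined(__SPU__)
#include <spu_intrinsics.h>
typedef vector float native_vec;
#define NATIVE_ADD(a, b) spu_add((a), (b))
#else
#define VEC4_SCALAR_FALLBACK 1      /* no known SIMD unit: plain C loop */
#endif

/* Four floats, viewable either as an array or as the native SIMD type. */
typedef union {
    float f[4];
#ifndef VEC4_SCALAR_FALLBACK
    native_vec v;
#endif
} vec4;

/* Element-wise add, spelled differently on every architecture. */
vec4 vec4_add(vec4 a, vec4 b)
{
    vec4 r;
    /* The union trick assumes both views occupy exactly 16 bytes. */
    assert(sizeof(vec4) == 4 * sizeof(float));
#ifdef VEC4_SCALAR_FALLBACK
    {
        int i;
        for (i = 0; i < 4; ++i)
            r.f[i] = a.f[i] + b.f[i];
    }
#else
    r.v = NATIVE_ADD(a.v, b.v);
#endif
    return r;
}
```

Every additional operation needs another macro per architecture, which is the maintenance burden a single portable description is meant to remove.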
CPU ISA changes frequently
• SSE2 (2000), SSE3 (2004), SSE4 (2006)
• SSE5 and coming new CPU designs (?)
• 8-element SIMD? No SIMD in future CPUs?
• Keeping up with them is hard and
  not productive. A waste of your time.
MUDA
  (portable, CPU independent description)
    |
    v
MUDA compiler
    |
    +--> SSE2 C code
    +--> SSE4 C code
    +--> VMX C code
    +--> LLVM IR
  (CPU or Arch dependent code)
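
As a rough illustration of one portable description fanning out to several backends, the sketch below emits a single binary operation as C source text for two targets, imitating the temporaries seen in the x86/SSE output. It is an assumption-laden toy and not the MUDA compiler's implementation: backend, binop, emit_binop and both lookup tables are hypothetical names, and only add and subtract are covered.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

typedef enum { BACKEND_SSE, BACKEND_VMX, BACKEND_COUNT } backend;
typedef enum { OP_ADD, OP_SUB, OP_COUNT } binop;

/* Intrinsic spelling per backend and operation. */
static const char *const intrinsic_name[BACKEND_COUNT][OP_COUNT] = {
    { "_mm_add_ps", "_mm_sub_ps" },   /* SSE */
    { "vec_add",    "vec_sub"    },   /* VMX / AltiVec */
};

/* C type of a 4-float vector per backend. */
static const char *const vector_type[BACKEND_COUNT] = {
    "__m128",
    "vector float",
};

/* Write "const <type> <dst> = <intrinsic>( <lhs> , <rhs> ) ;" into out.
   Returns the length snprintf reports (output is truncated if cap is too small). */
int emit_binop(char *out, size_t cap, backend be, binop op,
               const char *dst, const char *lhs, const char *rhs)
{
    assert(be < BACKEND_COUNT && op < OP_COUNT);
    return snprintf(out, cap, "const %s %s = %s( %s , %s ) ;",
                    vector_type[be], dst, intrinsic_name[be][op], lhs, rhs);
}
```

Calling emit_binop twice with the same operands and a different backend value yields the SSE and VMX spellings of the same statement, so the architecture-specific knowledge lives in the tables rather than in user code.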
Status
• SSE2 backend: 75 %
• SSE4 backend: 0 %
• VMX backend: 20 %
• LLVM IR backend: 30 %
• SIMD math function for MUDA: 5 %
• Automatic optimizer: TODO
     = I'm currently working on
Future direction
• Cache miss analysis and memory access
  optimization
  • Valgrind, Cache Miss Equation (CME)
• Automatic optimization
  • As FFTW, ATLAS and Spiral do
• Automatic error measurement for
  floating-point computation
  • Interval Arithmetic, Affine Arithmetic, Gappa
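
For readers unfamiliar with interval arithmetic, the sketch below shows the basic idea behind using it for error measurement: every value is carried as a [lo, hi] pair that is meant to enclose the exact result, and the final width bounds the accumulated rounding error. This is a simplified illustration under stated assumptions, not a planned MUDA feature. All names are hypothetical, bounds are pushed outward by a relative DBL_EPSILON instead of switching the rounding mode, and overflow and subnormal values are ignored.

```c
#include <assert.h>
#include <float.h>

typedef struct { double lo, hi; } interval;

/* Absolute value without depending on libm. */
static double magnitude(double x) { return x < 0.0 ? -x : x; }

/* Push both bounds outward by about one unit in the last place so that
   round-to-nearest results still enclose the exact value. */
interval widen(interval r)
{
    r.lo -= magnitude(r.lo) * DBL_EPSILON;
    r.hi += magnitude(r.hi) * DBL_EPSILON;
    return r;
}

/* Interval containing a single double. */
interval interval_point(double x)
{
    interval r = { x, x };
    return r;
}

interval interval_add(interval a, interval b)
{
    interval r = { a.lo + b.lo, a.hi + b.hi };
    return widen(r);
}

interval interval_mul(interval a, interval b)
{
    /* The extremes of a product lie among the four corner products. */
    const double p[4] = { a.lo * b.lo, a.lo * b.hi, a.hi * b.lo, a.hi * b.hi };
    interval r = { p[0], p[0] };
    int i;
    assert(a.lo <= a.hi && b.lo <= b.hi);
    for (i = 1; i < 4; ++i) {
        if (p[i] < r.lo) r.lo = p[i];
        if (p[i] > r.hi) r.hi = p[i];
    }
    return widen(r);
}

/* Width of the enclosure: an upper bound on the uncertainty. */
double interval_width(interval a) { return a.hi - a.lo; }
```

Running a kernel once with interval operands and inspecting interval_width of the result gives a conservative error bound; affine arithmetic and tools such as Gappa aim at tighter bounds than this naive scheme.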
Performance gap

[Bar chart, y-axis 0 to 100, higher is better]
  SIMD:    Scalar : SIMD            = 1:4
  Memory:  cache miss : cache hit   = 1:100

Optimizing memory access is much
more important than SIMDization
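
A small C sketch of what "optimizing memory access" can mean in practice: the two functions below compute the same sum over an n x n row-major matrix, but the first walks memory with stride 1 while the second jumps n floats between consecutive reads. The function names are hypothetical, and the code makes no claim about the actual speed ratio, which depends on matrix size, cache sizes and hardware prefetching.

```c
#include <assert.h>
#include <stddef.h>

/* Cache-friendly: consecutive iterations touch adjacent addresses. */
double sum_row_major(const float *m, size_t n)
{
    double s = 0.0;
    size_t r, c;
    assert(m != NULL);
    for (r = 0; r < n; ++r)
        for (c = 0; c < n; ++c)
            s += m[r * n + c];
    return s;
}

/* Cache-unfriendly for large n: consecutive iterations are n floats apart,
   so each read may land on a different cache line. */
double sum_col_major(const float *m, size_t n)
{
    double s = 0.0;
    size_t r, c;
    assert(m != NULL);
    for (c = 0; c < n; ++c)
        for (r = 0; r < n; ++r)
            s += m[r * n + c];
    return s;
}
```

Timing both functions on a matrix much larger than the last-level cache is a simple way to observe the memory-side gap on a given machine, independent of any SIMD use.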
