NVIDIA Kepler Architecture


Performance. Efficiency. Availability.

Tesla: 2-3x faster every 2 years
[Chart: DP GFLOPS per watt (axis 2 to 16) by year, 2008 to 2014. Data points in rising order: T10, Fermi, Kepler, Maxwell.]

Kepler
THE FASTEST AND MOST EFFICIENT HPC ARCHITECTURE
SMX
Hyper-Q
Dynamic Parallelism

Kepler: Speed and Efficiency
SM (Tesla M2090): control logic, 32 cores
SMX (Tesla K20): control logic, 192 cores
3x performance per watt

1 petaflop
In just 10 racks
400 kW

Hyper-Q
CPU cores launch work on Kepler simultaneously
FERMI: 1 MPI task at a time
KEPLER: 32 MPI tasks simultaneously

Hyper-Q
Maximum GPU utilization, less CPU idle time
[Two charts: GPU utilization (%, 0 to 100) over time.]

Dynamic Parallelism
The GPU adapts to the data, dynamically spawning new threads
[Diagram: CPU with Fermi GPU compared with CPU with Kepler GPU.]

Dynamic Parallelism
GPU programming made simpler and more accessible
[Three panels: Too coarse, Too fine, Just right.]

Tesla K10                                  Tesla K20
3x single precision                        3x double precision
1.8x memory bandwidth                      Hyper-Q, Dynamic Parallelism
Image and signal processing, seismic       CFD, FEA, finance, physics
Available now                              Available in Q4 2012

Tesla K10
Same power consumption, 2x the performance of Fermi
Product Name               M2090         K10 (board)     K10 (per GPU)
GPU Architecture           Fermi         Kepler GK104    Kepler GK104
# of GPUs                  1             2
Single Precision Flops     1.3 TF        4.58 TF         2.29 TF
Double Precision Flops     0.66 TF       0.190 TF        0.095 TF
# CUDA Cores               512           3072            1536
Memory size                6 GB          8 GB            4 GB
Memory BW (ECC off)        177.6 GB/s    320 GB/s        160 GB/s
PCI-Express                Gen 2         Gen 3 (Gen 2 compatible)
Board Power                225 watts     225 watts

K10 for oil and gas
[Bar chart: seismic analysis, vertical axis 0 to 2.]
    1.8x simulations per day for more accurate models
    Lower risk and higher reliability
    2x GPUs in the same form factor

K10 for defense
[Bar chart: numerical analytics, M2090 vs K10, vertical axis 0 to 2.]
 1.9x computations per day for more accurate models
 Faster analytics and more accurate decisions
 2x GPUs in the same form factor

K10 for bioinformatics
[Bar chart: vertical axis 0 to 3.]
  2.2x simulations for applications
  Larger experiments on smaller clusters
  2x GPUs in the same form factor
  Gromacs 4.6 pre-beta version
  * 2 instances of AMBER 12 (with beta patch)

Tesla K10 vs M2090: 2x performance per watt
[Bar chart, vertical axis 0.00 to 2.50. Benchmarks: Seismic Processing, LAMMPS, NAMD, AMBER*, Radio Astronomy Cross-Correlator, Nbody, Defense (Integer Ops).]
* 2 instances of AMBER running JAC

118 commercial applications accelerated on GPUs
www.nvidia.com/teslaapps

MSC Nastran price/performance
NOTE: Based on MSC Nastran 2012 and a 3.4M DOF model
Results from PSG cluster node (fs0), 2x Nehalem 2.27GHz, 96GB memory, Linux/CentOS; 2x Tesla C2050, CUDA 4.0

Factors Gain Over Base License Results
Configuration                                   Speed-up    Solution Cost
Nastran SMP License, 1 Core                     1.0         1.0
Nastran SMP, 4 Cores                            2.6         1.0
Nastran DMP, 8 Cores                            3.3         1.24
Nastran SMP + GPU License, 1 Core + 1 GPU       4.6         1.13
Nastran DMP + GPU License, 2 Cores + 2 GPUs     5.3         1.4

Extra 13% cost yields 160% performance (over 8 cores)

Solution Cost Basis *
- Linear Structures Package (Base SMP license)
- Expert Package (Nonlinear)
- Implicit HPC Package (DMP Network License)
- GPU License
- $10K for System cost
- $4K for 2x Tesla 20-series

Performance Basis
SOL101 Model:
- 3.4M DOF
- Stress analysis
- Direct sparse

* 1 year lease for SW pricing

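The callout "Extra 13% cost yields 160% performance (over 8 cores)" can be reproduced from the bar values on this slide. The following is a minimal C sketch of that arithmetic, assuming the bars read 3.3x speed-up at 1.24x cost for Nastran DMP on 8 cores and 5.3x speed-up at 1.4x cost for Nastran DMP with 2 cores + 2 GPUs. The `Config` type and the function names are hypothetical and introduced only for illustration.

```c
#include <assert.h>

/* One bar group from the chart: speed-up and solution cost, both relative
   to the 1-core base SMP license. Values are read off the slide. */
typedef struct {
    const char *name;
    double speedup;
    double cost;
} Config;

/* How many times faster configuration a is than configuration b. */
double relative_performance(Config a, Config b)
{
    assert(b.speedup > 0.0);
    return a.speedup / b.speedup;
}

/* Fractional extra cost of a over b; 0.13 means 13% more expensive. */
double extra_cost(Config a, Config b)
{
    assert(b.cost > 0.0);
    return a.cost / b.cost - 1.0;
}

/* Speed-up obtained per unit of relative solution cost. */
double perf_per_cost(Config c)
{
    assert(c.cost > 0.0);
    return c.speedup / c.cost;
}
```

With `{"DMP 8 cores", 3.3, 1.24}` and `{"DMP 2 cores + 2 GPUs", 5.3, 1.4}`, `relative_performance` returns about 1.6 and `extra_cost` about 0.13, which matches the slide's 160% and 13% figures.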
GPU Programming

GPU Libraries
Copy-paste to accelerate applications
NVIDIA cuBLAS
NVIDIA cuRAND
NVIDIA cuSPARSE
NVIDIA NPP
NVIDIA cuFFT
Vector Signal Image Processing
GPU Accelerated Linear Algebra
Matrix Algebra on GPU and Multicore
Sparse Linear Algebra
IMSL Library
Building-block Algorithms for CUDA
C++ STL Features for CUDA

OpenACC Directives
Simple hints for the compiler
The compiler parallelizes the code
Works on multicore CPUs and massively parallel GPUs

Your original C/Fortran source code (serial parts run on the CPU, the marked region on the GPU):

Program myscience
   ... serial code ...
!$acc kernels                  <- OpenACC compiler hint
   do k = 1,n1
      do i = 1,n2
         ... parallel code ...
      enddo
   enddo
!$acc end kernels
   ...
End Program myscience

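The slide shows the Fortran form of the directive. Since it states that the same approach applies to C, here is a minimal C sketch of the equivalent pattern. The function name `saxpy_acc` and the data clauses are assumptions chosen for illustration, not taken from the slides. A compiler without OpenACC support ignores the pragma and runs the loop serially on the CPU.

```c
#include <stddef.h>

/* Scale-and-add over two arrays: y[i] = a*x[i] + y[i].
   The pragma is the only OpenACC-specific line. copyin/copy tell an
   OpenACC compiler which array sections to move to and from the GPU;
   restrict promises that x and y do not overlap. */
void saxpy_acc(size_t n, float a, const float *restrict x, float *restrict y)
{
    #pragma acc kernels copyin(x[0:n]) copy(y[0:n])
    for (size_t i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
```

Whether the loop actually runs on a GPU depends on the compiler and its flags; the source itself stays valid serial C either way.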
Minimal effort. Tangible results.
Project                                       Institution                Result
Life-cycle model of marine fauna              University of Melbourne    65x in 2 days
Stars and galaxies 12.5 billion years ago     University of Groningen    5.6x in 5 days
Neural networks for self-learning robots      University of Plymouth     4.7x in 4 hours

OpenACC workshop at the Pittsburgh Supercomputing Center
By the end of the second day: a 10x speedup of one of the atmospheric kernels, using 6 directives.
    Technology Director, National Center for Atmospheric Research (NCAR)

C, C++, and Fortran support in the CUDA parallel programming model
[Diagram: software stack, top to bottom.]
GPU Computing Applications
Libraries and Middleware: cuFFT, cuBLAS, cuRAND, cuSPARSE, LAPACK, CULA, MAGMA, NPP, cuDPP, Thrust, VSIPL, SVM, OpenCurrent, PhysX, Video, OptiX Ray tracing, iray, Rendering, RealityServer, MATLAB, Mathematica
Languages and APIs: C++, C, Fortran, Java and Python Wrappers, DirectCompute, OpenCL(tm)
NVIDIA GPU: CUDA Parallel Computing Architecture
OpenCL is trademark of Apple Inc. used under license to the Khronos Group Inc.

C for CUDA: C + "syntactic sugar"

Standard C code:

    void saxpy_serial(int n, float a, float *x, float *y)
    {
        for (int i = 0; i < n; ++i)
            y[i] = a*x[i] + y[i];
    }
    // Invoke serial SAXPY kernel
    saxpy_serial(n, 2.0, x, y);

Parallel C code:

    __global__ void saxpy_parallel(int n, float a, float *x, float *y)
    {
        int i = blockIdx.x*blockDim.x + threadIdx.x;
        if (i < n) y[i] = a*x[i] + y[i];
    }
    // Invoke parallel SAXPY kernel with 256 threads/block
    int nblocks = (n + 255) / 256;
    saxpy_parallel<<<nblocks, 256>>>(n, 2.0, x, y);

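The launch arithmetic on this slide, `(n + 255) / 256` blocks plus the `if (i < n)` guard, can be checked on a CPU. The following plain C sketch is an emulation under stated assumptions, not CUDA code: the names `blocks_for` and `saxpy_emulated` are hypothetical, and the two nested loops stand in for `blockIdx.x` and `threadIdx.x`, which a GPU would execute concurrently.

```c
#include <assert.h>

/* Number of blocks needed so that blocks * threads_per_block >= n.
   This is integer ceiling division; with 256 threads per block it
   reduces to (n + 255) / 256. */
int blocks_for(int n, int threads_per_block)
{
    assert(n >= 0 && threads_per_block > 0);
    return (n + threads_per_block - 1) / threads_per_block;
}

/* CPU stand-in for a kernel launch: the outer loop plays the role of
   blockIdx.x, the inner loop the role of threadIdx.x. */
void saxpy_emulated(int n, float a, const float *x, float *y,
                    int threads_per_block)
{
    int nblocks = blocks_for(n, threads_per_block);
    for (int block = 0; block < nblocks; ++block) {
        for (int thread = 0; thread < threads_per_block; ++thread) {
            int i = block * threads_per_block + thread; /* global index */
            if (i < n)  /* guard: the last block may be only partly used */
                y[i] = a * x[i] + y[i];
        }
    }
}
```

Rounding the block count up means the last block can contain threads whose index is past the end of the array; the `i < n` test is what keeps those threads from writing out of bounds.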
NVIDIA opens up the CUDA platform through LLVM
 The CUDA backend is now available for the LLVM compiler
 The SDK includes documentation, examples, and a verifier
 Makes it possible to add CUDA support to new languages and processors
 Details: http://developer.nvidia.com/cuda-source
[Diagram: CUDA C, C++, Fortran and new language support feed the LLVM compiler for CUDA, which targets NVIDIA GPUs, x86 CPUs, and new processor support.]

Kepler: full GPUDirect support for the first time
[Diagram: Server 1 and Server 2, each with a CPU and system memory plus GPU1 and GPU2 with GDDR5 memory, connected over PCI-e to a network card. The two network cards are linked through the network.]

CUDA by the numbers:
>375,000,000   CUDA GPUs on the market
  >1,000,000   SDK downloads
    >120,000   active developers
        >500   universities teach CUDA

What's next?

CUDA for ARM
Research platform (CUDA GPU + Tegra ARM CPU):
    Quad-core ARM-based NVIDIA Tegra 3 processor
    NVIDIA CUDA GPU
    Gbit network
Developer kit:
    CUDA SDK
http://www.secoqseven.com/en/item/secocq7-mxm/
Available now

Nvidia kepler architecture performance efficiency availability @ hpcday 2012 kiev
