CUDA Distributed Computing
14 2
Contents
I. Parallel Programming
1. 螳牛 螳
2. Understanding CUDA Distributed Computing
II. Development Environment
1. Setting Up the Development Environment
2. Preparing to Run Parallel Processing
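I. Parallel Programming
Multi Core Many Core
Case 1. Matrix multiplication (CPU)
[Figure: matrices a and b holding the cells 1 3 5 7 / 2 4 6 8, visited one at a time: 8 accesses]
In a conventional program, a CPU core processes every operation sequentially.
For workloads such as matrix operations, which repeat the same computation many times,
this is inefficient, and the limited number of CPU cores caps how far adding threads can help.
for {
    for {
        mat(i)*mat(j);
    }
}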
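The nested loop on this slide is pseudocode. A minimal C sketch of the sequential case it describes could look like the following; the function name `matmul_seq`, the `int` element type, and the row-major layout are assumptions made for illustration and do not come from the slides.

```c
#include <assert.h>
#include <stddef.h>

/* Sequential n x n matrix multiply, c = a * b, row-major storage.
 * One CPU core visits every output cell in turn, so the work grows
 * with n * n * n multiply-adds. */
void matmul_seq(const int *a, const int *b, int *c, size_t n)
{
    assert(a && b && c);
    for (size_t row = 0; row < n; row++) {
        for (size_t col = 0; col < n; col++) {
            int sum = 0;
            for (size_t k = 0; k < n; k++)
                sum += a[row * n + k] * b[k * n + col];
            c[row * n + col] = sum;
        }
    }
}
```

Calling `matmul_seq(a, b, c, 2)` on two 2 x 2 arrays fills `c` one cell after another, which is the access pattern the slide contrasts with the GPU case.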
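Case 1. Matrix multiplication (GPU)
[Figure: matrices a, b, c, d with cells 1 1 1 1 / 2 2 2 2, each cell assigned to its own core]
After the data is copied into GPU global memory, each CUDA core
processes one matrix cell in parallel. Because there are many CUDA cores,
many cells are processed at once, which shortens the overall run time.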
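The per-cell decomposition described here can be emulated on the CPU. This is only a sketch under assumptions: `matmul_cell` stands in for the work of one GPU thread and `matmul_per_cell` stands in for the kernel launch; both names are hypothetical, and on a real GPU the row and column would be derived from the thread and block indices instead of a loop counter.

```c
#include <assert.h>
#include <stddef.h>

/* Work of one hypothetical GPU thread: compute a single output cell. */
int matmul_cell(const int *a, const int *b, size_t n, size_t row, size_t col)
{
    int sum = 0;
    for (size_t k = 0; k < n; k++)
        sum += a[row * n + k] * b[k * n + col];
    return sum;
}

/* CPU stand-in for the launch: the n * n calls do not depend on each
 * other, which is what lets a GPU run them concurrently. */
void matmul_per_cell(const int *a, const int *b, int *c, size_t n)
{
    assert(a && b && c);
    for (size_t cell = 0; cell < n * n; cell++)
        c[cell] = matmul_cell(a, b, n, cell / n, cell % n);
}
```

Because no call reads another call's result, the outer loop can be replaced by one thread per cell without changing the output.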
Note
From CUDA 6 onward, CPU and GPU memory can be unified.
http://www.theregister.co.uk/2013/11/16/nvidia_reveals_cuda_6_joins_cpugpu_shared_memory_party/
Setting Up a CUDA Development Environment on Linux
14. 2. 25
Contents
Development environment requirements
- Ubuntu Linux 12.04 Desktop
- nVidia Graphic driver
- GCC Compiler (v4.6)
- CUDA Toolkit (5.5)
nsight for eclipse
Adding a project using git
1. Ubuntu Linux 12.04
http://www.ubuntu.com/
* All environments are standardized on Ubuntu 12.04
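2. Graphic Driver
Check whether the graphics card supports CUDA.
$ lspci | grep -i nvidia
Use the following site to look up the Compute Capability that the card supports.
* https://developer.nvidia.com/cuda-gpus
The required versions of the supporting software are listed here:
http://docs.nvidia.com/cuda/cuda-getting-started-guide-for-linux/index.html#system-requirements
* If the driver is not installed yet, refer to the sites above.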
3. GCC Compiler
Installing the GCC Compiler
$ sudo apt-cache search gcc        # search the repository
$ sudo apt-get install gcc-4.6     # install GCC v4.6
* The 4.8 compiler can cause problems, so use version 4.6 if possible.
4. CUDA Toolkit
1. Install CUDA from a terminal
$ sudo apt-get update
$ sudo apt-get install cuda -y
Alternatively, download the .run installer and run it:
> https://developer.nvidia.com/cuda-downloads
2. Set the paths
1) In the home directory, run ls -a to check that .bashrc exists
2) Open .bashrc with vim and add the following lines
export PATH=/usr/local/cuda-5.5/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-5.5/lib64:$LD_LIBRARY_PATH
3) Apply the paths with source ~/.bashrc
http://docs.nvidia.com/cuda/cuda-getting-started-guide-for-linux/index.html#system-requirements
If .bashrc does not exist, create it with the touch command.
Running nSight
Installing git Core
Work with: http://download.eclipse.org/releases/juno
help
Eclipse Egit ,
Mylyn gitHub Feature,
Eclipse jGit,
Eclipse Mylyn
ffmpeg Project Importing
https://trac.ffmpeg.org/wiki/How%20to%20setup%20Eclipse%20IDE%20for%20FFmpeg%20development
Setup instructions provided on the ffmpeg site (URL above)
 
Next
 
Next
ffmpeg source git path:
git://source.ffmpeg.org/ffmpeg.git
Deselect All,
then select only the Master branch
Next
Specify the directory where the
downloaded project should be placed
Next
Cloning 覃 企
襦碁ゼ
ffmpeg optimization using CUDA
In a terminal, run configure in the folder of the generated project;
once it finishes, the environment is ready for building.
Run a test after the build completes
CUDA - DCT Processing Optimization
14. 3. 11
http://en.wikipedia.org/wiki/Discrete_cosine_transform
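Contents
Frequency transform operation analysis
- DST Processing in ffmpeg
- GOLD version vs CUDA version
Frequency transform optimization process
CPU / GPU compute performance comparison
- Matrix operation time comparison
- Frequency transform optimization results
Frequency transform functions
static void FUNC(transform_32x32_add)()
static void FUNC(transform_16x16_add)()
static void FUNC(transform_8x8_add)()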

DST Processing in ffmpeg
* <ffmpeg> libavcodec/hevcdsp_template.c
static void FUNC(transform_32x32_add)(uint8_t *_dst, int16_t *coeffs,
                                      ptrdiff_t stride)
{
    int i;
    pixel *dst = (pixel *)_dst;
    int shift = 7;
    int add = 1 << (shift - 1);
    int16_t *src = coeffs;
    stride /= sizeof(pixel);
    /* first pass: 32 calls to TR_32 */
    for (i = 0; i < 32; i++) {
        TR_32(src, src, 32, 32, SCALE);
        src++;
    }
    src = coeffs;
    shift = 20 - BIT_DEPTH;
    add = 1 << (shift - 1);
    /* second pass: 32 calls to TR_32 */
    for (i = 0; i < 32; i++) {
        TR_32(dst, coeffs, 1, 1, ADD_AND_SCALE);
        coeffs += 32;
        dst += stride;
    }
}
transform_32x32_add(): each loop runs 32 times
#define TR_32(dst, src, dstep, sstep, assign)                 \
    do {                                                      \
        int i, j;                                             \
        int e_32[16];                                         \
        int o_32[16] = { 0 };                                 \
        for (i = 0; i < 16; i++)                              \
            for (j = 1; j < 32; j += 2)                       \
                o_32[i] += transform[j][i] * src[j * sstep];  \
        TR_16(e_32, src, 1, 2 * sstep, SET);                  \
                                                              \
        for (i = 0; i < 16; i++) {                            \
            assign(dst[i * dstep], e_32[i] + o_32[i]);        \
            assign(dst[(31 - i) * dstep], e_32[i] - o_32[i]); \
        }                                                     \
    } while (0)
Operation count: 32 * (16*32), counted as 512 per TR_32 call
 32* [(16*32)+(8*16)]
#define TR_16(dst, src, dstep, sstep, assign)                    \
    do {                                                         \
        int i, j;                                                \
        int e_16[8];                                             \
        int o_16[8] = { 0 };                                     \
        for (i = 0; i < 8; i++)                                  \
            for (j = 1; j < 16; j += 2)                          \
                o_16[i] += transform[2 * j][i] * src[j * sstep]; \
        TR_8(e_16, src, 1, 2 * sstep, SET);                      \
                                                                 \
        for (i = 0; i < 8; i++) {                                \
            assign(dst[i * dstep], e_16[i] + o_16[i]);           \
            assign(dst[(15 - i) * dstep], e_16[i] - o_16[i]);    \
        }                                                        \
    } while (0)
Per-call counts: TR_32 = 512, TR_16 = 128
 32*{[(16*32)+(8*16)]+(4*8)}
#define TR_8(dst, src, dstep, sstep, assign)                    \
    do {                                                        \
        int i, j;                                               \
        int e_8[4];                                             \
        int o_8[4] = { 0 };                                     \
        for (i = 0; i < 4; i++)                                 \
            for (j = 1; j < 8; j += 2)                          \
                o_8[i] += transform[4 * j][i] * src[j * sstep]; \
        TR_4(e_8, src, 1, 2 * sstep, SET);                      \
                                                                \
        for (i = 0; i < 4; i++) {                               \
            assign(dst[i * dstep], e_8[i] + o_8[i]);            \
            assign(dst[(7 - i) * dstep], e_8[i] - o_8[i]);      \
        }                                                       \
    } while (0)
Per-call counts: TR_32 = 512, TR_16 = 128, TR_8 = 32
GOLD version operation count: transform_32x32_add()
(TR_32 (512) + TR_16 (128) + TR_8 (32) + TR_4 (8)) x 32 = 21,760
21,760 x 2 = 43,520
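GOLD Ver. vs CUDA Ver.: transform_32x32_add()
for (i = 0; i < 16; i++)
    for (j = 1; j < 32; j += 2)
        o_32[i] += transform[j][i] * src[j * sstep];
GOLD: the loop over i and j is counted as 512 sequential operations
CUDA: 512 threads, so the same work is counted as 1 parallel step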
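Assigning one thread to each step of the loop nest needs a mapping from a flat work-item id back to the loop indices. The sketch below emulates that mapping on the CPU. It enumerates the (i, j) pairs of the loop as written (16 values of i and the 16 odd values of j); `pair_from_id` and `odd_part_32` are hypothetical names, and the transform table is passed as a flat row-major 32 x 32 array of `int` purely for illustration.

```c
#include <assert.h>

#define ODD_ROWS 16   /* values of i in the TR_32 odd-part loop */
#define ODD_COLS 16   /* odd values of j: 1, 3, ..., 31 */

/* Map a flat work-item id to the (i, j) pair the sequential loop visits. */
void pair_from_id(int id, int *i, int *j)
{
    assert(id >= 0 && id < ODD_ROWS * ODD_COLS);
    *i = id / ODD_COLS;
    *j = 2 * (id % ODD_COLS) + 1;
}

/* CPU emulation of the per-item work. The products are independent,
 * but the accumulation into o_32[i] is shared by all items with the
 * same i, so a real kernel would need a reduction or an atomic add. */
void odd_part_32(const int *transform, const int *src, int sstep, int o_32[16])
{
    for (int i = 0; i < ODD_ROWS; i++)
        o_32[i] = 0;
    for (int id = 0; id < ODD_ROWS * ODD_COLS; id++) {
        int i, j;
        pair_from_id(id, &i, &j);
        o_32[i] += transform[j * 32 + i] * src[j * sstep];
    }
}
```

The single-step count on the slide covers only the multiplications; the time to combine the partial products into each `o_32[i]` depends on how the reduction is implemented.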
GOLD Ver. vs CUDA Ver.: transform_32x32_add()
GOLD Ver.: (TR_32 (512) + TR_16 (128) + TR_8 (32) + TR_4 (8)) x 32 = 21,760
CUDA Ver.: (TR_32 (1) + TR_16 (1) + TR_8 (1) + TR_4 (8)) x 32 = 352
GOLD Ver. vs CUDA Ver.: operation count comparison
Ver.     DST 32 X 32   16 X 16   8 X 8
GOLD     43,520        5,376     640
CUDA     704           320       144
Ratio    X 61.8        X 16.8    X 4.4
* One operation means one evaluation of
  o_n[i] += transform[n * j][i] * src[j * sstep];
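The table entries follow from the per-macro counts on the earlier slides. A small C sketch that reproduces the arithmetic is shown below; it takes the slide figures (512, 128, 32, and 8 for TR_4) as given, treats each of TR_32, TR_16 and TR_8 as 1 step in the CUDA case, and uses hypothetical names `gold_cost` and `transform_ops`.

```c
#include <assert.h>

/* Odd-part count per macro call, as stated on the slides. */
static int gold_cost(int size)
{
    switch (size) {
    case 32: return 512;
    case 16: return 128;
    case 8:  return 32;
    default: return 8;   /* TR_4 is left sequential in both versions */
    }
}

/* Total count for an n x n block: sum the nested macro costs from TR_n
 * down to TR_4, multiply by n calls per pass, then by 2 passes. */
int transform_ops(int n, int parallel)
{
    assert(n == 8 || n == 16 || n == 32);
    int per_call = 0;
    for (int size = n; size >= 4; size /= 2) {
        if (parallel && size > 4)
            per_call += 1;              /* one parallel step */
        else
            per_call += gold_cost(size);
    }
    return per_call * n * 2;
}
```

For example, `transform_ops(32, 0)` gives the GOLD figure for the 32 x 32 block and `transform_ops(32, 1)` gives the CUDA figure; dividing the two yields the ratio row of the table.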
Frequency transform optimization process
makefile
libavcodec/
    makefile
    hevcdsp.h
    hevcdsp.c
    hevcdsp_template.c
    (+) hevcdsp_CUDA_functions.cu
* All CUDA functions are implemented in hevcdsp_CUDA_functions.cu.
  hevcdsp.c and hevcdsp_template.c call the CUDA functions
  where they are needed.
Top-level makefile
LIBS-ffmpeg += -L /usr/local/cuda/lib64 -lcudart
LIBS-ffprobe += -L /usr/local/cuda/lib64 -lcudart
LIBS-ffserver += -L /usr/local/cuda/lib64 -lcudart
...
libavcodec/hevcdsp_CUDA_functions.o: libavcodec/hevcdsp_CUDA_functions.cu
	/usr/local/cuda-5.5/bin/nvcc -G -g -O0 -gencode arch=compute_10,code=sm_10 -odir "." \
		-M -o "libavcodec/hevcdsp_CUDA_functions.d" "libavcodec/hevcdsp_CUDA_functions.cu"
	/usr/local/cuda-5.5/bin/nvcc --compile -G -O0 -g -gencode arch=compute_10,code=compute_10 \
		-gencode arch=compute_10,code=sm_10 -x cu -o "libavcodec/hevcdsp_CUDA_functions.o" \
		"libavcodec/hevcdsp_CUDA_functions.cu"
* 鍖mpeg 煙 cuda-template   蟆  譴.
* The build rule for the .cu file is specified in the top-level makefile.
libavcodec - makefile
OBJS-$(CONFIG_HEVC_DECODER) += hevcdsp_CUDA_functions.o
libavcodec - hevcdsp.h
// (Yoon) bgn ...
void DP_Copy_transform_ToCudaMem();
void DP_Free_transform_FromCudaMem();
void DP_TR8_Add(int8_t *T, int8_t *S, int8_t *O, int sstep);
...
// (Yoon) ... end
* Functions implemented in hevcdsp_CUDA_functions.cu must be declared in
  hevcdsp.h before they can be called.
libavcodec - hevcdsp_template.c
#define TR_8(dst, src, dstep, sstep, assign)               \
    do {                                                   \
        int i;                                             \
        int e_8[4];                                        \
        int o_8[4] = { 0 };                                \
                                                           \
        /* odd part is computed on the GPU */              \
        DP_Copy_src_ToCudaMem(8, sstep);                   \
        DP_TR8_Add(o_8, sstep);                            \
        DP_Free_src_FromCudaMem();                         \
                                                           \
        TR_4(e_8, src, 1, 2 * sstep, SET);                 \
                                                           \
        for (i = 0; i < 4; i++) {                          \
            assign(dst[i * dstep], e_8[i] + o_8[i]);       \
            assign(dst[(7 - i) * dstep], e_8[i] - o_8[i]); \
        }                                                  \
    } while (0)
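When the odd-part loop is moved to the GPU, a plain CPU reference is useful for checking the values written to o_8. The following is a sketch of such a reference, derived from the original TR_8 loop; `tr8_odd_reference` is a hypothetical helper, and the flat row-major 32 x 32 `int` transform table is an assumption made so the example is self-contained (the ffmpeg source uses its own table and element types).

```c
#include <assert.h>

/* CPU reference for the odd part of TR_8:
 *   o_8[i] = sum over odd j in 1..7 of transform[4 * j][i] * src[j * sstep]
 * transform is a flat row-major table with 32 entries per row. */
void tr8_odd_reference(const int *transform, const int *src, int sstep, int o_8[4])
{
    assert(transform && src && o_8);
    for (int i = 0; i < 4; i++) {
        o_8[i] = 0;
        for (int j = 1; j < 8; j += 2)
            o_8[i] += transform[(4 * j) * 32 + i] * src[j * sstep];
    }
}
```

Comparing this output with the array filled by the GPU path on the same input is one way to confirm that the replacement preserves the result.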
libavcodec - hevcdsp_CUDA_functions.cu
void DP_Copy_transform_ToCudaMem( )
void DP_TR8_Add (int *o_8, int sstep)
__global__ void TR8_PARALLEL_ADD (int8_t *T, int8_t *S, int8_t *O, int sstep)
Compute performance comparison
Processor   CORE i5-3230M   GeForce GT 740M
Clock       2.60 GHz        1.03 GHz
Cores       2 Cores         384 CUDA Cores
* CPU specifications were read from the Windows PC information;
  GPU specifications were read with deviceQuery.exe.
Compute performance comparison: 2D (16x16) matrix multiplication
Code: MatrixMul.cu
                   CPU      GPU (384 cores)
BIG-OH NOTATION    O(n^3)   O(n)
Operation count    4096     16
* 16 x 16 = 256
* Built in VS debug mode.
* GPU : 65536 Thread (256 grid, 256 block)
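Reading "256 grid, 256 block" as 256 blocks of 256 threads each (an interpretation, since the slide does not spell it out), the thread count and the flat index of each thread follow the usual one-dimensional launch arithmetic. The C sketch below mirrors the expression CUDA kernels commonly use, blockIdx.x * blockDim.x + threadIdx.x; the function names are hypothetical.

```c
#include <assert.h>

/* Flat index of a thread in a 1-D launch:
 * block index * threads per block + thread index within the block. */
int global_thread_index(int block, int threads_per_block, int thread)
{
    assert(block >= 0 && thread >= 0 && thread < threads_per_block);
    return block * threads_per_block + thread;
}

/* Total number of threads started by the launch. */
int total_threads(int blocks, int threads_per_block)
{
    return blocks * threads_per_block;
}
```

With 256 blocks and 256 threads per block, `total_threads` returns 65536, matching the figure on the slide, and the indices run from 0 to 65535.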
Run               1      2      3      Average
CPU               91ms   77ms   76ms   81.3ms
GPU (384 cores)   4ms    4ms    4ms    4ms
Compute performance comparison: 2D (16x16) matrix multiplication
Average
CPU               81.3ms   X 1
GPU (384 cores)   4ms      X 20
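The average and the speed-up factor can be recomputed from the three measured runs. A minimal sketch, with hypothetical helper names `mean_ms` and `speedup`:

```c
#include <assert.h>
#include <stddef.h>

/* Arithmetic mean of a set of timings in milliseconds. */
double mean_ms(const double *samples, size_t count)
{
    assert(samples && count > 0);
    double sum = 0.0;
    for (size_t k = 0; k < count; k++)
        sum += samples[k];
    return sum / (double)count;
}

/* How many times faster the accelerated run is than the baseline. */
double speedup(double baseline_ms, double accelerated_ms)
{
    assert(accelerated_ms > 0.0);
    return baseline_ms / accelerated_ms;
}
```

Feeding in the CPU runs (91, 77, 76 ms) and the GPU runs (4, 4, 4 ms) gives a mean of about 81.3 ms against 4 ms, which is the roughly 20-fold difference reported on the slide.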
hevcdsp_template_CUDA.cu
14. 4. 1
http://en.wikipedia.org/wiki/Discrete_cosine_transform

 るる 豕
Memory transfer: cudaMemcpy -> cudaHostRegister
Computation (cuBLAS): simplMul -> cublasSgemm
Matrix multiplication
              matrixMul.cu (CUDA)   matrixMulCUBLAS.cpp (CUBLAS)   Ratio
Performance   10.33                 236.73                         x 23
Time          12.693                0.554                          x 23
