
CUDA Distributed Computing
February 2014

Contents
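I. Parallel Programming
1. Brief introduction
2. Understanding CUDA Distributed Computing
II. Development Environment
1. Setting up the development environment
2. Effective efficiency ratio of parallel processing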

I. Parallel Programming
Multi Core / Many Core

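Case 1. Matrix multiplication
[Figure: matrices a and b, with cells numbered 1 to 8 in the order they are accessed. A single thread performs 8 accesses.]
In a conventional program, a CPU core handles all of the integer arithmetic. For loop-heavy programs such as matrix operations this proceeds inefficiently, and a CPU has too few cores to simply add more threads.
for (i = 0; i < n; i++) {
    for (j = 0; j < n; j++) {
        mat(i) * mat(j);    /* one thread visits each cell in turn */
    }
}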

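Case 1. Matrix multiplication
[Figure: matrices a, b, c and d, with cells labelled 1 1 1 1 / 2 2 2 2. All cells proceed at the same time.]
As an alternative, the data is copied to GPU global memory and each CUDA core processes one matrix cell in parallel. In other words, the more CUDA cores there are, the more work can be processed in parallel at once.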
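The one-cell-per-core idea can be sketched in plain C. This is a minimal illustration, not the code from the slides: the names matmul_cell and matmul_all_cells are hypothetical, and the matrices are assumed to be square and stored row-major in flat arrays. The first function is the work a single GPU thread would do; the second is a CPU stand-in that calls it once per cell, where a GPU would run all of the calls concurrently.

```c
#include <stddef.h>

/* Work of ONE GPU thread (hypothetical helper): compute the single
 * cell c[row][col] of c = a * b for n x n row-major matrices. */
int matmul_cell(const int *a, const int *b, size_t n,
                size_t row, size_t col)
{
    int sum = 0;
    for (size_t k = 0; k < n; k++)
        sum += a[row * n + k] * b[k * n + col];
    return sum;
}

/* CPU stand-in for a kernel launch: the cells are independent of each
 * other, so a GPU could evaluate every matmul_cell call at once. */
void matmul_all_cells(const int *a, const int *b, int *c, size_t n)
{
    for (size_t row = 0; row < n; row++)
        for (size_t col = 0; col < n; col++)
            c[row * n + col] = matmul_cell(a, b, n, row, col);
}
```

Because no cell depends on another cell's result, the two outer loops carry no ordering constraint, which is what makes the per-cell split possible.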

Note
From CUDA 6 onward, CPU and GPU memory are unified.
• http://www.theregister.co.uk/2013/11/16/nvidia_reveals_cuda_6_joins_cpugpu_shared_memory_party/

Setting Up a CUDA Development Environment on Linux
2014. 2. 25 (Tue)

Contents
Setting up the development environment
- Ubuntu Linux 12.04 Desktop
- nVidia Graphic driver
- GCC Compiler (v4.6)
- CUDA Toolkit (5.5)
nsight for eclipse
Adding a project using git

1. Ubuntu Linux 12.04
http://www.ubuntu.com/
* All experiments use Ubuntu 12.04.

2. Graphic Driver
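Check whether the graphics card supports CUDA.
$ lspci | grep -i nvidia
If the command prints nothing, the driver version needs to be updated.
Check CUDA support and the Compute Capability on the following site.
* https://developer.nvidia.com/cuda-gpus
* The driver is normally installed together with the Linux operating system; if it is not, refer to the site below.
http://docs.nvidia.com/cuda/cuda-getting-started-guide-for-linux/index.html#system-requirements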

3. GCC Compiler
Installing the GCC compiler
$ sudo apt-cache search gcc       # search the repository
$ sudo apt-get install gcc-4.6    # install GCC v4.6
* A compile error was observed with version 4.8, so version 4.6 is recommended where possible.

4. CUDA Toolkit
1. Install CUDA from the terminal
$ sudo apt-get update
$ sudo apt-get install cuda -y

Alternatively, download the .run file from the website and install it.
> https://developer.nvidia.com/cuda-downloads
2. Set the environment variables
1) In the home directory, check whether '.bashrc' exists with ls -a. If it does not, create it with the touch command.
2) Open .bashrc with vim and add the following lines:
export PATH=/usr/local/cuda-5.5/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-5.5/lib64:$LD_LIBRARY_PATH
3) Apply the environment variables with source ~/.bashrc
http://docs.nvidia.com/cuda/cuda-getting-started-guide-for-linux/index.html#system-requirements

Installing git Core
In the Help tab, set Work with: http://download.eclipse.org/releases/juno

Select the following:
Eclipse EGit,
Mylyn GitHub Feature,
Eclipse JGit,
Eclipse Mylyn

ffmpeg Project Importing
Setup guide provided on the ffmpeg site:
https://trac.ffmpeg.org/wiki/How%20to%20setup%20Eclipse%20IDE%20for%20FFmpeg%20development

ffmpeg source git path:
git://source.ffmpeg.org/ffmpeg.git

Click Deselect All, select only the master branch, then click Next.

Set the directory where the downloaded project should be saved, then click Next.

When cloning finishes, the project ...

ffmpeg optimization using CUDA

In a terminal, run configure in the folder of the created project. Once it completes, the environment is ready for building.

Example run after the build completes

CUDA - DCT Processing Optimization
2014. 3. 11 (Tue)
http://en.wikipedia.org/wiki/Discrete_cosine_transform

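Contents
• Frequency transform computation
- DST Processing in ffmpeg
- GOLD version vs CUDA version
• Frequency transform optimization process
• CPU / GPU compute performance comparison
- Hardware specifications
- Matrix multiplication timing comparison
- Frequency transform optimization results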

DST Processing in ffmpeg
Functions used for the frequency transform:
• static void FUNC(transform_32x32_add)(…)
• static void FUNC(transform_16x16_add)(…)
• static void FUNC(transform_8x8_add)(…)
* <ffmpeg> libavcodec/hevcdsp_template.c

static void FUNC(transform_32x32_add)(uint8_t *_dst, int16_t *coeffs,
                                      ptrdiff_t stride)
{
    int i;
    pixel *dst   = (pixel *)_dst;
    int shift    = 7;
    int add      = 1 << (shift - 1);
    int16_t *src = coeffs;
    stride /= sizeof(pixel);
    /* first pass: TR_32 is invoked 32 times */
    for (i = 0; i < 32; i++) {
        TR_32(src, src, 32, 32, SCALE);
        src++;
    }
    src   = coeffs;
    shift = 20 - BIT_DEPTH;
    add   = 1 << (shift - 1);
    /* second pass: TR_32 is invoked another 32 times */
    for (i = 0; i < 32; i++) {
        TR_32(dst, coeffs, 1, 1, ADD_AND_SCALE);
        coeffs += 32;
        dst    += stride;
    }
}
• transform_32x32_add(…) invokes TR_32 32 times in each of its two loops.
#define TR_32(dst, src, dstep, sstep, assign) \
    do { \
        int i, j; \
        int e_32[16]; \
        int o_32[16] = { 0 }; \
        for (i = 0; i < 16; i++) \
            for (j = 1; j < 32; j += 2) \
                o_32[i] += transform[j][i] * src[j * sstep]; \
        TR_16(e_32, src, 1, 2 * sstep, SET); \
 \
        for (i = 0; i < 16; i++) { \
            assign(dst[i * dstep], e_32[i] + o_32[i]); \
            assign(dst[(31 - i) * dstep], e_32[i] - o_32[i]); \
        } \
    } while (0)
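The last loop of the macro is a butterfly: the even part (computed by the nested, half-size transform) and the odd part are added to give the first half of the outputs and subtracted to give the mirrored second half. A minimal function form of that step is sketched below. The name butterfly and the plain assignment are assumptions made for illustration; the macro itself writes through the assign argument and a dstep stride.

```c
/* Butterfly step shared by TR_32, TR_16 and TR_8 (hypothetical helper):
 * combine an even part e[] and an odd part o[], each of length half,
 * into 2 * half outputs. Plain stores are used here in place of the
 * macro's assign() argument, and the destination stride is 1. */
void butterfly(int *dst, const int *e, const int *o, int half)
{
    int n = 2 * half;
    for (int i = 0; i < half; i++) {
        dst[i]         = e[i] + o[i];  /* first half, ascending   */
        dst[n - 1 - i] = e[i] - o[i];  /* second half, mirrored   */
    }
}
```

With half = 16 this corresponds to the dst[i] and dst[31 - i] stores of TR_32.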
• 32 * (16*32)
TR_32: 512

• 32 * [(16*32) + (8*16)]
#define TR_16(dst, src, dstep, sstep, assign) \
    do { \
        int i, j; \
        int e_16[8]; \
        int o_16[8] = { 0 }; \
        for (i = 0; i < 8; i++) \
            for (j = 1; j < 16; j += 2) \
                o_16[i] += transform[2 * j][i] * src[j * sstep]; \
        TR_8(e_16, src, 1, 2 * sstep, SET); \
 \
        for (i = 0; i < 8; i++) { \
            assign(dst[i * dstep], e_16[i] + o_16[i]); \
            assign(dst[(15 - i) * dstep], e_16[i] - o_16[i]); \
        } \
    } while (0)
TR_32: 512, TR_16: 128

• 32 * {[(16*32) + (8*16)] + (4*8)}
#define TR_8(dst, src, dstep, sstep, assign) \
    do { \
        int i, j; \
        int e_8[4]; \
        int o_8[4] = { 0 }; \
        for (i = 0; i < 4; i++) \
            for (j = 1; j < 8; j += 2) \
                o_8[i] += transform[4 * j][i] * src[j * sstep]; \
        TR_4(e_8, src, 1, 2 * sstep, SET); \
 \
        for (i = 0; i < 4; i++) { \
            assign(dst[i * dstep], e_8[i] + o_8[i]); \
            assign(dst[(7 - i) * dstep], e_8[i] - o_8[i]); \
        } \
    } while (0)
TR_32: 512, TR_16: 128, TR_8: 32

GOLD version: operation count
[TR_32 (512) + TR_16 (128) + TR_8 (32) + TR_4 (8)] x 32 = 21,760
21,760 x 2 = 43,520

GOLD Ver. vs CUDA Ver.
for (i = 0; i < 16; i++)
    for (j = 1; j < 32; j += 2)
[Figure: the i x j iteration space of the loop nest mapped onto 512 threads. GOLD = 512 steps, CUDA = 1 step.]
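Why the loop nest can collapse into one parallel step is sketched below in plain C. Every product in the odd-part accumulation depends only on its own (i, j) pair, so each one could be handed to a separate thread, leaving only a per-i sum as a reduction. The function names, the flat coef array and the TR_8-sized constants are assumptions for illustration; the actual kernel in hevcdsp_CUDA_functions.cu is not shown on the slides.

```c
#define TR_LEN  8              /* transform length (TR_8 sized example) */
#define TR_HALF (TR_LEN / 2)   /* number of odd-part outputs            */

/* GOLD style: one thread walks the whole i x j loop nest.
 * coef is a flat TR_LEN x TR_HALF table, indexed coef[j * TR_HALF + i]. */
void odd_part_sequential(int *o, const int *coef, const int *src)
{
    for (int i = 0; i < TR_HALF; i++) {
        o[i] = 0;
        for (int j = 1; j < TR_LEN; j += 2)
            o[i] += coef[j * TR_HALF + i] * src[j];
    }
}

/* Work of ONE hypothetical GPU thread: a single (i, j) product. */
int odd_part_product(const int *coef, const int *src, int i, int j)
{
    return coef[j * TR_HALF + i] * src[j];
}

/* CPU stand-in for the parallel version: the products are independent,
 * so a GPU could compute all of them in one step; the sums that follow
 * are a reduction over the odd j values. */
void odd_part_decomposed(int *o, const int *coef, const int *src)
{
    int products[TR_HALF][TR_HALF];
    for (int i = 0; i < TR_HALF; i++)        /* "thread" coordinate i   */
        for (int t = 0; t < TR_HALF; t++)    /* "thread" t, j = 2t + 1  */
            products[i][t] = odd_part_product(coef, src, i, 2 * t + 1);
    for (int i = 0; i < TR_HALF; i++) {
        o[i] = 0;
        for (int t = 0; t < TR_HALF; t++)
            o[i] += products[i][t];
    }
}
```

Both functions produce the same o[] values; only the order in which the products are formed differs.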

GOLD Ver.: [TR_32 (512) + TR_16 (128) + TR_8 (32) + TR_4 (8)] x 32 = 21,760
CUDA Ver.: [TR_32 (1) + TR_16 (1) + TR_8 (1) + TR_4 (8)] x 32 = 352

Operation count comparison
Ver.     DST 32 X 32   16 X 16   8 X 8
GOLD     43,520        5,376     640
CUDA     704           320       144
Ratio    X 61.8        X 16.8    X 4.4
* The counts are the total number of times the statement
o_n[i] += transform[n’ * j][i] * src[j * sstep];
is executed.
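The figures in the table follow from simple arithmetic, which the sketch below reproduces. It takes the per-level counts (512, 128, 32, 8) exactly as given on the slides; since the inner loops advance j by 2, that counting convention is the slides' own rather than a literal statement count. The function names and the parallel flag are hypothetical, and the sketch assumes, as the slides do, that each offloaded level costs one step while TR_4 stays sequential.

```c
/* Per-call counts as given on the slides for each transform level. */
enum { OPS_TR32 = 512, OPS_TR16 = 128, OPS_TR8 = 32, OPS_TR4 = 8 };

/* Count for ONE call of the size-n transform (n = 8, 16 or 32).
 * parallel == 0: GOLD, every nested level adds its full count.
 * parallel != 0: CUDA, each offloaded level counts as a single step;
 *                TR_4 is not offloaded and keeps its count of 8. */
int ops_per_call(int n, int parallel)
{
    int total = OPS_TR4;
    if (n >= 8)
        total += parallel ? 1 : OPS_TR8;
    if (n >= 16)
        total += parallel ? 1 : OPS_TR16;
    if (n >= 32)
        total += parallel ? 1 : OPS_TR32;
    return total;
}

/* transform_NxN_add calls the size-n transform n times in each of
 * its two passes, hence the factor n * 2. */
int ops_total(int n, int parallel)
{
    return ops_per_call(n, parallel) * n * 2;
}
```

For example, ops_total(32, 0) gives 43,520 and ops_total(32, 1) gives 704, a ratio of about 61.8.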

Frequency transform optimization process
• makefile
• libavcodec
↳ makefile
↳ hevcdsp.h
↳ hevcdsp.c
↳ hevcdsp_template.c
↳ (+) hevcdsp_CUDA_functions.cu
* All CUDA functions are defined in hevcdsp_CUDA_functions.cu. hevcdsp.c and hevcdsp_template.c use the CUDA functions from there.

Top-level makefile
LIBS-ffmpeg   += -L /usr/local/cuda/lib64 -lcudart
LIBS-ffprobe  += -L /usr/local/cuda/lib64 -lcudart
LIBS-ffserver += -L /usr/local/cuda/lib64 -lcudart
••••••
libavcodec/hevcdsp_CUDA_functions.o: libavcodec/hevcdsp_CUDA_functions.cu
	/usr/local/cuda-5.5/bin/nvcc -G -g -O0 -gencode arch=compute_10,code=sm_10 -odir "." \
		-M -o "libavcodec/hevcdsp_CUDA_functions.d" "libavcodec/hevcdsp_CUDA_functions.cu"
	/usr/local/cuda-5.5/bin/nvcc --compile -G -O0 -g -gencode arch=compute_10,code=compute_10 \
		-gencode arch=compute_10,code=sm_10 -x cu \
		-o "libavcodec/hevcdsp_CUDA_functions.o" "libavcodec/hevcdsp_CUDA_functions.cu"
* These settings let ffmpeg and the other programs use cuda-template.
* The build rule for the .cu file is specified in the top-level makefile.

libavcodec - makefile
OBJS-$(CONFIG_HEVC_DECODER) += hevcdsp_CUDA_functions.o

libavcodec - hevcdsp.h
// (Yoon) bgn ...
void DP_Copy_transform_ToCudaMem();
void DP_Free_transform_FromCudaMem();
void DP_TR8_Add(int8_t *T, int8_t *S, int8_t *O, int sstep);
••••••
// (Yoon) ... end
* Functions defined in hevcdsp_CUDA_functions.cu are declared in hevcdsp.h so that they can be used.

libavcodec - hevcdsp_template.c
do {
    int e_8[4];
    int o_8[4] = { 0 };

    DP_Copy_src_ToCudaMem(8, sstep);
    DP_TR8_Add(o_8, sstep);
    DP_Free_src_FromCudaMem();

    for (i = 0; i < 4; i++) {
    }
} while (0)

libavcodec - hevcdsp_CUDA_functions.cu
• void DP_Copy_transform_ToCudaMem( )
• void DP_TR8_Add (int *o_8, int sstep)
• __global__ void TR8_PARALLEL_ADD (int8_t *T, int8_t *S, int8_t *O, int sstep)

Compute performance comparison
              CPU              GPU
Processor     CORE i5-3230M    GeForce GT 740M
Clock         2.60 GHz         1.03 GHz
Cores         2 Cores          384 CUDA Cores
* CPU specifications were read from the Windows PC information; GPU specifications were read with deviceQuery.exe.

Compute performance comparison
Code used: MatrixMul.cu
- 2D matrix (16x16) multiplication
                   CPU       GPU (384 cores)
Big-O notation     O(n^3)    O(n)
Operation count    4096      16
* 16 x 16 = 256
* Built in VS debug mode.
* GPU: 65536 threads (256 grid, 256 block)
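The operation counts and the thread total above can be checked with a few lines of C. The sketch also shows one possible way to map a flat thread index onto a matrix cell. That 1-D mapping is an assumption, since the slides do not show how MatrixMul.cu indexes its threads, and all function names here are hypothetical.

```c
/* Multiplications done by a single sequential thread for an n x n product. */
long sequential_mults(long n)
{
    return n * n * n;
}

/* Multiplications per thread when each thread owns one output cell. */
long per_thread_mults(long n)
{
    return n;
}

/* Flat thread index for a 1-D launch, i.e. what
 * blockIdx.x * blockDim.x + threadIdx.x evaluates to in a kernel. */
long global_thread_id(long block, long threads_per_block, long thread)
{
    return block * threads_per_block + thread;
}

/* Assumed mapping from a flat thread index to a cell of an n x n matrix.
 * Returns 1 and fills row/col for a valid cell, 0 for a surplus thread. */
int thread_to_cell(long tid, long n, long *row, long *col)
{
    if (tid < 0 || tid >= n * n)
        return 0;
    *row = tid / n;
    *col = tid % n;
    return 1;
}
```

For n = 16 this gives 4096 multiplications sequentially against 16 per thread. Under the assumed mapping only the first 256 of the 65536 launched threads would own a cell, which is why a bounds check of this kind is needed.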

Compute performance comparison: 2D matrix (16x16) multiplication
                   Run 1    Run 2    Run 3    Average
CPU                91 ms    77 ms    76 ms    81.3 ms
GPU (384 cores)    4 ms     4 ms     4 ms     4 ms

[Chart: average time, axis from 0 to 90 ms. CPU: 81.3 ms (X 1). GPU (384 cores): 4 ms (X 20).]
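The average and the speed-up factor in the chart can be reproduced as follows. This is a small sketch with hypothetical helper names; the timings passed in are the three runs reported above.

```c
#include <stddef.h>

/* Arithmetic mean of n samples; returns 0 for an empty input. */
double mean(const double *samples, size_t n)
{
    double sum = 0.0;
    if (n == 0)
        return 0.0;
    for (size_t i = 0; i < n; i++)
        sum += samples[i];
    return sum / (double)n;
}

/* Speed-up of the optimized time relative to the baseline time. */
double speedup(double baseline_ms, double optimized_ms)
{
    return baseline_ms / optimized_ms;
}
```

With the CPU runs of 91, 77 and 76 ms the mean is about 81.3 ms, and dividing by the 4 ms GPU time gives a factor of roughly 20.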

hevcdsp_template_CUDA.cu
2014. 4. 1 (Tue)
http://en.wikipedia.org/wiki/Discrete_cosine_transform

…
• Overhead optimization
cudaMemcpy -> cudaHostRegister
• Compute optimization (cuBLAS)
simplMul -> cublasSgemm
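For reference, the operation that SGEMM performs in its no-transpose form is C = alpha * A * B + beta * C on column-major matrices. The plain C function below is a sketch of that definition, useful for checking results on small inputs. It is not cuBLAS code, and the name sgemm_ref is hypothetical.

```c
#include <stddef.h>

/* Reference for the no-transpose SGEMM operation:
 *     C = alpha * A * B + beta * C
 * A is m x k, B is k x n, C is m x n. All three are column-major with
 * leading dimensions lda, ldb and ldc, the BLAS storage convention. */
void sgemm_ref(size_t m, size_t n, size_t k, float alpha,
               const float *A, size_t lda,
               const float *B, size_t ldb,
               float beta, float *C, size_t ldc)
{
    for (size_t col = 0; col < n; col++) {
        for (size_t row = 0; row < m; row++) {
            float acc = 0.0f;
            for (size_t p = 0; p < k; p++)
                acc += A[row + p * lda] * B[p + col * ldb];
            C[row + col * ldc] = alpha * acc + beta * C[row + col * ldc];
        }
    }
}
```

Note the column-major layout: element (row, col) of A lives at A[row + col * lda], which differs from the row-major arrays that C code usually uses.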

Matrix multiplication optimization
               CUDA (matrixMul.cu)    CUBLAS (matrixMulCUBLAS.cpp)    Ratio
Performance    10.33                  236.73                          x 23
Time           12.693                 0.554                           x 23

