OpenCL Kernel Optimization Tips
Champ Yen (champ.yen@gmail.com)
http://champyen.blogspot.com
ver.20140820
Optimization - a form of balance
[Diagram: Device/Platform Features, Runtime Toolchain, Problem Algorithm, Optimization]
Optimization is not just a greedy search in a single direction. It is more like finding a good balance point between the device, the toolchain and the problem.
Device - Computation
• device type
  • CPU - powerful single-thread performance
  • GPU - many threads, high total throughput
• ISA design
  • scalar-based
  • vector-based
• number of compute units / processing elements
• estimate the impact of divergence and barriers
• capability for asynchronous data transfer
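
The "estimate the impact of divergence" point can be made concrete with a toy cost model. The sketch below is plain C, not OpenCL, and assumes a device that runs a group of lanes in lock step, so a group whose lanes disagree on a branch pays for both sides. The function name `branch_cost` and the cycle counts are hypothetical and only serve the illustration; real devices differ in the details.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical cost model for one if/else executed by a lock-step group.
 * takes_then[i] is nonzero when lane i takes the "then" side.
 * A side costs its full cycle count if at least one lane needs it,
 * because the other lanes are masked off and wait. */
unsigned branch_cost(const int *takes_then, size_t lanes,
                     unsigned then_cycles, unsigned else_cycles)
{
    int any_then = 0;
    int any_else = 0;

    assert(takes_then != NULL || lanes == 0);

    for (size_t i = 0; i < lanes; ++i) {
        if (takes_then[i])
            any_then = 1;
        else
            any_else = 1;
    }
    return (any_then ? then_cycles : 0u) + (any_else ? else_cycles : 0u);
}
```

Under this model a uniform group pays for one side only, while a group with even a single disagreeing lane pays for both, which is why sorting or regrouping work so that neighbouring work-items take the same path can pay off.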
Device - Memory
• get basic memory characteristics:
  • size
  • latency
  • throughput
  • coalescing effect
  • addressing mode
• global memory - unified with host memory or not
• local memory - real (dedicated hardware) or not
• penalty for exceeding the available size
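
To get a feel for the coalescing effect listed above, one can count how many aligned memory segments a group of work-items touches for a given access stride. The following is a minimal plain C sketch, not a measurement tool. It assumes work-item i reads the element at index i * stride, that elements never straddle a segment boundary, and that the segment size (64 bytes in the usage note) is a stand-in for whatever the device documentation states. `segments_touched` is a hypothetical helper name.

```c
#include <assert.h>
#include <stddef.h>

/* Count the distinct aligned segments touched when work-item i reads the
 * element at index i * stride_elems. Addresses never decrease as i grows,
 * so comparing against the previous segment index is enough. */
size_t segments_touched(size_t work_items, size_t stride_elems,
                        size_t elem_bytes, size_t segment_bytes)
{
    size_t count = 0;
    size_t last = (size_t)-1; /* sentinel: no segment seen yet */

    assert(segment_bytes > 0);

    for (size_t i = 0; i < work_items; ++i) {
        size_t seg = (i * stride_elems * elem_bytes) / segment_bytes;
        if (seg != last) {
            ++count;
            last = seg;
        }
    }
    return count;
}
```

With 16 work-items reading 4-byte floats and 64-byte segments, a stride of 1 touches a single segment, while a stride of 16 touches 16 segments for the same amount of useful data.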
Toolchain/Runtime
• read the documents/tutorials/guides for debugging, profiling and optimization.
• there is no perfect runtime/toolchain
  • profiling/debugging tools.
  • it is not always a good idea to debug/optimize on a different platform.
  • automatic optimization MAY NOT HELP when reasoning about optimization
• tricky forms of computation/memory operations.
  • MAD operations
  • memory access mode
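
As an illustration of the MAD point, a computation can often be rearranged so that every step has the shape a * b + c, which a compiler can map to a multiply-add instruction where the hardware has one. The plain C sketch below uses Horner's rule for a polynomial as the example; this choice of example is mine, not the deck's. In OpenCL C the loop body could be written with the built-in `mad` or `fma` functions. `poly_horner` is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>

/* Evaluate c[0] + c[1]*x + ... + c[n-1]*x^(n-1) with Horner's rule.
 * Each iteration is a single multiply-add: acc = acc * x + c[i]. */
float poly_horner(const float *c, size_t n, float x)
{
    float acc = 0.0f;

    assert(c != NULL || n == 0);

    /* Walk the coefficients from the highest power down to c[0]. */
    for (size_t i = n; i-- > 0; ) {
        acc = acc * x + c[i];
    }
    return acc;
}
```

For the coefficients {1, 2, 3} and x = 2 this evaluates 1 + 2*2 + 3*4 in three multiply-add steps, with no separate power computation.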
Problem/Algorithms
• DATA PARALLEL!
• multi-stage processing is not always bad.
  • doing everything at once uses more memory resources in one work-item.
• vectorization is not always a good idea
• use an appropriate work-group size; a poor choice can mean:
  • bad memory access pattern, less coalescing
  • lower cache hit rate
  • less local memory for each work-item
  • possibly less private memory for each work-item.
• try different forms of implementation
  • do optimizations manually.
  • DO NOT rely on automatic features.
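
Two small pieces of host-side arithmetic show how the work-group size feeds into the trade-offs above. This is a plain C sketch with hypothetical helper names. The first helper pads the global work size to a multiple of the work-group size, which OpenCL 1.x requires when a local size is passed explicitly; the kernel then has to skip work-item ids at or beyond the real problem size. The second assumes a single work-group has all of the local memory to itself, which is an upper bound rather than what a real device guarantees.

```c
#include <assert.h>
#include <stddef.h>

/* Round the problem size n up to the next multiple of the work-group size. */
size_t round_up_global(size_t n, size_t wg_size)
{
    assert(wg_size > 0);
    return (n + wg_size - 1) / wg_size * wg_size;
}

/* Local memory bytes available to each work-item when one work-group
 * uses the whole local memory (an optimistic upper bound). */
size_t local_bytes_per_item(size_t local_mem_bytes, size_t wg_size)
{
    assert(wg_size > 0);
    return local_mem_bytes / wg_size;
}
```

With 32 KiB of local memory, a work-group of 64 leaves up to 512 bytes per work-item and a work-group of 256 leaves 128, so a larger group is not automatically the better choice.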
Q & A
