The document provides optimization tips for OpenCL kernels that focus on finding a balance between the device, toolchain, and problem being solved. It discusses considering the device type and memory characteristics, using profiling tools suited for the target platform, ensuring the problem is data parallel, and manually optimizing aspects like work group size and memory access patterns rather than relying on automatic features. Optimization requires understanding tradeoffs between these elements rather than taking a single-minded approach.
2. Optimization - a form of balance
[Diagram: Optimization sits between three elements: Device/Platform (Features), Toolchain (Runtime), and Problem (Algorithm).]
Optimization is not a greedy search in a single direction. It is more like
finding a good balance point between the device, the toolchain, and the problem.
3. Device - Computation
device type
  CPU - powerful single-thread performance
  GPU - many threads, great total throughput
ISA design
  scalar-based
  vector-based
number of compute units / processing elements
estimate the impact of divergence and barriers
capability for asynchronous data transfer
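As a rough way to estimate the impact of divergence, the sketch below models a group of work-items that execute in lockstep. It is a deliberately simplified model and an assumption, not a vendor formula: if at least one lane takes each side of a branch, the group pays for both sides. The function name `simd_branch_cost` and the cost units are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified lockstep model (an assumption, not a measured device behavior):
 * takes_if[i] is nonzero when lane i takes the "if" side of a branch.
 * The group pays cost_if when any lane takes the "if" side and cost_else
 * when any lane takes the "else" side. */
unsigned simd_branch_cost(const int *takes_if, size_t lanes,
                          unsigned cost_if, unsigned cost_else)
{
    assert(takes_if != NULL && lanes > 0);

    int any_if = 0;
    int any_else = 0;
    for (size_t i = 0; i < lanes; ++i) {
        if (takes_if[i]) {
            any_if = 1;
        } else {
            any_else = 1;
        }
    }
    return (any_if ? cost_if : 0u) + (any_else ? cost_else : 0u);
}
```

Under this model a uniform branch costs only the side that is taken, while a single divergent lane makes the whole group pay for both sides. Real devices differ in group width and in how they handle short branches, so the model is best used for comparing variants of a kernel, not for predicting absolute times.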
4. Device - Memory
get the basic memory characteristics:
  size
  latency
  throughput
  coalescing effect
  addressing mode
global memory - unified with host memory or not
local memory - real (dedicated on-chip) or emulated in global memory
penalty when usage exceeds the available size
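The coalescing effect can be illustrated with a small counting model. The sketch below counts how many aligned memory segments a group of consecutive work-items touches when work-item i reads element i * stride. The segment size is a parameter here, not a device fact, and the helper name `segments_touched` is hypothetical. It also assumes that elements are aligned and never straddle a segment boundary.

```c
#include <assert.h>
#include <stddef.h>

/* Count the distinct aligned segments touched when `lanes` consecutive
 * work-items each read one element of `elem_bytes` bytes, and work-item i
 * reads element i * stride. Addresses grow monotonically with i, so a new
 * segment is detected whenever the segment index changes. */
size_t segments_touched(size_t lanes, size_t stride,
                        size_t elem_bytes, size_t segment_bytes)
{
    assert(lanes > 0 && stride > 0);
    assert(elem_bytes > 0 && segment_bytes >= elem_bytes);

    size_t count = 0;
    size_t last_segment = 0;
    for (size_t i = 0; i < lanes; ++i) {
        size_t segment = (i * stride * elem_bytes) / segment_bytes;
        if (i == 0 || segment != last_segment) {
            ++count;
            last_segment = segment;
        }
    }
    return count;
}
```

With 32 work-items reading 4-byte elements and an assumed 128-byte segment, a stride of 1 touches one segment, a stride of 2 touches two, and a stride of 32 touches 32, one per work-item. The fewer segments a group touches, the better its accesses can be combined.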
5. Toolchain/Runtime
read the documents, tutorials, and guides for debugging, profiling, and optimization.
there is no perfect runtime/toolchain
profiling/debugging tools
  it is not always a good idea to debug or optimize on a platform other than the target.
automatic optimization MAY NOT HELP; you still have to think about optimization yourself
tricky forms of computation/memory operations
  MAD operations
  memory access mode
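One possible reading of "MAD operations" is that computation written as a chain of multiply-add steps is easier for a compiler to map onto multiply-add instructions. The sketch below shows that form in plain C using Horner's rule; it is not kernel code, and `horner` is a hypothetical helper. Whether a particular OpenCL compiler actually emits MAD instructions for such a loop is an assumption that should be checked in its output.

```c
#include <assert.h>
#include <stddef.h>

/* Evaluate coeffs[0]*x^(n-1) + ... + coeffs[n-1] with Horner's rule.
 * Each loop iteration is exactly one multiply followed by one add,
 * which is the shape a multiply-add instruction computes. */
double horner(const double *coeffs, size_t n, double x)
{
    assert(coeffs != NULL && n > 0);

    double acc = coeffs[0];
    for (size_t i = 1; i < n; ++i) {
        acc = acc * x + coeffs[i];
    }
    return acc;
}
```

In OpenCL C the same step can be written with the built-in `mad(a, b, c)`, which allows the implementation to trade accuracy for speed, or with `fma(a, b, c)` when a correctly rounded result is required.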
6. Problem/Algorithms
DATA PARALLEL!
multiple stages are not always bad.
  doing everything at once uses more memory resources in one work-item.
vectorization is not always a good idea
use an appropriate work-group size; an unsuitable size may mean:
  a bad memory access pattern and less coalescing
  a lower cache hit rate
  less local memory for each work-item
  possibly less private memory for each work-item
try different forms of implementation
do optimizations manually.
  DO NOT rely on automatic features.
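Choosing the work-group size manually usually involves two small calculations on the host side, sketched below in plain C. The first pads the global size to a multiple of the work-group size, which OpenCL 1.x requires when an explicit local size is passed to `clEnqueueNDRangeKernel`. The second shows how the local memory available per work-item shrinks as the group grows, assuming one group may use all of the device's local memory. Both helper names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Round the problem size up to the next multiple of the work-group size. */
size_t padded_global_size(size_t problem_size, size_t work_group_size)
{
    assert(work_group_size > 0);
    size_t groups = (problem_size + work_group_size - 1) / work_group_size;
    return groups * work_group_size;
}

/* Local memory per work-item, assuming a single work-group has the whole
 * local memory of size local_mem_bytes to itself. */
size_t local_bytes_per_work_item(size_t local_mem_bytes, size_t work_group_size)
{
    assert(work_group_size > 0);
    return local_mem_bytes / work_group_size;
}
```

For a problem of 1000 elements and a work-group size of 64, the padded global size is 1024, so the kernel needs a guard such as comparing `get_global_id(0)` against the real element count. With an assumed 32 KiB of local memory, a group of 64 leaves 512 bytes per work-item, while a group of 256 leaves only 128.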