ºÝºÝߣ

ºÝºÝߣShare a Scribd company logo
A Practical Quicksort Algorithmfor Graphics ProcessorsDaniel Cederman and Philippas TsigasDistributed Computing and SystemsChalmers University of Technology
System ModelThe AlgorithmExperimentsOverview
System ModelThe AlgorithmExperimentsSystem Model
CPU vs. GPUNormal processorsGraphics processorsMulti-coreLarge cacheFew threadsSpeculative executionMany-coreSmall cacheWide and fast memory busThousands of threads hides memory latency
Designed for general purpose computationExtensions to C/C++CUDA
System Model ¨C Global MemoryGlobal Memory
System Model ¨C Shared MemoryGlobal MemoryMultiprocessor 0Multiprocessor 1SharedMemorySharedMemory
System Model ¨C Thread BlocksGlobal MemoryMultiprocessor 0Multiprocessor 1Thread Block 0Thread Block 1Thread Block xThread Block ySharedMemorySharedMemory
Minimal, manual cache32-word SIMD instructionCoalesced memory accessNo block synchronizationExpensive synchronization primitivesChallenges
System ModelThe AlgorithmExperimentsThe Algorithm
Quicksort
QuicksortData Parallel!
QuicksortNOTData Parallel!
QuicksortPhase OneSequences divided
Block synchronization on the CPUThread Block 1Thread Block 2
QuicksortPhase TwoEach block is assigned own sequence
Run entirely on GPUThread Block 1Thread Block 2
Quicksort ¨C Phase One2 2 0 7 4 9 3 3 3 7 8 4 8 7 2 5 0 7 6 3 4 2 9 8
Quicksort ¨C Phase OneAssign Thread Blocks2 2 0 7 4 9 3 3 3 7 8 4 8 7 2 5 0 7 6 3 4 2 9 8
Quicksort ¨C Phase OneUses an auxiliary arrayIn-place too expensive2 2 0 7 4 9 3 3 3 7 8 4 8 7 2 5 0 7 6 3 4 2 9 8
Quicksort ¨C Synchronization2 2 0 7 4 9 3 3 3 7 8 4 8 7 2 5 0 7 6 3 4 2 9 8LeftPointer = 01Thread Block ONEFetch-And-AddonLeftPointer0Thread BlockTWO
Quicksort ¨C Synchronization22 0 7 4 9 3 3 3 7 8 4 8 7 2 5 0 7 6 3 4 2 9 832LeftPointer = 22Thread Block ONEFetch-And-AddonLeftPointer3Thread BlockTWO
Quicksort ¨C Synchronization220 7 4 9 3 3 3 7 8 4 8 7 2 5 0 7 6 3 4 2 9 83223LeftPointer = 45Thread Block ONEFetch-And-AddonLeftPointer4Thread BlockTWO
Quicksort ¨C Synchronization2 2 0 7 4 9 3 3 3 7 8 48 7 2 5 0 7 6 3 4 2 9 8322330239778785769802LeftPointer = 10
Quicksort ¨C Synchronization2 2 0 7 4 9 3 3 3 7 8 4 8 7 2 5 0 7 6 3 4 2 9 8322330239477878576984402Three way partition
Two Pass Partitioning2 2 0 7 4 9 3 3 3 7 8 4 8 7 2 5 0 7 6 3 4 2 9 8<<<<<<<<<<===>>>>>>>>>>>===
Recurse on smallest sequenceMax depth log(n)Explicit stackAlternative sortingBitonic sortSecond Phase
System ModelThe AlgorithmExperimentsThe Algorithm
GPU-Quicksort		Cederman and Tsigas, ESA08GPUSort			Govindaraju et. al., SIGMOD05Radix-Merge		Harris et. al., GPU Gems 3 ¡¯07Global radix		Sengupta et. al., GH07Hybrid			Sintorn and Assarsson, GPGPU07Introsort 			David Musser, Software: 						Practice and ExperienceSet of Algorithms
UniformGaussianBucketSortedZeroStanfordModelsDistributions
UniformGaussianBucketSortedZeroStanfordModelsDistributions
UniformGaussianBucketSortedZeroStanfordModelsDistributions
UniformGaussianBucketSortedZeroStanfordModelsDistributions
UniformGaussianBucketSortedZeroStanfordModelsDistributions
UniformGaussianBucketSortedZeroStanfordModelsDistributionsx1x2x3Px4
First PhaseAverage of maximum and minimum elementSecond PhaseMedian of first, middle and last elementPivot selection
8800GTX16 multiprocessors (128 cores)86.4 GB/s bandwidth8600GTS4 multiprocessors (32 cores)32 GB/s bandwidthCPU-ReferenceAMD Dual-Core Opteron 265 / 1.8 GHzHardware
8800GTX ¨C Uniform Distribution
8800GTX ¨C Uniform Distribution
8800GTX ¨C Uniform ¨C 16M
8800GTX ¨C Uniform ¨C 16MCPU-Reference~4s
8800GTX ¨C Uniform ¨C 16MGPUSort andRadix-Merge~1.5s
8800GTX ¨C Uniform ¨C 16MGPU-Quicksort~380msGlobal Radix~510ms
8800GTX ¨C Sorted Distribution
8800GTX ¨C Sorted Distribution
8800GTX ¨C Sorted ¨C 8M
8800GTX ¨C Sorted ¨C 8MReference is faster than GPUSort and Radix-Merge
8800GTX ¨C Sorted ¨C 8MGPU-Quicksorttakes less than200ms
8800GTX ¨C Zero Distribution
8800GTX ¨C Zero Distribution
8800GTX ¨C Zero ¨C 16M
8800GTX ¨C Zero ¨C 16MBitonic is unaffected by distribution
8800GTX ¨C Zero ¨C 16MThree-way partition
8600GTS ¨C Uniform Distribution
8600GTS ¨C Uniform Distribution
8600GTS ¨C Uniform ¨C 8M
8600GTS ¨C Uniform ¨C 8MReference equal to Radix-Merge
8600GTS ¨C Uniform ¨C 8MQuicksort and hybrid ~4 times faster
8600GTS ¨C Sorted Distribution
8600GTS ¨C Sorted Distribution
8600GTS ¨C Sorted ¨C 8M
8600GTS ¨C Sorted ¨C 8MHybrid needs randomization
8600GTS ¨C Sorted ¨C 8MReference better
Visibility Ordering ¨C 8800GTX
Visibility Ordering ¨C 8800GTXSimilar to uniform distribution
Visibility Ordering ¨C 8800GTXSequence needs to be a power of two in length
Minimal, manual cacheUsed only for prefix sum and bitonic32-word SIMD instructionMain part executes same instructionsCoalesced memory accessAll reads coalescedNo block synchronizationOnly required in first phaseExpensive synchronization primitivesTwo passes amortizes costConclusions

More Related Content

GPU-Quicksort