�ݺ�ߣ

A Practical Quicksort Algorithmfor Graphics ProcessorsDaniel Cederman and Philippas TsigasDistributed Computing and SystemsChalmers University of Technology

System ModelThe AlgorithmExperimentsOverview

System ModelThe AlgorithmExperimentsSystem Model

CPU vs. GPUNormal processorsGraphics processorsMulti-coreLarge cacheFew threadsSpeculative executionMany-coreSmall cacheWide and fast memory busThousands of threads hides memory latency

Designed for general purpose computationExtensions to C/C++CUDA

System Model �C Global MemoryGlobal Memory

System Model �C Shared MemoryGlobal MemoryMultiprocessor 0Multiprocessor 1SharedMemorySharedMemory

System Model �C Thread BlocksGlobal MemoryMultiprocessor 0Multiprocessor 1Thread Block 0Thread Block 1Thread Block xThread Block ySharedMemorySharedMemory

Minimal, manual cache32-word SIMD instructionCoalesced memory accessNo block synchronizationExpensive synchronization primitivesChallenges

System ModelThe AlgorithmExperimentsThe Algorithm

QuicksortPhase OneSequences divided

Block synchronization on the CPUThread Block 1Thread Block 2

QuicksortPhase TwoEach block is assigned own sequence

Run entirely on GPUThread Block 1Thread Block 2

Quicksort �C Phase One2 2 0 7 4 9 3 3 3 7 8 4 8 7 2 5 0 7 6 3 4 2 9 8

Quicksort �C Phase OneAssign Thread Blocks2 2 0 7 4 9 3 3 3 7 8 4 8 7 2 5 0 7 6 3 4 2 9 8

Quicksort �C Phase OneUses an auxiliary arrayIn-place too expensive2 2 0 7 4 9 3 3 3 7 8 4 8 7 2 5 0 7 6 3 4 2 9 8

Quicksort �C Synchronization2 2 0 7 4 9 3 3 3 7 8 4 8 7 2 5 0 7 6 3 4 2 9 8LeftPointer = 01Thread Block ONEFetch-And-AddonLeftPointer0Thread BlockTWO

Quicksort �C Synchronization22 0 7 4 9 3 3 3 7 8 4 8 7 2 5 0 7 6 3 4 2 9 832LeftPointer = 22Thread Block ONEFetch-And-AddonLeftPointer3Thread BlockTWO

Quicksort �C Synchronization220 7 4 9 3 3 3 7 8 4 8 7 2 5 0 7 6 3 4 2 9 83223LeftPointer = 45Thread Block ONEFetch-And-AddonLeftPointer4Thread BlockTWO

Quicksort �C Synchronization2 2 0 7 4 9 3 3 3 7 8 48 7 2 5 0 7 6 3 4 2 9 8322330239778785769802LeftPointer = 10

Quicksort �C Synchronization2 2 0 7 4 9 3 3 3 7 8 4 8 7 2 5 0 7 6 3 4 2 9 8322330239477878576984402Three way partition

Two Pass Partitioning2 2 0 7 4 9 3 3 3 7 8 4 8 7 2 5 0 7 6 3 4 2 9 8<<<<<<<<<<===>>>>>>>>>>>===

Recurse on smallest sequenceMax depth log(n)Explicit stackAlternative sortingBitonic sortSecond Phase

GPU-Quicksort Cederman and Tsigas, ESA08GPUSort Govindaraju et. al., SIGMOD05Radix-Merge Harris et. al., GPU Gems 3 ��07Global radix Sengupta et. al., GH07Hybrid Sintorn and Assarsson, GPGPU07Introsort David Musser, Software: Practice and ExperienceSet of Algorithms

UniformGaussianBucketSortedZeroStanfordModelsDistributions

UniformGaussianBucketSortedZeroStanfordModelsDistributionsx1x2x3Px4

First PhaseAverage of maximum and minimum elementSecond PhaseMedian of first, middle and last elementPivot selection

8800GTX16 multiprocessors (128 cores)86.4 GB/s bandwidth8600GTS4 multiprocessors (32 cores)32 GB/s bandwidthCPU-ReferenceAMD Dual-Core Opteron 265 / 1.8 GHzHardware

8800GTX �C Uniform Distribution

8800GTX �C Uniform �C 16MCPU-Reference~4s

8800GTX �C Uniform �C 16MGPUSort andRadix-Merge~1.5s

8800GTX �C Uniform �C 16MGPU-Quicksort~380msGlobal Radix~510ms

8800GTX �C Sorted Distribution

8800GTX �C Sorted �C 8MReference is faster than GPUSort and Radix-Merge

8800GTX �C Sorted �C 8MGPU-Quicksorttakes less than200ms

8800GTX �C Zero Distribution

8800GTX �C Zero �C 16MBitonic is unaffected by distribution

8800GTX �C Zero �C 16MThree-way partition

8600GTS �C Uniform Distribution

8600GTS �C Uniform �C 8MReference equal to Radix-Merge

8600GTS �C Uniform �C 8MQuicksort and hybrid ~4 times faster

8600GTS �C Sorted Distribution

8600GTS �C Sorted �C 8MHybrid needs randomization

8600GTS �C Sorted �C 8MReference better

Visibility Ordering �C 8800GTX

Visibility Ordering �C 8800GTXSimilar to uniform distribution

Visibility Ordering �C 8800GTXSequence needs to be a power of two in length

Minimal, manual cacheUsed only for prefix sum and bitonic32-word SIMD instructionMain part executes same instructionsCoalesced memory accessAll reads coalescedNo block synchronizationOnly required in first phaseExpensive synchronization primitivesTwo passes amortizes costConclusions

�ݺ�ߣ

GPU-Quicksort

More Related Content

GPU-Quicksort