Code GPU with CUDA

Device code optimization principle




Created by Marina Kolpakova ( cuda.geek ) for Itseez


Outline

  • Optimization principle
  • Performance limiters
  • Little’s law
  • TLP & ILP

Device code optimization principle

The nature of the SIMT architecture makes the GPU a high-latency machine throughout, so

Hiding latency is the only GPU-specific optimization principle

  • Typical latencies for the Kepler generation
    • register writeback: ~10 cycles
    • L1: ~34 cycles
    • Texture L1: ~96 cycles
    • L2: ~160 cycles
    • Global memory: ~350 cycles

Performance limiters

Optimize for GPU ⇒ Optimize for latency

Factors that prevent latency hiding:

  • Insufficient parallelism
  • Inefficient memory accesses
  • Inefficient control flow

Throughput & Latency

Throughput
is the number of operations performed in one cycle
Latency
is the number of cycles the pipeline stalls before a dependent operation can issue
Inventory
is the number of warps in flight, i.e. in the execution stage of the pipeline

Little’s law

L = λ × W

Inventory (L) = Throughput (λ) × Latency (W)

Example: a GPU that issues 8 operations per clock with an 18-cycle latency needs an inventory of 8 × 18 = 144 operations in flight to keep the pipeline full

Little’s law: FFMA example

  • Fermi GF100
    • Throughput: 32 operations per clock (1 warp)
    • Latency: ~18 clocks
    • Maximum resident warps per SM: 24
    • Inventory: 1 × 18 = 18 warps in flight
  • Kepler GK110
    • Throughput: 128 operations per clock without ILP (4 warps)
    • Latency: ~10 clocks
    • Maximum resident warps per SM: 64
    • Inventory: 4 × 10 = 40 warps in flight
  • Maxwell GM204
    • Throughput: 128 operations per clock (4 warps)
    • Latency: ~6 clocks
    • Maximum resident warps per SM: 64
    • Inventory: 4 × 6 = 24 warps in flight

TLP & ILP

  • Thread Level Parallelism
    • enabling factors:
      • a sufficient number of warps in flight per SM
    • limiting factors:
      • bad launch configuration
      • resource-consuming kernels
      • poorly parallelized code
  • Instruction Level Parallelism
    • enabling factors:
      • independent instructions per warp
      • dual issue capabilities
    • limiting factors:
      • structural hazards
      • data hazards

Improving TLP

Occupancy
is the actual number of warps running concurrently on a multiprocessor divided by the maximum number of warps the hardware can run concurrently
Improve occupancy to achieve better TLP

  • Modern GPUs can keep up to 64 resident warps belonging to 16 (Kepler) / 32 (Maxwell) blocks, BUT you need resources for them: registers and shared memory (smem)
  • Kepler has 64K 32-bit registers per SM and a 32-lane-wide warp

65536 registers / 64 warps / 32 warp_size = 32 registers / thread
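
If a kernel consumes more resources than this budget, fewer warps fit and occupancy drops. As a sanity check, the CUDA occupancy API (available since CUDA 6.5) reports how many blocks of a given kernel fit on one SM. A minimal sketch follows; the kernel scale, block size 256, and device 0 are illustrative assumptions:

    #include <cuda_runtime.h>
    #include <cstdio>

    __global__ void scale(const float* in, float* out)
    {
      const int tid = blockDim.x * blockIdx.x + threadIdx.x;
      out[tid] = 2.f * in[tid];
    }

    int main()
    {
      const int blockSize = 256;  // illustrative launch configuration
      int blocksPerSm = 0;
      // Resident blocks of `scale` per SM, given its register usage
      // and 0 bytes of dynamic shared memory
      cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSm, scale, blockSize, 0);

      cudaDeviceProp prop;
      cudaGetDeviceProperties(&prop, 0);
      const int activeWarps = blocksPerSm * blockSize / prop.warpSize;
      const int maxWarps    = prop.maxThreadsPerMultiProcessor / prop.warpSize;
      printf("occupancy: %d / %d warps (%.2f)\n",
             activeWarps, maxWarps, (float)activeWarps / maxWarps);
      return 0;
    }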

Improving ILP

  • Kernel unrolling: process more elements per thread, since operations on different elements are independent (see the launch sketch after this list)
    
    __global__ void unrolled(const float* in, float* out)
    {
      const int tid = blockDim.x * blockIdx.x + threadIdx.x;
      const int totalThreads = blockDim.x * gridDim.x;
      // The two elements are independent, so the second process() call
      // can issue while the first is still in the pipeline.
      out[tid]                = process(in[tid]);
      out[tid + totalThreads] = process(in[tid + totalThreads]);
    }
    
    
  • The device code compiler is reasonably good at instruction reordering
  • Loop unrolling in device code increases the number of independent operations:
    
    #pragma unroll CONST_EXPRESSION
    for( int i = 0; i < N_ITERATIONS; i++ ) { /* ... */ }
    
    
  • Other techniques used to increase ILP on CPUs apply here as well
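
A possible launch of the unrolled kernel above. The names d_in, d_out, and n are assumptions, and n is taken to be a multiple of 2 × blockSize (tail handling omitted):

    // Each thread processes two elements, so the grid covers n / 2 threads.
    const int blockSize = 256;                  // illustrative choice
    const int gridSize  = n / (2 * blockSize);  // assumes n % (2 * blockSize) == 0
    unrolled<<<gridSize, blockSize>>>(d_in, d_out);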

ILP on modern GPUs

ILP is a must-have for older architectures, and it still helps to hide pipeline latencies on modern GPUs

  • Maxwell: 4 warp schedulers, dual-issue each. 128 compute cores process up to 4 warps each clock. Compute core utilization: 1.0
  • Kepler: 4 warp schedulers, dual-issue each. 192 compute cores process up to 6 warps each clock. If there is no ILP, only 128 of 192 cores are used. Compute core utilization: 2/3
  • Fermi (sm_21): 2 warp schedulers, dual-issue each. 48 compute cores process 3 warps every 2 clocks. If there is no ILP, only 32 of 48 cores are used. Compute core utilization: 2/3
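
To make the independent-instructions idea concrete, here is a hypothetical dot-product fragment with two accumulator chains: while one chain's FFMA is still in the pipeline, the scheduler can issue the other's. Names and tail handling are simplifying assumptions:

    __global__ void dotTwoChains(const float* a, const float* b, float* out, int n)
    {
      const int tid    = blockDim.x * blockIdx.x + threadIdx.x;
      const int stride = blockDim.x * gridDim.x;

      float acc0 = 0.f, acc1 = 0.f;  // two independent dependency chains
      for (int i = tid; i + stride < n; i += 2 * stride)
      {
        acc0 += a[i] * b[i];                    // chain 0
        acc1 += a[i + stride] * b[i + stride];  // chain 1: independent of chain 0
      }
      out[tid] = acc0 + acc1;  // the chains join only once, at the end
    }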

Final words

  • GPU optimization principles:
    • Principle #1: hide latency
    • Principle #2: see principle #1

THE END




BY cuda.geek / 2013–2015