Code GPU with CUDA
Device code optimization principle
Created by Marina Kolpakova (cuda.geek) for Itseez
Outline
- Optimization principle
- Performance limiters
- Little’s law
- TLP & ILP
Device code optimization principle
The specifics of the SIMT architecture make the GPU inherently latency-bound, so
Hiding latency is the only GPU-specific optimization principle
- Typical latencies for Kepler generation
- register writeback: ~10 cycles
- L1: ~34 cycles
- Texture L1: ~96 cycles
- L2: ~160 cycles
- Global memory: ~350 cycles
Performance limiters
Optimize for GPU ≃ Optimize for latency
Factors that prevent latency hiding:
- Insufficient parallelism
- Inefficient memory accesses
- Inefficient control flow
Throughput & Latency
- Throughput
- is the number of operations performed in one cycle
- Latency
- is the number of cycles the pipeline stalls before a dependent operation can start
- Inventory
- is the number of warps in flight, i.e. in the execution stage of the pipeline
Little’s law
L = λ × W
Inventory (L) = Throughput (λ) × Latency (W)
Example: a GPU with 8 operations per clock and 18-clock latency needs 8 × 18 = 144 operations in flight to hide the latency
Little’s law: FFMA example
- Fermi GF100
- Throughput: 32 operations per clock (1 warp)
- Latency: ~18 clocks
- Maximum resident warps per SM: 48
- Inventory: 1 * 18 = 18 warps in flight
- Kepler GK110
- Throughput: 128 operations per clock without ILP (4 warps), up to 192 with ILP
- Latency: ~10 clocks
- Maximum resident warps per SM: 64
- Inventory: 4 * 10 = 40 warps in flight
- Maxwell GM204
- Throughput: 128 operations per clock (4 warps)
- Latency: ~6 clocks
- Maximum resident warps per SM: 64
- Inventory: 4 * 6 = 24 warps in flight
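The inventory figures above are just Little's law applied per SM. A minimal host-side sketch of the same arithmetic, using the throughput and latency numbers quoted on this slide (not values queried from a device):
#include <cstdio>

// Little's law: warps in flight (L) = issue rate in warps/clock (lambda) * latency in clocks (W).
static int warpsInFlight(int warpsPerClock, int latencyClocks)
{
    return warpsPerClock * latencyClocks;
}

int main()
{
    std::printf("Fermi   GF100: %d warps\n", warpsInFlight(1, 18)); // 18
    std::printf("Kepler  GK110: %d warps\n", warpsInFlight(4, 10)); // 40
    std::printf("Maxwell GM204: %d warps\n", warpsInFlight(4,  6)); // 24
    return 0;
}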
TLP & ILP
- Thread Level Parallelism
- enabling factors:
- sufficient number of warps in flight per SM
- limiting factors:
- bad launch configuration
- resource consuming kernels
- poorly parallelized code
- Instruction Level Parallelism
- enabling factors:
- independent instructions per warp
- dual issue capabilities
- limiting factors:
- structural hazards
- data hazards
Improving TLP
- Occupancy
- is the actual number of warps running concurrently on a multiprocessor divided by the maximum number of warps that can run concurrently on the hardware
- Improve occupancy to achieve better TLP
- Modern GPUs can keep up to 64 resident warps belonging to 16 (Kepler) / 32 (Maxwell) blocks, BUT you need resources for them: registers, smem
- Kepler has 64 K × 32-bit registers per SM and a 32-lane-wide warp
65536 registers / 64 warps / 32 threads_per_warp = 32 registers / thread
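One way to stay within this per-thread register budget is the __launch_bounds__ qualifier, which tells the compiler the maximum block size and, optionally, the minimum number of resident blocks to allocate registers for (the nvcc flag --maxrregcount is the whole-compilation-unit alternative). A minimal sketch; the 256/4 bounds and the kernel body are illustrative assumptions:
// Ask the compiler to fit at least 4 resident blocks of up to 256 threads per SM,
// which constrains how many registers it may assign to each thread.
__global__ void __launch_bounds__(256, 4) scale(const float* in, float* out)
{
    const int tid = blockDim.x * blockIdx.x + threadIdx.x;
    out[tid] = 2.0f * in[tid]; // placeholder computation
}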
Improving ILP
- Kernel unrolling: process more elements per thread, because operations on different elements are independent
__global__ void unrolled(const float* in, float* out)
{
    // Each thread handles two elements that are a full grid-stride apart,
    // so the two process() calls are independent and can overlap in the pipeline.
    const int tid = blockDim.x * blockIdx.x + threadIdx.x;
    const int totalThreads = blockDim.x * gridDim.x;
    out[tid] = process(in[tid]);
    out[tid + totalThreads] = process(in[tid + totalThreads]);
}
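A possible launch configuration for the kernel above, assuming the element count N is a multiple of twice the block size (d_in, d_out and N are hypothetical names):
const int threads = 256;
const int blocks  = N / (2 * threads);   // two elements per thread
unrolled<<<blocks, threads>>>(d_in, d_out);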
- The device code compiler is fairly good at instruction reordering
- Loop unrolling in device code increases the number of independent operations
#pragma unroll CONST_EXPRESSION
for( int i = 0; i < N_ITERATIONS; i++ ) { /* ... */ }
- Other techniques used to increase ILP on CPUs also apply (see the sketch below)
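One such CPU-style technique is keeping several independent accumulators in a reduction loop, so consecutive adds form separate dependency chains. A minimal sketch; the helper name sum4 and the requirement that n be a multiple of 4 are illustrative assumptions:
__device__ float sum4(const float* data, int n)
{
    // Four independent partial sums -> four independent FADD dependency chains.
    float s0 = 0.f, s1 = 0.f, s2 = 0.f, s3 = 0.f;
    for (int i = 0; i < n; i += 4)   // assumes n % 4 == 0
    {
        s0 += data[i + 0];
        s1 += data[i + 1];
        s2 += data[i + 2];
        s3 += data[i + 3];
    }
    return (s0 + s1) + (s2 + s3);
}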
ILP on modern GPUs
ILP is a must-have for older architectures, but it still helps to hide pipeline latencies on modern GPUs
- Maxwell: 4 warp schedulers, dual-issue each. 128 compute cores process up to 4 warps each clock. Compute core utilization: 1.0
- Kepler: 4 warp schedulers, dual-issue each. 192 compute cores process up to 6 warps each clock.
If there is no ILP, only 128 of 192 cores are used. Compute core utilization: ≈ 0.67
- Fermi (sm_21): 2 warp schedulers, dual-issue each. 48 compute cores process 3 warps every 2 clocks.
If there is no ILP, only 32 of 48 cores are used. Compute core utilization: ≈ 0.67
Final words
- GPU optimization principles:
- Principle #1: hide latency
- Principle #2: see principle #1