A kernel is executed by many threads in parallel.
Threads are grouped into blocks.
Each thread has a thread ID within its block.
Thread blocks form an execution grid.
Each block has a block ID within the grid.
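Thread and block IDs are exposed inside a kernel through the built-in variables threadIdx and blockIdx, and both may be multi-dimensional. A minimal sketch of combining them into a global 2D index (the kernel name and row-major layout are illustrative, not from the original):

__global__ void scale2d(float *data, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x; // global column index
    int y = blockIdx.y * blockDim.y + threadIdx.y; // global row index
    if (x < width && y < height)                   // guard partial blocks at the edges
        data[y * width + x] *= 2.0f;               // row-major addressing
}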
A single thread executes on a single CUDA core.
One kernel, or multiple concurrent kernels on sm_20 and later devices, execute on the device.
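Concurrent kernels are expressed through CUDA streams. A minimal sketch, assuming hypothetical kernels kernel_a and kernel_b and pre-allocated device buffers:

__global__ void kernel_a(float *p) { p[threadIdx.x] += 1.0f; }
__global__ void kernel_b(float *p) { p[threadIdx.x] *= 2.0f; }

void launch_concurrent(float *d_a, float *d_b)
{
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);
    kernel_a<<<1, 256, 0, s1>>>(d_a); // the stream is the 4th launch parameter
    kernel_b<<<1, 256, 0, s2>>>(d_b); // may overlap with kernel_a on sm_20+
    cudaDeviceSynchronize();          // wait for both streams to finish
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
}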
Each thread has its own registers.
Each thread has its own local memory.
Threads of one block share on-chip shared memory:
__shared__ float buffer[CTA_SIZE];
All threads of the grid can access global and constant memory.
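A minimal sketch tying these memory spaces together (the kernel name, the CTA_SIZE value of 256, and the constant symbol scale are assumptions for illustration):

#define CTA_SIZE 256

__constant__ float scale; // constant memory, readable by the whole grid

__global__ void memory_spaces(const float *in, float *out)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x; // scalar: held in a register
    float scratch[32];                               // per-thread array: may be placed in local memory
    __shared__ float buffer[CTA_SIZE];               // shared by all threads of one block

    buffer[threadIdx.x] = in[tid] * scale;           // read global and constant memory
    __syncthreads();                                 // block-wide barrier before reading peers' values
    scratch[0] = buffer[CTA_SIZE - 1 - threadIdx.x]; // read a value written by another thread
    out[tid] = scratch[0];                           // write result to global memory
}

On the host, scale would be set with cudaMemcpyToSymbol(scale, &value, sizeof(float)) before the launch.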
__global__ void kernel(float *in, float *out)
{
    // Global thread index across the whole grid
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    out[tid] = in[tid]; // each thread copies one element
}
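The launch configuration below computes the grid as size / 256, so size must be a multiple of the block size. A common variant (a sketch, not part of the original) passes the element count and guards the access, which works for any size:

__global__ void kernel_guarded(float *in, float *out, int size)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < size) // threads past the end of the array do nothing
        out[tid] = in[tid];
}

It would be launched with a rounded-up grid: kernel_guarded<<<(size + 255) / 256, 256>>>(in, out, size).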
void execute_kernel(const float* host_in, float* host_out, int size)
{
    float* device_in, * device_out;
    cudaMalloc((void**)&device_in, size * sizeof(float));
    cudaMalloc((void**)&device_out, size * sizeof(float));
    // 1. Upload data into device memory
    cudaMemcpy(device_in, host_in, size * sizeof(float), cudaMemcpyHostToDevice);
    // 2. Configure kernel launch (assumes size is a multiple of 256)
    dim3 block(256);
    dim3 grid(size / 256);
    // 3. Execute kernel
    kernel<<<grid, block>>>(device_in, device_out);
    // 4. Wait till completion (cudaThreadSynchronize() is deprecated)
    cudaDeviceSynchronize();
    // 5. Download results into host memory
    cudaMemcpy(host_out, device_out, size * sizeof(float), cudaMemcpyDeviceToHost);
    // 6. Free device memory
    cudaFree(device_in);
    cudaFree(device_out);
}
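A minimal host-side usage sketch (the array size is chosen as a multiple of the fixed block size of 256):

#include <cstdlib>

int main()
{
    const int N = 1024; // multiple of the block size (256)
    float *h_in  = (float*)malloc(N * sizeof(float));
    float *h_out = (float*)malloc(N * sizeof(float));
    for (int i = 0; i < N; ++i)
        h_in[i] = (float)i;

    execute_kernel(h_in, h_out, N); // copies h_in to h_out through the GPU

    free(h_in);
    free(h_out);
    return 0;
}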