Code GPU with CUDA

CUDA

Introduction




Created by Marina Kolpakova (cuda.geek) for Itseez


Outline

  • Terminology
  • Definition
  • Programming model
  • Execution model
  • Memory models
  • CUDA kernel

Out of scope

  • CUDA API overview

Terminology

Device
CUDA-capable NVIDIA GPU
Device code
code executed on the device
Host
x86/x64/arm CPU
Host code
code executed on the host
Kernel
concrete device function

CUDA

CUDA stands for Compute Unified Device Architecture.
  • CUDA includes:
    1. Capable GPU hardware and driver
    2. Device ISA, GPU assembler, Compiler
    3. C++-based high-level language, CUDA Runtime
  • CUDA defines:
    • programming model
    • execution model
    • memory model

Programming model

Kernel is executed by many threads


Programming model

Threads are grouped into blocks


Each thread has a thread ID

Programming model

Thread blocks form an execution grid


Each block has a block ID
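A minimal sketch (not from the original slides; the kernel name is illustrative): the built-in variables threadIdx, blockIdx and blockDim give each thread its position, and are commonly combined into a single global index.

    __global__ void where_am_i(int* out)
    {
        // global index = block offset within the grid + thread offset within the block
        int gid = blockIdx.x * blockDim.x + threadIdx.x;
        out[gid] = gid;   // each thread writes its own global ID
    }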

Execution (HW mapping) model

Single thread is executed on core


Execution (HW mapping) model

  • Each block is executed by one SM and does not migrate
  • Number of concurrent blocks that can reside on SM depends on available resources
(figure: a block resident on an SM)

Execution (HW mapping) model

  • Threads in a block can cooperate via shared memory and barrier synchronization (see the sketch below)
  • There is no hardware support for cooperation between threads from different blocks
(figure: a block resident on an SM)
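A minimal sketch of such cooperation (not from the original slides; kernel name and the block size of 256 are assumptions): threads of one block stage data in shared memory, synchronize with __syncthreads(), and compute a per-block sum.

    __global__ void block_sum(const float* in, float* out)
    {
        __shared__ float buf[256];                // visible to all threads of the block

        int gid = blockIdx.x * blockDim.x + threadIdx.x;
        buf[threadIdx.x] = in[gid];
        __syncthreads();                          // barrier: all loads are finished

        // tree reduction inside the block (blockDim.x assumed to be 256)
        for (int s = blockDim.x / 2; s > 0; s >>= 1)
        {
            if (threadIdx.x < s)
                buf[threadIdx.x] += buf[threadIdx.x + s];
            __syncthreads();
        }

        if (threadIdx.x == 0)
            out[blockIdx.x] = buf[0];             // one partial sum per block
    }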

Execution (HW mapping) model

One kernel (or, on sm_20+, multiple concurrent kernels) is executed on the device

(figure: grid mapped onto the GPU)

Memory model

Thread has its own registers


Memory model

Thread has its own local memory


Memory model

  • Block has shared memory
  • Pointer to shared memory is valid while block is resident

  __shared__ float buffer[CTA_SIZE];
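
Shared memory can also be sized at launch time. The sketch below (an illustrative example, not from the original slides) declares it with extern __shared__ and passes the byte size as the third launch parameter.

    __global__ void reverse_in_block(float* data)
    {
        extern __shared__ float dyn_buf[];        // size set by the launch configuration

        int gid = blockIdx.x * blockDim.x + threadIdx.x;
        dyn_buf[threadIdx.x] = data[gid];
        __syncthreads();                          // make all writes visible to the block

        data[gid] = dyn_buf[blockDim.x - 1 - threadIdx.x];
    }

    // launch: reverse_in_block<<<grid, block, block.x * sizeof(float)>>>(device_data);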
            

Memory model

Grid is able to access global and constant memory

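An illustrative sketch (names are assumptions, not from the original slides): constant memory is declared with __constant__, filled from the host with cudaMemcpyToSymbol, and read by every thread of the grid alongside global memory.

    __constant__ float coeff;                     // read-only for the grid, cached on chip

    __global__ void scale(const float* in, float* out)
    {
        int gid = blockIdx.x * blockDim.x + threadIdx.x;
        out[gid] = coeff * in[gid];               // in/out live in global memory
    }

    // host side: cudaMemcpyToSymbol(coeff, &host_coeff, sizeof(float));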

Basic CUDA kernel

  • Work for GPU threads is represented as a kernel
  • A kernel describes the task of a single thread (scalar notation)
  • Every thread in a particular grid executes the same kernel
  • Threads use their threadIdx and blockIdx to dispatch work
  • A kernel function is marked with the __global__ keyword
    
    __global__ void kernel(float *in, float *out)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        out[tid] = in[tid];
    }
    
    
  • Common kernel structure (see the sketch below):
    1. Retrieving the position in the grid (widely named tid)
    2. Loading data from GPU’s memory
    3. Performing compute work
    4. Writing back the result into GPU’s memory
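
A sketch of this structure (a hypothetical saxpy-like kernel, not from the original slides), with the four steps marked:

    __global__ void saxpy(float a, const float* x, const float* y, float* out)
    {
        // 1. position in grid
        int tid = blockIdx.x * blockDim.x + threadIdx.x;

        // 2. load data from GPU’s memory
        float xv = x[tid];
        float yv = y[tid];

        // 3. compute
        float result = a * xv + yv;

        // 4. write back the result into GPU’s memory
        out[tid] = result;
    }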

Kernel execution

void execute_kernel(const float* host_in, float* host_out, int size)
{
  float* device_in, * device_out;
  cudaMalloc((void**)&device_in,  size*sizeof(float));
  cudaMalloc((void**)&device_out, size*sizeof(float));

  // 1. Upload data into device memory
  cudaMemcpy(device_in, host_in, size*sizeof(float), cudaMemcpyHostToDevice);

  // 2. Configure kernel launch (assumes size is a multiple of 256)
  dim3 block(256);
  dim3 grid(size / 256);

  // 3. Execute kernel
  kernel<<<grid, block>>>(device_in, device_out);

  // 4. Wait till completion
  cudaDeviceSynchronize();

  // 5. Download results into host memory
  cudaMemcpy(host_out, device_out, size*sizeof(float), cudaMemcpyDeviceToHost);

  // 6. Release device memory
  cudaFree(device_in);
  cudaFree(device_out);
}
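
A possible host-side use of the function above (a sketch; the buffer size and error handling are assumptions, not part of the original slides):

#include <cstdio>
#include <vector>

int main()
{
  const int size = 1024;                          // assumed multiple of the block size (256)
  std::vector<float> in(size, 1.0f), out(size, 0.0f);

  execute_kernel(in.data(), out.data(), size);

  // report any error left by the launch or the memory copies
  cudaError_t err = cudaGetLastError();
  if (err != cudaSuccess)
    printf("CUDA error: %s\n", cudaGetErrorString(err));

  return 0;
}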

Final words

  • CUDA is a set of capable GPU hardware, a driver, a GPU ISA, a GPU assembler, a compiler, a C++-based high-level language, and a runtime that together enable programming of NVIDIA GPUs
  • A CUDA function (kernel) is launched on a grid of blocks
  • A kernel runs on unified programmable cores
  • A kernel can access registers and local memory, share data through shared memory inside a block of threads, and access RAM through global, texture, and constant memory

THE END




BY cuda.geek / 2013–2015