CUDA - Keywords and Thread Organization


In this chapter, we will discuss the keywords and thread organisation in CUDA.

The following keywords are used while declaring a CUDA function. As an example, while declaring the kernel, we have to use the __global__ keyword. This provides a hint to the compiler that this function will be executed on the device and can be called from the host.

__device__ float function() GPU (device) CPU (host)
__global__ void function() CPU (host) GPU (device)
__host__float function() GPU (device) GPU (device)

A Sample Cuda C Code

In this section, we will see a sample CUDA C Code.

void vecAdd(float* A, float* B, float* C,int N) {
   int size=N*sizeOf(float);
   float *d_A,*d_B,*d_C;

This helps in allocating memory on the device for storing vector A. When the function returns, we get a pointer to the starting location of the memory.


Copy data from host to device. The host memory contains the contents of vector A. Now that we have allocated space on the device for storing vector A, we transfer the contents of vector A to the device. At this point, the GPU memory has vector A stored, ready to be operated upon.


//Similar to A

This helps in allocating the memory on the GPU to store the result vector C. We will cover the Kernel launch statement later.


After all the threads have finished executing, the result is stored in d_C (d stands for device). The host copies the result back to vector C from d_C.


This helps to free up the memory allocated on the device using cudaMalloc().

This is how the above program works −

  • The above program adds the corresponding elements of two vectors X and Y, and stores the final result in a vector Z.

  • The device memory is allocated for storing the input vectors (X and Y) and the result vector (Z).

  • cudaMalloc() − This method is used to allocate memory on the host. in two parameters, the address of a pointer to the allocated object, and the size of the allocated object in terms of bytes.

  • cudaFree() − This method is used to release objects from device memory. It takes in the pointer to the freed object as parameter.

  • cudaMemcpy() − This API function is used for memory data transfer. It requires four parameters as input: Pointer to the destination, pointer to the source, amount of data to be copied (in bytes), and the direction of transfer.

CUDA Thread Organization

Threads in a grid execute the same kernel function. They have specific coordinates to distinguish themselves from each other and identify the relevant portion of data to process. In CUDA, they are organized in a two-level hierarchy: a grid comprises blocks, and each block comprises threads.

For all threads in a block, the block index is the same. The block index parameter can be accessed using the blockIdx variable inside a kernel. Each thread also has an associated index, and it can be accessed by using threadIdx variable inside the kernel. Note that blockIdx and threadIdx are built-in CUDA variables that are only accessible from inside the kernel.

In a similar fashion, CUDA also has gridDim and blockDim variables that are also built-in. They return the dimensions of the grid and block along a particular axis respectively. As an example, blockDim. x can be used to find how many threads a particular block has along the x axis.

CUDA Thread Organization

Let us consider an example to understand the concept explained above. Consider an image, which is 76 pixels along the x axis, and 62 pixels along the y axis. Our aim is to convert the image from sRGB to grayscale. We can calculate the total number of pixels by multiplying the number of pixels along the x axis with the total number along the y axis that comes out to be 4712 pixels. Since we are mapping each thread with each pixel, we need a minimum of 4712 pixels. Let us take number of threads in each direction to be a multiple of 4. So, along the x axis, we will need at least 80 threads, and along the y axis, we will need at least 64 threads to process the complete image. We will ensure that the extra threads are not assigned any work.

Thus, we are launching 5120 threads to process a 4712 pixels image. You may ask, why the extra threads? The answer to this question is that keeping the dimensions as multiple of 4 has many benefits that largely offsets any disadvantages that result from launching extra threads. This is explained in a later section).

Now, we have to divide these 5120 threads into grids and blocks. Let each block have 256 threads. If so, then one possibility that of the dimensions each block are: (16,16,1). This means, there are 16 threads in the x direction, 16 in the y direction, and 1 in the z direction. We will be needing 5 blocks in the x direction (since there are 80 threads in total along the x axis), and 4 blocks in y direction (64 threads along the y axis in total), and 1 block in z direction. So, in total, we need 20 blocks. In a nutshell, the grid dimensions are (5,4,1) and the block dimensions are (16,16,1). The programmer needs to specify these values in the program. This is shown in the figure above.

  • dim3 dimBlock(5,4,1) − To specify the grid dimensions

  • dim3 dimGrid(ceil(n/16.0),ceil(m/16.0),1) − To specify the block dimensions.

  • kernelName<<<dimGrid,dimBlock>>>(parameter1, parameter2, ...) − Launch the actual kernel.

n is the number of pixels in the x direction, and m is the number of pixels in the y direction. ‘ceil’ is the regular ceiling function. We use it because we never want to end up with less number of blocks than required. dim3 is a data structure, just like an int or a float. dimBlock and dimGrid are variables names. The third statement is the kernel launch statement. ‘kernelName’ is the name of the kernel function, to which we pass the parameters: parameter1, parameter2, and so on. <<<>>> contain the dimensions of the grid and the block.