Category: C++

Cuda tutorial Introduction 1




Installation

The easiest way to insatll CUDA is to use the standard installation form a package

sudo apt install nvidia-cuda-toolkit
sudo apt-get install libglu1-mesa libxi-dev libxmu-dev libglu1-mesa-dev

On Ubuntu 18.04 it will install the Cuda version 9.

Vocabulary

A small definition of concept is necessary to understand the CUDA architecture.
Host An another word to say the CPU of the running machine.
Device: The GPU itself
Kernel Function or portion of code that runs on a grid
GridSet of blocks
BlocksSet of threads
ThreadsSmallest unit of execution
Ouch.
Surprisingly the smallest unit of a GPU is called a thread (so a different meaning from Linux vocabulary) but the power of a GPU os it can start all threads in one or two instruction and synchronization is done directly on the card, whithout any intervention from the developer.

Example

To use this new concept, the Nvidia add new directive or keywords in the C++ syntax. To use them you must use the Nvidia compiler nvcc.

  • CUDA C keyword __global__ indicates that a function runs on the device called from host code (value by default to execute a kernel on a GPU)
  • CUDA C keyword __device__ indicates that a function runs on the device called from a device code
  • CUDA C keyword __host__ indicates that a function runs on the host (the main CPU)

After all theses consideration, let’s start with a small example.

#include "performancetiming.hpp"
#include 
#include 

// function to add the elements of two arrays
void add(int n, float *x, float *y)
{
  for (int i = 0; i < n; i++)
      y[i] = x[i] + y[i];
}

int main(void)
{
  int N = 1<<24; // 16M elements

  float *x = new float[N];
  float *y = new float[N];

  // initialize x and y arrays on the host
  for (int i = 0; i < N; i++) {
    x[i] = 1.0f;
    y[i] = 2.0f;
  }

  // Run kernel on 16M elements on the CPU

  add(N, x, y);


  float maxError = 0.0f;
  for (int i = 0; i < N; i++)
    maxError = fmax(maxError, fabs(y[i]-3.0f));
  std::cout << "Max error: " << maxError << std::endl;

  // Free memory
  delete [] x;
  delete [] y;

  return 0;
}

This code is adding number one by one all elements of two arrays and store it in the second. The main processing part is done in add function. Each result is independent so we can

Applications and examples for Timing library

Hi,

as you’ve already know I’ve developed a small library to asset performance of code. The aim of this library is to objectively report how long a function last. It would help you choose an implementation of an algorithm for example.
To illustrate that I will take fibonnaci function as an example. I know at least two implementations of this function, one functional and an another iterative. The strong point of the functional version is that it is easier to implement and to understand as contrary to the iterative. But for large number I suspect the functional version to be very slow but I can not say to what point.

int __inline__ fibonacci(int n)
{
      if (n < 3)
            return 1;
      return fibonacci(n - 1) + fibonacci(n - 2);
}

int __inline__ fibonacciOptim(int n)
{
      int first = 0, second = 1, next, c;
      for (c = 0; c < n; c++)
      {
            if (c <= 1)
                  next = c;
            else
            {
                  next = first + second;
                  first = second;
                  second = next;
            }
      }
      return next;
}

It is now time to use the Timing Library
First clone the repository

git clone https://github.com/fflayol/timingperf.git
cd timingperf

Then create a main to set values of the library and add the different version of your code:

#include "performancetiming.hpp"
int __inline__ fibonacci(int n)
{
      if (n < 3)
            return 1;
      return fibonacci(n - 1) + fibonacci(n - 2);
}

int __inline__ fibonacciOptim(int n)
{
      int first = 0, second = 1, next, c;
      for (c = 0; c < n; c++)
      {
            if (c <= 1)
                  next = c;
            else
            {
                  next = first + second;
                  first = second;
                  second = next;
            }
      }
      return next;
}

int main(int argc, char **argv)
{
      timing::addFunction("FIB", fibonacci);
      timing::addFunction("FIB-OPTIM", fibonacciOptim);
      timing::setTimingFunction(2);
      timing::setNumberExecution(1000000);
      timing::Execute(15);
      timing::CalcResult();
}

The result gives:

fflayol@local:~/Perso/timingperf$ ./ex1.out 
Begin [ FIB ]
End   [ FIB ]
Begin [ FIB-OPTIM ]
End   [ FIB-OPTIM ]
Begin [ FIB ]
End [ FIB ]
Begin [ FIB-OPTIM ]
End [ FIB-OPTIM ]
fflayol@local:~/Perso/timingperf$ ./ex1.out
 Begin [ FIB ] End [ FIB ] 
Begin [ FIB-OPTIM ] End [ FIB-OPTIM ] 
Begin [ FIB ] End [ FIB ] 
Begin [ FIB-OPTIM ] End [ FIB-OPTIM ]
|--------------------------------------------------------------------|
|---Name--------Timer-----Duration------Diff-------Min-------Diff----|
|          |           |           |           |         |           |
|      FIB |     RDTSC |      6326 |      100 %|     5906|     100 % | 
|FIB-OPTIM |     RDTSC |       158 |   4003.8 %|      137| 4310.95 % | 
|      FIB |    RDTSCP |      6354 |  99.5593 %|     5916|  99.831 % | 
|FIB-OPTIM |    RDTSCP |       208 |  3041.35 %|      180| 3281.11 % |

The difference here is very important because our optimization version is 40x faster 🙂
An another nice feature is to check compiler optimization quality. You can change the compiler optimization with -O option followed by a numer (0 to 3, the higher the better).

g++ ex1.cpp -std=c++11 -o ex1.out ; g++ -O3 ex1.cpp -std=c++11 -o ex1-opt.out
|---Name------Timer---Duration------Diff------Min------Diff- |
| FIB      | RDTSC | 2253      | 100 %     | 2080|   100 %   |
|FIB-OPTIM | RDTSC | 31        | 7267.74 % | 24  | 8666.67 % |
| FIB      | RDTSCP| 2304      | 97.7865 % | 2108| 98.6717 % |
|FIB-OPTIM | RDTSCP| 51        | 4417.65 % | 45  | 4622.22 % | |------------------------------------------------------------|

As a result both version are faster (3x for the functional version and 5x with the iterative version). As a result the difference of performance is still higher.

Evaluation of std::chrono

If you have a look around in the web, a solution to correctly measure time is to use a new C++ package: std::chrono , which is part of the standard C++ library.
So the aim of this article is to investigate if this solution can be used to have a very high resolution timer. If you remember well as we are doing small improvement we want to be able to measure the improvement (or degradation). of optimization.

First step is to

#include <chrono>
#include <ratio>
#include <climits>
#include <algorithm>    // std::max
int main()
{
  long long  value = 0;
  double max = LONG_MIN ;
  double min = LONG_MAX;
  for (int i=  1;i<100;i++){

    auto startInitial = std::chrono::high_resolution_clock::now();              
    auto endInitial = std::chrono::high_resolution_clock::now();
    std::chrono::duration<double, std::nano > elapsedInitial = (endInitial - startInitial) ;
    max = std::max(max,elapsedInitial.count());
    min = std::min(min,elapsedInitial.count());
    value=value+elapsedInitial.count();
  }
  std::cout <<"Sum for 100 loop"<<value<<" " <<value/100<<"ns"std::endl;
  std::cout<<" Max:" <<max <<"ns Min:"<<min<<"ns"<<std::endl;
}
fflayol@:/tmp$ g++ test1.c  -std=c++11;./a.out 
Sum for 100 loop: 2235
Mean: 22ns
Max : 53ns
Min : 21ns

This example shows that the call last at means 20ns which is quite too long for our purpose.
Indeed if we are trying to be more accurate:

#include <iostream>
#include <chrono>
#include <ratio>
#include <climits>
#include <algorithm>    // std::max
int main()
{
  {  
    long long  value = 0;
    double max = LONG_MIN ;
    double min = LONG_MAX;
    for (int i=  1;i<100;i++){
  
      auto startInitial = std::chrono::high_resolution_clock::now();              
      auto endInitial = std::chrono::high_resolution_clock::now();
      std::chrono::duration<double, std::nano > elapsedInitial = (endInitial - startInitial) ;
      max = std::max(max,elapsedInitial.count());
      min = std::min(min,elapsedInitial.count());
      value=value+elapsedInitial.count();
    }
    std::cout <<"Sum for 100 loop"<<value<<" " <<value/100<<"ns"<<std::endl;
    std::cout<<" Max:" <<max <<"ns Min:"<<min<<"ns"<<std::endl;
  }
  std::cout <<"Second function"<<std::endl;
  { 
    long long  value = 0;
    double max = LONG_MIN ;
    double min = LONG_MAX;
    for (int i=  1;i<100;i++){
 
      auto startInitial = std::chrono::high_resolution_clock::now();

      asm("nop");
      asm("nop");
      asm("nop");
      asm("nop");
      asm("nop");
      asm("nop");
      asm("nop");
      asm("nop");
      asm("nop");
      asm("nop");
      asm("nop");
      asm("nop");
      asm("nop");
      asm("nop");
      asm("nop");
      asm("nop");
      asm("nop");
      asm("nop");
      auto endInitial = std::chrono::high_resolution_clock::now();
      std::chrono::duration<double, std::nano > elapsedInitial = (endInitial - startInitial) ;
      max = std::max(max,elapsedInitial.count());
      min = std::min(min,elapsedInitial.count());
      value=value+elapsedInitial.count();
    }
    std::cout <<"Sum for 100 loop"<<value<<" " <<value/100<<"ns"<<std::endl;
    std::cout<<" Max:" <<max <<"ns Min:"<<min<<"ns"<<std::endl;
  }
}