Category: Code Optimization

Testing different implementation of SHA3

29 August 2018

Using the I can now choose which implementation of algorithm we can use for a specific task and how efficient are the optimization provided

The official keccak gives a list of several implementation (they can be found here https://keccak.team/software.html) The aim of this article is to evaluate them, try to optimize them and choose the fastest one.

Feahashmap

The code can be found here https://sourceforge.net/projects/fehashmac/ .

Personal Opinion

Includes 39 encryption algorithms. Code looks clear, well written and efficient. So if you want to understand

Assembly / C / C++ / Code Optimization

Applications and examples for Timing library

14 August 2018

Hi,

as you’ve already know I’ve developed a small library to asset performance of code. The aim of this library is to objectively report how long a function last. It would help you choose an implementation of an algorithm for example.
To illustrate that I will take fibonnaci function as an example. I know at least two implementations of this function, one functional and an another iterative. The strong point of the functional version is that it is easier to implement and to understand as contrary to the iterative. But for large number I suspect the functional version to be very slow but I can not say to what point.

int __inline__ fibonacci(int n)
{
      if (n < 3)
            return 1;
      return fibonacci(n - 1) + fibonacci(n - 2);
}

int __inline__ fibonacciOptim(int n)
{
      int first = 0, second = 1, next, c;
      for (c = 0; c < n; c++)
      {
            if (c <= 1)
                  next = c;
            else
            {
                  next = first + second;
                  first = second;
                  second = next;
            }
      }
      return next;
}

It is now time to use the Timing Library
First clone the repository

git clone https://github.com/fflayol/timingperf.git
cd timingperf

Then create a main to set values of the library and add the different version of your code:

#include "performancetiming.hpp"
int __inline__ fibonacci(int n)
{
      if (n < 3)
            return 1;
      return fibonacci(n - 1) + fibonacci(n - 2);
}

int __inline__ fibonacciOptim(int n)
{
      int first = 0, second = 1, next, c;
      for (c = 0; c < n; c++)
      {
            if (c <= 1)
                  next = c;
            else
            {
                  next = first + second;
                  first = second;
                  second = next;
            }
      }
      return next;
}

int main(int argc, char **argv)
{
      timing::addFunction("FIB", fibonacci);
      timing::addFunction("FIB-OPTIM", fibonacciOptim);
      timing::setTimingFunction(2);
      timing::setNumberExecution(1000000);
      timing::Execute(15);
      timing::CalcResult();
}

The result gives:

fflayol@local:~/Perso/timingperf$ ./ex1.out 
Begin [ FIB ]
End   [ FIB ]
Begin [ FIB-OPTIM ]
End   [ FIB-OPTIM ]
Begin [ FIB ]
End [ FIB ]
Begin [ FIB-OPTIM ]
End [ FIB-OPTIM ]

fflayol@local:~/Perso/timingperf$ ./ex1.out
 Begin [ FIB ] End [ FIB ] 
Begin [ FIB-OPTIM ] End [ FIB-OPTIM ] 
Begin [ FIB ] End [ FIB ] 
Begin [ FIB-OPTIM ] End [ FIB-OPTIM ]

|--------------------------------------------------------------------|
|---Name--------Timer-----Duration------Diff-------Min-------Diff----|
|          |           |           |           |         |           |
|      FIB |     RDTSC |      6326 |      100 %|     5906|     100 % | 
|FIB-OPTIM |     RDTSC |       158 |   4003.8 %|      137| 4310.95 % | 
|      FIB |    RDTSCP |      6354 |  99.5593 %|     5916|  99.831 % | 
|FIB-OPTIM |    RDTSCP |       208 |  3041.35 %|      180| 3281.11 % |

The difference here is very important because our optimization version is 40x faster 🙂
An another nice feature is to check compiler optimization quality. You can change the compiler optimization with -O option followed by a numer (0 to 3, the higher the better).

g++ ex1.cpp -std=c++11 -o ex1.out ; g++ -O3 ex1.cpp -std=c++11 -o ex1-opt.out

|---Name------Timer---Duration------Diff------Min------Diff- |
| FIB      | RDTSC | 2253      | 100 %     | 2080|   100 %   | 
|FIB-OPTIM | RDTSC | 31        | 7267.74 % | 24  | 8666.67 % | 
| FIB      | RDTSCP| 2304      | 97.7865 % | 2108| 98.6717 % | 
|FIB-OPTIM | RDTSCP| 51        | 4417.65 % | 45  | 4622.22 % | |------------------------------------------------------------|

As a result both version are faster (3x for the functional version and 5x with the iterative version). As a result the difference of performance is still higher.

Assembly / C++ / Code Optimization

How to Benchmark Code Execution Times on Intel, the most accurate way

15 March 2018

sudo apt-get install linux-headers-generic
module_init(hello_start);
module_exit(hello_end);
Freely inspired from https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/ia-32-ia-64-benchmark-code-execution-paper.pdf

C++ / Code Optimization / Timer

Evaluation of std::chrono

16 February 2018

If you have a look around in the web, a solution to correctly measure time is to use a new C++ package: std::chrono , which is part of the standard C++ library.
So the aim of this article is to investigate if this solution can be used to have a very high resolution timer. If you remember well as we are doing small improvement we want to be able to measure the improvement (or degradation). of optimization.

First step is to

#include <chrono>
#include <ratio>
#include <climits>
#include <algorithm>    // std::max
int main()
{
  long long  value = 0;
  double max = LONG_MIN ;
  double min = LONG_MAX;
  for (int i=  1;i<100;i++){

    auto startInitial = std::chrono::high_resolution_clock::now();              
    auto endInitial = std::chrono::high_resolution_clock::now();
    std::chrono::duration<double, std::nano > elapsedInitial = (endInitial - startInitial) ;
    max = std::max(max,elapsedInitial.count());
    min = std::min(min,elapsedInitial.count());
    value=value+elapsedInitial.count();
  }
  std::cout <<"Sum for 100 loop"<<value<<" " <<value/100<<"ns"std::endl;
  std::cout<<" Max:" <<max <<"ns Min:"<<min<<"ns"<<std::endl;
}

fflayol@:/tmp$ g++ test1.c  -std=c++11;./a.out 
Sum for 100 loop: 2235
Mean: 22ns
Max : 53ns
Min : 21ns

This example shows that the call last at means 20ns which is quite too long for our purpose.
Indeed if we are trying to be more accurate:

#include <iostream>
#include <chrono>
#include <ratio>
#include <climits>
#include <algorithm>    // std::max
int main()
{
  {  
    long long  value = 0;
    double max = LONG_MIN ;
    double min = LONG_MAX;
    for (int i=  1;i<100;i++){
  
      auto startInitial = std::chrono::high_resolution_clock::now();              
      auto endInitial = std::chrono::high_resolution_clock::now();
      std::chrono::duration<double, std::nano > elapsedInitial = (endInitial - startInitial) ;
      max = std::max(max,elapsedInitial.count());
      min = std::min(min,elapsedInitial.count());
      value=value+elapsedInitial.count();
    }
    std::cout <<"Sum for 100 loop"<<value<<" " <<value/100<<"ns"<<std::endl;
    std::cout<<" Max:" <<max <<"ns Min:"<<min<<"ns"<<std::endl;
  }
  std::cout <<"Second function"<<std::endl;
  { 
    long long  value = 0;
    double max = LONG_MIN ;
    double min = LONG_MAX;
    for (int i=  1;i<100;i++){
 
      auto startInitial = std::chrono::high_resolution_clock::now();

      asm("nop");
      asm("nop");
      asm("nop");
      asm("nop");
      asm("nop");
      asm("nop");
      asm("nop");
      asm("nop");
      asm("nop");
      asm("nop");
      asm("nop");
      asm("nop");
      asm("nop");
      asm("nop");
      asm("nop");
      asm("nop");
      asm("nop");
      asm("nop");
      auto endInitial = std::chrono::high_resolution_clock::now();
      std::chrono::duration<double, std::nano > elapsedInitial = (endInitial - startInitial) ;
      max = std::max(max,elapsedInitial.count());
      min = std::min(min,elapsedInitial.count());
      value=value+elapsedInitial.count();
    }
    std::cout <<"Sum for 100 loop"<<value<<" " <<value/100<<"ns"<<std::endl;
    std::cout<<" Max:" <<max <<"ns Min:"<<min<<"ns"<<std::endl;
  }
}

Code Optimization / Software / Timer

Std Chrono, a high resolution timer ?

15 February 2018

#include <iostream>
#include <string>
#include <vector>
#include <functional>
#include <chrono>
#include <smmintrin.h>
#include <unistd.h>
#include <glm.hpp>
#include <gtx/simd_vec4.hpp>
#include <gtx/simd_mat4.hpp>
#include <gtc/type_ptr.hpp>
#include <immintrin.h>


namespace ch = std::chrono;

const int Iter = 1<<28;

void RunBench_GLM()
{
	glm::vec4 v(1.0f);
	glm::vec4 v2;
	glm::mat4 m(1.0f);
	
	for (int i = 0; i < Iter; i++)
	{
		v2 += m * v;
	}

	auto t = v2;
	std::cout << t.x << " " << t.y << " " << t.z << " " << t.w << std::endl;
}


void RunBench_GLM_SIMD()
{
	glm::detail::fvec4SIMD v(1.0f);
	glm::detail::fvec4SIMD v2(0.0f);
	glm::detail::fmat4x4SIMD m(1.0f);

	for (int i = 0; i < Iter; i++)
	{
		v2 += v * m;
	}

	auto t = glm::vec4_cast(v2);
	std::cout << t.x << " " << t.y << " " << t.z << " " << t.w << std::endl;
}



void RunBench_Double_GLM()
{
	glm::dvec4 v(1.0);
	glm::dvec4 v2;
	glm::dmat4 m(1.0);

	for (int i = 0; i < Iter; i++)
	{
		v2 += v * m;
	}

	auto t = v2;
	std::cout << t.x << " " << t.y << " " << t.z << " " << t.w << std::endl;
}

void RunBench_Double_AVX()
{
	__m256d v = _mm256_set_pd(1, 1, 1, 1);
	__m256d s = _mm256_setzero_pd();
	__m256d m[4] =
	{
		_mm256_set_pd(1, 0, 0, 0),
		_mm256_set_pd(0, 1, 0, 0),
		_mm256_set_pd(0, 0, 1, 0),
		_mm256_set_pd(0, 0, 0, 1)
	};

	for (int i = 0; i < Iter; i++)
	{
		__m256d v0 = _mm256_shuffle_pd(v, v, _MM_SHUFFLE(0, 0, 0, 0));
		__m256d v1 = _mm256_shuffle_pd(v, v, _MM_SHUFFLE(1, 1, 1, 1));
		__m256d v2 = _mm256_shuffle_pd(v, v, _MM_SHUFFLE(2, 2, 2, 2));
		__m256d v3 = _mm256_shuffle_pd(v, v, _MM_SHUFFLE(3, 3, 3, 3));

		__m256d m0 = _mm256_mul_pd(m[0], v0);
		__m256d m1 = _mm256_mul_pd(m[1], v1);
		__m256d m2 = _mm256_mul_pd(m[2], v2);
		__m256d m3 = _mm256_mul_pd(m[3], v3);

		__m256d a0 = _mm256_add_pd(m0, m1);
		__m256d a1 = _mm256_add_pd(m2, m3);
		__m256d a2 = _mm256_add_pd(a0, a1);

		s = _mm256_add_pd(s, a2);
	}

	double t[4];
	_mm256_store_pd(t, s);
	std::cout << t[0] << " " << t[1] << " " << t[2] << " " << t[3] << std::endl;
}

int main()
{
	std::vector<std::pair<std::string, std::function<void ()>>> benches;
	benches.push_back(std::make_pair("GLM", RunBench_GLM));
	benches.push_back(std::make_pair("GLM_SIMD", RunBench_GLM_SIMD));
	benches.push_back(std::make_pair("Double_GLM", RunBench_Double_GLM));
	benches.push_back(std::make_pair("Double_AVX", RunBench_Double_AVX));
	auto startInitial = ch::high_resolution_clock::now();
        
        for (int i=0;i<500000;i++){
          asm("NOP");
        }
        
        auto endInitial = ch::high_resolution_clock::now();

	double elapsedInitial = (double)ch::duration_cast<ch::milliseconds>(endInitial - startInitial).count() ;
	std::cout << "resolution :" <<elapsedInitial <<std::endl;
	for (auto& bench : benches)
	{
		std::cout << "Begin [ " << bench.first << " ]" << std::endl;

		auto start = ch::high_resolution_clock::now();
		bench.second();
		auto end = ch::high_resolution_clock::now();	

		double elapsed = (double)ch::duration_cast<ch::milliseconds>(end - start).count() / 1000.0;
		std::cout << "End [ " << bench.first << " ] : " << elapsed << " seconds" << std::endl;
	}
	
	std::cin.get();
	return 0;
}

Assembly / C / Code Optimization

Shuffle in SSE

15 February 2018

#include <immintrin.h>
#include <stdio.h>
void test(int32_t *Y, int32_t *X)
{
        __m128i *v1  __attribute__((aligned (16)));

        __m128i *v2 __attribute__((aligned (16)));
        __m128i v3 __attribute__((aligned (16)));
        __m128i  v4 __attribute__((aligned (16)));
        int32_t * rslt;
        int64_t * rslt64;
        v1 = (__m128i *) X;
        v2 = (__m128i *) Y;
        rslt = (int32_t * ) v1;
      printf("In test, V1  after MUL  SHUFFLE: %d\t%d\t%d\t%d\t\n", rslt[0], rslt[1], rslt[2], rslt[3]);
      rslt = (int32_t * ) v2;
      printf("In test, V2 before  MUL SHUFFLE: %d\t%d\t%d\t%d\t\n", rslt[0], rslt[1], rslt[2], rslt[3]);
        v3 =  _mm_mul_epi32(*v1, *v2);
        v4 = _mm_mul_epi32(_mm_shuffle_epi32(*v1, _MM_SHUFFLE(2, 3, 0, 1)), _mm_shuffle_epi32(*v2, _MM_SHUFFLE(2, 3, 0, 1)));

        rslt64 = (int64_t * ) &v3;
        printf("In REDC, product before SHUFFLE: %ldt%ldn", rslt64[0], rslt64[1]);

        rslt64 = (int64_t * ) &v4;
        printf("In REDC, product after  SHUFFLE: %ldt%ldn", rslt64[0], rslt64[1]);

        rslt = (int32_t * ) v1;
        printf("In REDC, 4-way vect before  SHUFFLE: %dt%dt%dt%dtn", rslt[0], rslt[1], rslt[2], rslt[3]);

        *v1 = _mm_shuffle_epi32(*v1, _MM_SHUFFLE(2, 3, 0, 1));
        rslt = (int32_t * ) v1;
        printf("In REDC, 4-way vect after  SHUFFLE: %dt%dt%dt%dtn", rslt[0], rslt[1], rslt[2], rslt[3]);
}
int main (int nb, char** argv){
   int32_t a = (int32_t)1234;
   int32_t b = (int32_t)5678;
    test(&a,&b);
}

Assembly / C / Code Optimization / Software

Quest for the ultimate timer framework

12 February 2018

A lot stuff on this blog talks about code optimization, and sometime very small improvement that have performance minimal impacts but in case they are called a lot of time it became difficult to ensure optimization are useful. Let’s take an example. Imagine you have a function that lasts one millisecond. You are optimizing this function and as a result you found two solutions to optimize your code . But if you are using a timer that lasts 0.5 milliseconds, you won’t be able to choose one of this other. The aim of this article is to help you to understand ???

Throughput of the algorithm

In this case

Pro:

Easy to implement
Can cover a full range of service, or the lifetime

Con

Can be difficult to implement as you invert the timing (aka how many cycles I’ve done in one minute for instance)
Initial conditions can be impossible to reproduce
Program should maintain this feature

Example:

Our ethminer with -m option.

Time with linux command

The time command, is an Unix/Linux standard line command. It will use the internal timer to return the time elapsed by the command.

time ls -l 

real	0m0.715s
user	0m0.000s
sys	0m0.004s

You know that in this blog I like to use example. Imagine you have a 3D application that do a lot complex mathematical calculus function (square root, cosines,..). You know that theses functions are called a lot of time (billion per second). As we have already seen a small improvement in this function can have very strong impact on all the program. Now you know two way to implement this calculus in C by using the standard mathematic library or in assembler which is a bit complex to do but you might achieve better performance. The method I’m gonna to present can be also used when you have two implementations of the same feature and you don’t know which one to choose. If you have to choose how fast is the new version, and do it worth the pain to do it in assembler as the code becomes difficult to maintain.

#include <math.h>
inline double Calc_c(double x,double y,double z){
double tmpx = sqrt(x)*cos(x)/sin(x);
double tmpy = sqrt(y)*cos(y)/sin(y);
double tmpz = sqrt(z)*cos(z)/sin(z);
return (tmpx+tmpy+tmpz)*tmpx+tmpy+tmpz
}

inline double Calc_as(double x,double y,double z){

    __m512d a1  _mm512_set4_pd(x,y,z,0.0);
    
}

We know that the assembler version will be faster but to which value ?

Assembly / C / Code Optimization / Ethereum

SHA3

9 February 2018

Preimage attack on Keccak-512 reduced to 8 rounds, requiring 2511.5 time and 2508 memory[2]
Zero-sum distinguishers exist for the full 24-round Keccak-f[1600], though they cannot be used to attack the hash function itself[3]

SHA-3 (Secure Hash Algorithm 3) is the latest member of the Secure Hash Algorithm family of standards, released by NIST on August 5, 2015.[4][5] The reference implementation source code was dedicated to public domain via CC0 waiver.[6] Although part of the same series of standards, SHA-3 is internally quite different from the MD5-like structure of SHA-1 and SHA-2.

SHA-3 is a subset of the broader cryptographic primitive family Keccak (/ˈkɛtʃæk/, or /ˈkɛtʃɑːk/),[7][8] designed by Guido Bertoni, Joan Daemen, Michaël Peeters, and Gilles Van Assche,

Keccak is based on a novel approach called sponge construction. Sponge construction is based on a wide random function or random permutation, and allows inputting (“absorbing” in sponge terminology) any amount of data, and outputting (“squeezing”) any amount of data, while acting as a pseudorandom function with regard to all previous inputs. This leads to great flexibility.

SHA-3 uses the sponge construction,[13][23] in which data is “absorbed” into the sponge, then the result is “squeezed” out. In the absorbing phase, message blocks are XORed into a subset of the state, which is then transformed as a whole using a permutation function f. In the “squeeze” phase, output blocks are read from the same subset of the state, alternated with the state transformation function f. The size of the part of the state that is written and read is called the “rate” (denoted r), and the size of the part that is untouched by input/output is called the “capacity” (denoted c). The capacity determines the security of the scheme. The maximum security level is half the capacity.

Given an input bit string N, a padding function pad, a permutation function f that operates on bit blocks of width b, a rate r and an output length d, we have capacity c = b − r and the sponge construction Z = sponge[f,pad,r](N,d), yielding a bit string Z of length d, works as follows:[24]:18

pad the input N using the pad function, yielding a padded bit string P with a length divisible by r (such that n = len(P)/r is integer),
break P into n consecutive r-bit pieces P0, …, Pn-1
initialize the state S to a string of b 0 bits.
absorb the input into the state: For each block Pi,
extend Pi at the end by a string of c 0 bits, yielding one of length b,
XOR that with S and
apply the block permutation f to the result, yielding a new state S
initialize Z to be the empty string
while the length of Z is less than d:
append the first r bits of S to Z
if Z is still less than d bits long, apply f to S, yielding a new state S.
truncate Z to d bits

The fact that the internal state S contains c additional bits of information in addition to what is output to Z prevents the length extension attacks that SHA-2, SHA-1, MD5 and other hashes based on the Merkle–Damgård construction are susceptible to.

In SHA-3, the state S consists of a 5 × 5 array of w = 64-bit words, b = 5 × 5 × w = 5 × 5 × 64 = 1600 bits total. Keccak is also defined for smaller power-of-2 word sizes w down to 1 bit (25 bits total state). Small state sizes can be used to test cryptanalytic attacks, and intermediate state sizes (from w = 8, 200 bits, to w = 32, 800 bits) can be used in practical, lightweight applications.[11][12]

For SHA-3-224, SHA-3-256, SHA-3-384, and SHA-3-512 instances, r is greater than d, so there is no need for additional block permutations in the squeezing phase; the leading d bits of the state are the desired hash. However, SHAKE-128 and SHAKE-256 allow an arbitrary output length, which is useful in applications such as optimal asymmetric encryption padding.
Padding

To ensure the message can be evenly divided into r-bit blocks, padding is required. SHA-3 uses the pattern 10*1 in its padding function: a 1 bit, followed by zero or more 0 bits (maximum r − 1) and a final 1 bit.

The maximum of r − 1 0 bits occurs when the last message block is r − 1 bits long. Then another block is added after the initial 1 bit, containing r − 1 0 bits before the final 1 bit.

The two 1 bits will be added even if the length of the message is already divisible by r.[24]:5.1 In this case, another block is added to the message, containing a 1 bit, followed by a block of r − 2 0 bits and another 1 bit. This is necessary so that a message with length divisible by r ending in something that looks like padding does not produce the same hash as the message with those bits removed.

The initial 1 bit is required so messages differing only in a few additional 0 bits at the end do not produce the same hash.

The position of the final 1 bit indicates which rate r was used (multi-rate padding), which is required for the security proof to work for different hash variants. Without it, different hash variants of the same short message would be the same up to truncation.
The block permutation

The block transformation f, which is Keccak-f[1600] for SHA-3, is a permutation that uses xor, and and not operations, and is designed for easy implementation in both software and hardware.

It is defined for any power-of-two word size, w = 2ℓ bits. The main SHA-3 submission uses 64-bit words, ℓ = 6.

The state can be considered to be a 5 × 5 × w array of bits. Let a[i][ j][k] be bit (5i + j) × w + k of the input, using a little-endian bit numbering convention and row-major indexing. I.e. i selects the row, j the column, and k the bit.

Index arithmetic is performed modulo 5 for the first two dimensions and modulo w for the third.

The basic block permutation function consists of 12 + 2ℓ rounds of five steps, each individually very simple:

θ
Compute the parity of each of the 5w (320, when w = 64) 5-bit columns, and exclusive-or that into two nearby columns in a regular pattern. To be precise, a[i][ j][k] ← a[i][ j][k] ⊕ parity(a[0…4][ j−1][k]) ⊕ parity(a[0…4][ j+1][k−1])
ρ
Bitwise rotate each of the 25 words by a different triangular number 0, 1, 3, 6, 10, 15, …. To be precise, a[0][0] is not rotated, and for all 0 ≤ t < 24, a[i][ j][k] ← a[i][ j][k−(t+1)(t+2)/2], where ( i j ) = ( 3 2 1 0 ) t ( 0 1 ) {\displaystyle {\begin{pmatrix}i\\j\end{pmatrix}}={\begin{pmatrix}3&2\\1&0\end{pmatrix}}^{t}{\begin{pmatrix}0\\1\end{pmatrix}}} {\begin{pmatrix}i\\j\end{pmatrix}}={\begin{pmatrix}3&2\\1&0\end{pmatrix}}^{t}{\begin{pmatrix}0\\1\end{pmatrix}}. π Permute the 25 words in a fixed pattern. a[j][2i+3j] ← a[ i][j]. χ Bitwise combine along rows, using x ← x ⊕ (¬y & z). To be precise, a[i][ j][k] ← a[i][ j][k] ⊕ (¬a[i][ j+1][k] & a[i][ j+2][k]). This is the only non-linear operation in SHA-3. ι Exclusive-or a round constant into one word of the state. To be precise, in round n, for 0 ≤ m ≤ ℓ, a[0][0][2m−1] is exclusive-ORed with bit m + 7n of a degree-8 LFSR sequence. This breaks the symmetry that is preserved by the other steps. Speed The speed of SHA-3 hashing of long messages is dominated by the computation of f = Keccak-f[1600] and XORing S with the extended Pi, an operation on b = 1600 bits. However, since the last c bits of the extended Pi are 0 anyway, and XOR with 0 is a noop, it is sufficient to perform XOR operations only for r bits (r = 1600 − 2 × 224 = 1152 bits for SHA3-224, 1088 bits for SHA3-256, 832 bits for SHA3-384 and 576 bits for SHA3-512). The lower r is (and, conversely, the higher c = b − r = 1600 − r), the less efficient but more secure the hashing becomes since fewer bits of the message can be XORed into the state (a quick operation) before each application of the computationally expensive f. Keccak[c](N, d) = sponge[Keccak-f[1600], pad10*1, r](N, d)[24]:20 Keccak-f[1600] = Keccak-p[1600, 24][24]:17 c is the capacity r is the rate = 1600 − c N is the input bit string

Assembly / C / Code Optimization / Sse tutorial

SSE usage Tutorial

30 January 2018

Hi all,

SSE usage is a bit tricky

You have to see see registers as vectorial not a linear and their size is depending of the context.

For instance xmm0 (a SSE 128 bytes register)can be seen as 2*64 bits register or 4*32 bits register or 8*16 bits register or 16*8 bits register.

So what is the aim of this ?

If you want to add two arrays the algorithm wil be

int x [4] ; //let's say that sizeof(int) =32
int y[4];
int z[4];

c[0]=a[0]b[0]
c[1]=a[1]+b[1]
c[2]=a[2]+b[2]
c[3]=a[3]+b[3]

In assembler it gives :

       .loc 1 6 0
        mov     edx, DWORD PTR [rbp-48]
        mov     eax, DWORD PTR [rbp-32]
        add     eax, edx
        mov     DWORD PTR [rbp-16], eax
        .loc 1 7 0
        mov     edx, DWORD PTR [rbp-44]
        mov     eax, DWORD PTR [rbp-28]
        add     eax, edx
        mov     DWORD PTR [rbp-12], eax
        .loc 1 8 0
        mov     edx, DWORD PTR [rbp-40]
        mov     eax, DWORD PTR [rbp-24]
        add     eax, edx
        mov     DWORD PTR [rbp-8], eax
        .loc 1 9 0
        mov     edx, DWORD PTR [rbp-36]
        mov     eax, DWORD PTR [rbp-20]
        add     eax, edx
        mov     DWORD PTR [rbp-4], eax
        mov     eax, 0

For instance c[0] = a[0]+b[0] is generated like that:

       .loc 1 6 0
        mov     edx, DWORD PTR [rbp-48] ; a[0]
        mov     eax, DWORD PTR [rbp-32] ; b[0]
        add     eax, edx ;eax <- a[0]+b[0]
        mov     DWORD PTR [rbp-16], eax ; c[0] = eaz

And we are doing that 4 times. But thanks to the SSE extension operator we can do it with less instructions

The Streaming SIMD Extensions enhance the x86 architecture in four ways:

8 new 128-bit SIMD floating-point registers that can be directly addressed;
50 new instructions that work on packed floating-point data;
8 new instructions designed tocontrol cacheability of all MMX and 32-bit x86 data types, including the ability to stream data to memory without polluting the caches, and to prefetch data before it is actually used;
12 new instructions that extend the instruction set.

This set enables the programmer to develop algorithms that can mix packed, single-precision, floating-point and integer using both SSE and MMX instructions respectively.

Intel SSE provides eight 128-bit general-purpose registers, each of which can be directly addressed using the register names XMM0 to XMM7. Each register consists of four 32-bit single precision, floating-point numbers, numbered 0 through 3.

SSE instructions operate on either all or the least significant pairs of packed data operands in parallel. The packed instructions (with PS suffix) operate on a pair of operands, while scalar instructions (with SS suffix) always operate on the least significant pair of the two operands; for scalar operations, the three upper components from the first operand are passed through to the destination.

There are two ways to use SSE registers

Scalar the same 4 instructions on 4 datas

Packed

(thanks to Stefano Tommesani)

So let’s return to our code. I think you gonna understand where I want to go. I we fill two registers with 4 values (a[0]..a[3]) in one register and (c[0]..c[3]), add them together and put the result in a third register. With this solution we will do only one addition.

#include 
#include 
#include 

void p128_hex_u8(__m128i in) {
    uint8_t v[16];
    _mm_store_si128((__m128i*)v, in);
    printf("v16_u8: %x %x %x %x | %x %x %x %x | %x %x %x %x | %x %x %x %xn",
           v[0], v[1],  v[2],  v[3],  v[4],  v[5],  v[6],  v[7],
           v[8], v[9], v[10], v[11], v[12], v[13], v[14], v[15]);
}
 
void p128_hex_u16(__m128i in) {
    uint16_t v[8];
    _mm_store_si128((__m128i*)v, in);
    printf("v8_u16: %x %x %x %x,  %x %x %x %xn", v[0], v[1], v[2], v[3], v[4], v[5], v[6], v[7]);
}
 
void p128_hex_u32(__m128i in) {
    uint32_t v[4] __attribute__((aligned (16)));
    _mm_store_si128((__m128i*)v, in);
    printf("v4_u32: %x %x %x %xn", v[0], v[1], v[2], v[3]);
}
 
void p128_dec_u32(__m128i in) {
    uint32_t v[4] __attribute__((aligned (16)));
    _mm_store_si128((__m128i*)v, in);
    printf("v4_u32: %d %d %d %dn",(uint32_t) v[0], (uint32_t) v[1], (uint32_t)v[2],(uint32_t) v[3]);
}

void p128_hex_u64(__m128i in) {
    long long v[2];  // uint64_t might give format-string warnings with %llx; it's just long in some ABIs
    _mm_store_si128((__m128i*)v, in);
    printf("v2_u64: %llx %llxn", v[0], v[1]);
}
 
 
int main(){
uint32_t a [4] ={1,2,3,4}; //let's say that sizeof(int) = 32
uint32_t b[4] = {11,12,13,14};
uint32_t c[4];
 
c[0]=a[0]+b[0];
c[1]=a[1]+b[1];
c[2]=a[2]+b[2];
c[3]=a[3]+b[3];
printf("Result %d %d %d %dn",c[0],c[1],c[2],c[3]);
 
     __m128i a1 = _mm_set_epi32(a[3], a[2], a[1], a[0]);
     __m128i b1 = _mm_set_epi32(b[3], b[2], b[1], b[0]);
     __m128i c1 = _mm_add_epi32(a1, b1);
     p128_dec_u32(a1);
     p128_dec_u32(b1);
     p128_dec_u32(c1);
}

This a very simple example, as your compiler can already optimize your code with this

C / Code Optimization / Ethereum

Ethminer Optimization part 2

26 January 2018

In the previous article we started sha* functions optimizations, now as we previously seen a large bottleneck performance is in internal.c.

/*
  This file is part of ethash.

  ethash is free software: you can redistribute it and/or modify
  it under the terms of the GNU General Public License as published by
  the Free Software Foundation, either version 3 of the License, or
  (at your option) any later version.

  ethash is distributed in the hope that it will be useful,
  but WITHOUT ANY WARRANTY; without even the implied warranty of
  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.	See the
  GNU General Public License for more details.

  You should have received a copy of the GNU General Public License
  along with cpp-ethereum.	If not, see <http://www.gnu.org/licenses/>.
*/
/** @file internal.c
* @author Tim Hughes <tim@twistedfury.com>
* @author Matthew Wampler-Doty
* @date 2015
*/

#include <assert.h>
#include <inttypes.h>
#include <stddef.h>
#include <errno.h>
#include <math.h>
#include "mmap.h"
#include "ethash.h"
#include "fnv.h"
#include "endian.h"
#include "internal.h"
#include "data_sizes.h"
#include "io.h"

#ifdef WITH_CRYPTOPP

#include "sha3_cryptopp.h"

#else
#include "sha3.h"
#endif // WITH_CRYPTOPP

uint64_t ethash_get_datasize(uint64_t const block_number)
{
	assert(block_number / ETHASH_EPOCH_LENGTH < 2048);
	return dag_sizes[block_number / ETHASH_EPOCH_LENGTH];
}

uint64_t ethash_get_cachesize(uint64_t const block_number)
{
	assert(block_number / ETHASH_EPOCH_LENGTH < 2048);
	return cache_sizes[block_number / ETHASH_EPOCH_LENGTH];
}

// Follows Sergio's "STRICT MEMORY HARD HASHING FUNCTIONS" (2014)
// https://bitslog.files.wordpress.com/2013/12/memohash-v0-3.pdf
// SeqMemoHash(s, R, N)
static bool ethash_compute_cache_nodes(
	node* const nodes,
	uint64_t cache_size,
	ethash_h256_t const* seed
)
{
	if (cache_size % sizeof(node) != 0) {
		return false;
	}
	uint32_t const num_nodes = (uint32_t) (cache_size / sizeof(node));

	SHA3_512(nodes[0].bytes, (uint8_t*)seed, 32);

	for (uint32_t i = 1; i != num_nodes; ++i) {
		SHA3_512(nodes[i].bytes, nodes[i - 1].bytes, 64);
	}

	for (uint32_t j = 0; j != ETHASH_CACHE_ROUNDS; j++) {
		for (uint32_t i = 0; i != num_nodes; i++) {
			uint32_t const idx = nodes[i].words[0] % num_nodes;
			node data;
			data = nodes[(num_nodes - 1 + i) % num_nodes];
			for (uint32_t w = 0; w != NODE_WORDS; ++w) {
				data.words[w] ^= nodes[idx].words[w];
			}
			SHA3_512(nodes[i].bytes, data.bytes, sizeof(data));
		}
	}

	// now perform endian conversion
	fix_endian_arr32(nodes->words, num_nodes * NODE_WORDS);
	return true;
}

void ethash_calculate_dag_item(
	node* const ret,
	uint32_t node_index,
	ethash_light_t const light
)
{
	uint32_t num_parent_nodes = (uint32_t) (light->cache_size / sizeof(node));
	node const* cache_nodes = (node const *) light->cache;
	node const* init = &cache_nodes[node_index % num_parent_nodes];
	memcpy(ret, init, sizeof(node));
	ret->words[0] ^= node_index;
	SHA3_512(ret->bytes, ret->bytes, sizeof(node));
#if defined(_M_X64) && ENABLE_SSE
	__m128i const fnv_prime = _mm_set1_epi32(FNV_PRIME);
	__m128i xmm0 = ret->xmm[0];
	__m128i xmm1 = ret->xmm[1];
	__m128i xmm2 = ret->xmm[2];
	__m128i xmm3 = ret->xmm[3];
#elif defined(__MIC__)
	__m512i const fnv_prime = _mm512_set1_epi32(FNV_PRIME);
	__m512i zmm0 = ret->zmm[0];
#endif

	for (uint32_t i = 0; i != ETHASH_DATASET_PARENTS; ++i) {
		uint32_t parent_index = fnv_hash(node_index ^ i, ret->words[i % NODE_WORDS]) % num_parent_nodes;
		node const *parent = &cache_nodes[parent_index];

#if defined(_M_X64) && ENABLE_SSE
		{
			xmm0 = _mm_mullo_epi32(xmm0, fnv_prime);
			xmm1 = _mm_mullo_epi32(xmm1, fnv_prime);
			xmm2 = _mm_mullo_epi32(xmm2, fnv_prime);
			xmm3 = _mm_mullo_epi32(xmm3, fnv_prime);
			xmm0 = _mm_xor_si128(xmm0, parent->xmm[0]);
			xmm1 = _mm_xor_si128(xmm1, parent->xmm[1]);
			xmm2 = _mm_xor_si128(xmm2, parent->xmm[2]);
			xmm3 = _mm_xor_si128(xmm3, parent->xmm[3]);

			// have to write to ret as values are used to compute index
			ret->xmm[0] = xmm0;
			ret->xmm[1] = xmm1;
			ret->xmm[2] = xmm2;
			ret->xmm[3] = xmm3;
		}
		#elif defined(__MIC__)
		{
			zmm0 = _mm512_mullo_epi32(zmm0, fnv_prime);

			// have to write to ret as values are used to compute index
			zmm0 = _mm512_xor_si512(zmm0, parent->zmm[0]);
			ret->zmm[0] = zmm0;
		}
		#else
		{
			for (unsigned w = 0; w != NODE_WORDS; ++w) {
				ret->words[w] = fnv_hash(ret->words[w], parent->words[w]);
			}
		}
#endif
	}
	SHA3_512(ret->bytes, ret->bytes, sizeof(node));
}

bool ethash_compute_full_data(
	void* mem,
	uint64_t full_size,
	ethash_light_t const light,
	ethash_callback_t callback
)
{
	if (full_size % (sizeof(uint32_t) * MIX_WORDS) != 0 ||
		(full_size % sizeof(node)) != 0) {
		return false;
	}
	uint32_t const max_n = (uint32_t)(full_size / sizeof(node));
	node* full_nodes = mem;
	double const progress_change = 1.0f / max_n;
	double progress = 0.0f;
	// now compute full nodes
	for (uint32_t n = 0; n != max_n; ++n) {
		if (callback &&
			n % (max_n / 100) == 0 &&
			callback((unsigned int)(ceil(progress * 100.0f))) != 0) {

			return false;
		}
		progress += progress_change;
		ethash_calculate_dag_item(&(full_nodes[n]), n, light);
	}
	return true;
}

static bool ethash_hash(
	ethash_return_value_t* ret,
	node const* full_nodes,
	ethash_light_t const light,
	uint64_t full_size,
	ethash_h256_t const header_hash,
	uint64_t const nonce
)
{
	if (full_size % MIX_WORDS != 0) {
		return false;
	}

	// pack hash and nonce together into first 40 bytes of s_mix
	assert(sizeof(node) * 8 == 512);
	node s_mix[MIX_NODES + 1];
	memcpy(s_mix[0].bytes, &header_hash, 32);
	fix_endian64(s_mix[0].double_words[4], nonce);

	// compute sha3-512 hash and replicate across mix
	SHA3_512(s_mix->bytes, s_mix->bytes, 40);
	fix_endian_arr32(s_mix[0].words, 16);

	node* const mix = s_mix + 1;
	for (uint32_t w = 0; w != MIX_WORDS; ++w) {
		mix->words[w] = s_mix[0].words[w % NODE_WORDS];
	}

	unsigned const page_size = sizeof(uint32_t) * MIX_WORDS;
	unsigned const num_full_pages = (unsigned) (full_size / page_size);

	for (unsigned i = 0; i != ETHASH_ACCESSES; ++i) {
		uint32_t const index = fnv_hash(s_mix->words[0] ^ i, mix->words[i % MIX_WORDS]) % num_full_pages;

		for (unsigned n = 0; n != MIX_NODES; ++n) {
			node const* dag_node;
			node tmp_node;
			if (full_nodes) {
				dag_node = &full_nodes[MIX_NODES * index + n];
			} else {
				ethash_calculate_dag_item(&tmp_node, index * MIX_NODES + n, light);
				dag_node = &tmp_node;
			}

#if defined(_M_X64) && ENABLE_SSE
			{
				__m128i fnv_prime = _mm_set1_epi32(FNV_PRIME);
				__m128i xmm0 = _mm_mullo_epi32(fnv_prime, mix[n].xmm[0]);
				__m128i xmm1 = _mm_mullo_epi32(fnv_prime, mix[n].xmm[1]);
				__m128i xmm2 = _mm_mullo_epi32(fnv_prime, mix[n].xmm[2]);
				__m128i xmm3 = _mm_mullo_epi32(fnv_prime, mix[n].xmm[3]);
				mix[n].xmm[0] = _mm_xor_si128(xmm0, dag_node->xmm[0]);
				mix[n].xmm[1] = _mm_xor_si128(xmm1, dag_node->xmm[1]);
				mix[n].xmm[2] = _mm_xor_si128(xmm2, dag_node->xmm[2]);
				mix[n].xmm[3] = _mm_xor_si128(xmm3, dag_node->xmm[3]);
			}
			#elif defined(__MIC__)
			{
				// __m512i implementation via union
				//	Each vector register (zmm) can store sixteen 32-bit integer numbers
				__m512i fnv_prime = _mm512_set1_epi32(FNV_PRIME);
				__m512i zmm0 = _mm512_mullo_epi32(fnv_prime, mix[n].zmm[0]);
				mix[n].zmm[0] = _mm512_xor_si512(zmm0, dag_node->zmm[0]);
			}
			#else
			{
				for (unsigned w = 0; w != NODE_WORDS; ++w) {
					mix[n].words[w] = fnv_hash(mix[n].words[w], dag_node->words[w]);
				}
			}
#endif
		}

	}

// Workaround for a GCC regression which causes a bogus -Warray-bounds warning.
// The regression was introduced in GCC 4.8.4, fixed in GCC 5.0.0 and backported to GCC 4.9.3 but
// never to the GCC 4.8.x line.
//
// See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=56273
//
// This regression is affecting Debian Jesse (8.5) builds of cpp-ethereum (GCC 4.9.2) and also
// manifests in the doublethinkco armel v5 cross-builds, which use crosstool-ng and resulting
// in the use of GCC 4.8.4.  The Tizen runtime wants an even older GLIBC version - the one from
// GCC 4.6.0!

#if defined(__GNUC__) && (__GNUC__ < 5)
#pragma GCC diagnostic push
#pragma GCC diagnostic ignored "-Warray-bounds"
#endif // define (__GNUC__)

	// compress mix
	for (uint32_t w = 0; w != MIX_WORDS; w += 4) {
		uint32_t reduction = mix->words[w + 0];
		reduction = reduction * FNV_PRIME ^ mix->words[w + 1];
		reduction = reduction * FNV_PRIME ^ mix->words[w + 2];
		reduction = reduction * FNV_PRIME ^ mix->words[w + 3];
		mix->words[w / 4] = reduction;
	}

#if defined(__GNUC__) && (__GNUC__ < 5)
#pragma GCC diagnostic pop
#endif // define (__GNUC__)

	fix_endian_arr32(mix->words, MIX_WORDS / 4);
	memcpy(&ret->mix_hash, mix->bytes, 32);
	// final Keccak hash
	SHA3_256(&ret->result, s_mix->bytes, 64 + 32); // Keccak-256(s + compressed_mix)
	return true;
}

void ethash_quick_hash(
	ethash_h256_t* return_hash,
	ethash_h256_t const* header_hash,
	uint64_t const nonce,
	ethash_h256_t const* mix_hash
)
{
	uint8_t buf[64 + 32];
	memcpy(buf, header_hash, 32);
	fix_endian64_same(nonce);
	memcpy(&(buf[32]), &nonce, 8);
	SHA3_512(buf, buf, 40);
	memcpy(&(buf[64]), mix_hash, 32);
	SHA3_256(return_hash, buf, 64 + 32);
}

ethash_h256_t ethash_get_seedhash(uint64_t block_number)
{
	ethash_h256_t ret;
	ethash_h256_reset(&ret);
	uint64_t const epochs = block_number / ETHASH_EPOCH_LENGTH;
	for (uint32_t i = 0; i < epochs; ++i)
		SHA3_256(&ret, (uint8_t*)&ret, 32);
	return ret;
}

bool ethash_quick_check_difficulty(
	ethash_h256_t const* header_hash,
	uint64_t const nonce,
	ethash_h256_t const* mix_hash,
	ethash_h256_t const* boundary
)
{

	ethash_h256_t return_hash;
	ethash_quick_hash(&return_hash, header_hash, nonce, mix_hash);
	return ethash_check_difficulty(&return_hash, boundary);
}

ethash_light_t ethash_light_new_internal(uint64_t cache_size, ethash_h256_t const* seed)
{
	struct ethash_light *ret;
	ret = calloc(sizeof(*ret), 1);
	if (!ret) {
		return NULL;
	}
#if defined(__MIC__)
	ret->cache = _mm_malloc((size_t)cache_size, 64);
#else
	ret->cache = malloc((size_t)cache_size);
#endif
	if (!ret->cache) {
		goto fail_free_light;
	}
	node* nodes = (node*)ret->cache;
	if (!ethash_compute_cache_nodes(nodes, cache_size, seed)) {
		goto fail_free_cache_mem;
	}
	ret->cache_size = cache_size;
	return ret;

fail_free_cache_mem:
#if defined(__MIC__)
	_mm_free(ret->cache);
#else
	free(ret->cache);
#endif
fail_free_light:
	free(ret);
	return NULL;
}

ethash_light_t ethash_light_new(uint64_t block_number)
{
	ethash_h256_t seedhash = ethash_get_seedhash(block_number);
	ethash_light_t ret;
	ret = ethash_light_new_internal(ethash_get_cachesize(block_number), &seedhash);
	ret->block_number = block_number;
	return ret;
}

void ethash_light_delete(ethash_light_t light)
{
	if (light->cache) {
		free(light->cache);
	}
	free(light);
}

ethash_return_value_t ethash_light_compute_internal(
	ethash_light_t light,
	uint64_t full_size,
	ethash_h256_t const header_hash,
	uint64_t nonce
)
{
  	ethash_return_value_t ret;
	ret.success = true;
	if (!ethash_hash(&ret, NULL, light, full_size, header_hash, nonce)) {
		ret.success = false;
	}
	return ret;
}

ethash_return_value_t ethash_light_compute(
	ethash_light_t light,
	ethash_h256_t const header_hash,
	uint64_t nonce
)
{
	uint64_t full_size = ethash_get_datasize(light->block_number);
	return ethash_light_compute_internal(light, full_size, header_hash, nonce);
}

static bool ethash_mmap(struct ethash_full* ret, FILE* f)
{
	int fd;
	char* mmapped_data;
	errno = 0;
	ret->file = f;
	if ((fd = ethash_fileno(ret->file)) == -1) {
		return false;
	}
	mmapped_data = mmap(
		NULL,
		(size_t)ret->file_size + ETHASH_DAG_MAGIC_NUM_SIZE,
		PROT_READ | PROT_WRITE,
		MAP_SHARED,
		fd,
		0
	);
	if (mmapped_data == MAP_FAILED) {
		return false;
	}
	ret->data = (node*)(mmapped_data + ETHASH_DAG_MAGIC_NUM_SIZE);
	return true;
}

ethash_full_t ethash_full_new_internal(
	char const* dirname,
	ethash_h256_t const seed_hash,
	uint64_t full_size,
	ethash_light_t const light,
	ethash_callback_t callback
)
{
	struct ethash_full* ret;
	FILE *f = NULL;
	ret = calloc(sizeof(*ret), 1);
	if (!ret) {
		return NULL;
	}
	ret->file_size = (size_t)full_size;

	enum ethash_io_rc err = ethash_io_prepare(dirname, seed_hash, &f, (size_t)full_size, false);
	if (err == ETHASH_IO_FAIL)
		goto fail_free_full;

	if (err == ETHASH_IO_MEMO_SIZE_MISMATCH) {
		// if a DAG of same filename but unexpected size is found, silently force new file creation
		if (ethash_io_prepare(dirname, seed_hash, &f, (size_t)full_size, true) != ETHASH_IO_MEMO_MISMATCH) {
			ETHASH_CRITICAL("Could not recreate DAG file after finding existing DAG with unexpected size.");
			goto fail_free_full;
		}
		// we now need to go through the mismatch case, NOT the match case
		err = ETHASH_IO_MEMO_MISMATCH;
	}

	if (err == ETHASH_IO_MEMO_MISMATCH || err == ETHASH_IO_MEMO_MATCH) {
		if (!ethash_mmap(ret, f)) {
			ETHASH_CRITICAL("mmap failure()");
			goto fail_close_file;
		}

		if (err == ETHASH_IO_MEMO_MATCH) {
#if defined(__MIC__)
			node* tmp_nodes = _mm_malloc((size_t)full_size, 64);
			//copy all nodes from ret->data
			//mmapped_nodes are not aligned properly
			uint32_t const countnodes = (uint32_t) ((size_t)ret->file_size / sizeof(node));
			//fprintf(stderr,"ethash_full_new_internal:countnodes:%d",countnodes);
			for (uint32_t i = 1; i != countnodes; ++i) {
				tmp_nodes[i] = ret->data[i];
			}
			ret->data = tmp_nodes;
#endif
			return ret;
		}
	}

#if defined(__MIC__)
	ret->data = _mm_malloc((size_t)full_size, 64);
#endif
	if (!ethash_compute_full_data(ret->data, full_size, light, callback)) {
		ETHASH_CRITICAL("Failure at computing DAG data.");
		goto fail_free_full_data;
	}

	// after the DAG has been filled then we finalize it by writting the magic number at the beginning
	if (fseek(f, 0, SEEK_SET) != 0) {
		ETHASH_CRITICAL("Could not seek to DAG file start to write magic number.");
		goto fail_free_full_data;
	}
	uint64_t const magic_num = ETHASH_DAG_MAGIC_NUM;
	if (fwrite(&magic_num, ETHASH_DAG_MAGIC_NUM_SIZE, 1, f) != 1) {
		ETHASH_CRITICAL("Could not write magic number to DAG's beginning.");
		goto fail_free_full_data;
	}
	if (fflush(f) != 0) {// make sure the magic number IS there
		ETHASH_CRITICAL("Could not flush memory mapped data to DAG file. Insufficient space?");
		goto fail_free_full_data;
	}
	return ret;

fail_free_full_data:
	// could check that munmap(..) == 0 but even if it did not can't really do anything here
	munmap(ret->data, (size_t)full_size);
#if defined(__MIC__)
	_mm_free(ret->data);
#endif
fail_close_file:
	fclose(ret->file);
fail_free_full:
	free(ret);
	return NULL;
}

ethash_full_t ethash_full_new(ethash_light_t light, ethash_callback_t callback)
{
	char strbuf[256];
	if (!ethash_get_default_dirname(strbuf, 256)) {
		return NULL;
	}
	uint64_t full_size = ethash_get_datasize(light->block_number);
	ethash_h256_t seedhash = ethash_get_seedhash(light->block_number);
	return ethash_full_new_internal(strbuf, seedhash, full_size, light, callback);
}

void ethash_full_delete(ethash_full_t full)
{
	// could check that munmap(..) == 0 but even if it did not can't really do anything here
	munmap(full->data, (size_t)full->file_size);
	if (full->file) {
		fclose(full->file);
	}
	free(full);
}

ethash_return_value_t ethash_full_compute(
	ethash_full_t full,
	ethash_h256_t const header_hash,
	uint64_t nonce
)
{
	ethash_return_value_t ret;
	ret.success = true;
	if (!ethash_hash(
		&ret,
		(node const*)full->data,
		NULL,
		full->file_size,
		header_hash,
		nonce)) {
		ret.success = false;
	}
	return ret;
}

void const* ethash_full_dag(ethash_full_t full)
{
	return full->data;
}

uint64_t ethash_full_dag_size(ethash_full_t full)
{
	return full->file_size;
}

The code is now a bit complex comparing to sha_256 functions, functions are longer and they interleave assembler directive.

Remove unecessay tests et precalculated all data

Remember that a test reinitialize the pipeline of the processor because it implies a jump instruction and as a result kills the sequence of instruction. So if we use “constant” or well known values we can drop tests guard test at the start of function. Note it is not necessary to remove assert function. Indeed theses functions are only generated on debug mode, so a good practice is to use assert as replacement of if to test parameters values.
We can also specialize function (see previous post for example) to remove constant parameters. The aim is to create sha_256_32 when size is 32, sha_256_64 when size is 64 and keep a generic function with a parameter when we can not decide what the size is. The counterpart of this method is it increasing the code size, and we have three duplicate code. So the maintenance will be harder. We can do the same with ethash_hash to remove full_nodes parameter and then remove the test of full_nodes != null line 222.

for (unsigned n = 0; n != MIX_NODES; ++n) { node const* dag_node; node tmp_node; ethash_calculate_dag_item(&tmp_node, index * MIX_NODES + n, light); …. } can be changed to

        unsigned preindex = MIX_NODES * index ;
		for (unsigned n = 0; n != MIX_NODES; ++n) {
			node const* dag_node;
			dag_node = &full_nodes[preindex++];

The differnce is slighty but we replace a leal (load effective adress, which access to register, this code is called 64 times so we saved a little bit of time.

Let’s doing a performance test:

./eth -M -t 1 --benchmark-trial 15
cpp-ethereum, a C++ Ethereum client
   04:43:41 PM.112|eth  #00004000…
Benchmarking on platform: 8-thread CPU
Preparing DAG...
Warming up...
    04:43:41 PM.112|miner0  Loading full DAG of seedhash: #00000000…
    04:43:42 PM.008|miner0  Full DAG loaded
Trial 1... 99273
Trial 2... 101280
Trial 3... 102040
Trial 4... 100733
Trial 5... 101026
min/mean/max: 99273/100870/102040 H/s
inner mean: 101013 H/s

Not bad result, it is the first time we exced 100k H/s.

If you want to test you can get the V1.2 tag,

Specific optimization

Right now the code is written in “pure” C, it means that this code can be without too much effort compiled from raspberry to last end Intel processor.

The last step of optimization is now specific to the target. In this phase we will optimize the code for a specific platform, in our example we will use 128 bits register included in SSE2.0, as a counterpart the code will only work on 64 bits Intel/Amd processor.

In this project implementation had been already done with SSE registers but this cannot be able. In fact to enable this generation you have to done two things:

Tell the compiler that you want to use SSE instructions
Set ENABLE_SSE define to true line 7 in internal.h file

The question you might ask, why this option is not always set to true, and why every modern processor do not use SSE instructions by default ? The answer is not only for historical reasons, but also for performance reason which is at first sight seems contra-intuitive. Indeed if we enable SSE we have more register for us, but switching task will be longer because the processor has to save more registers, and theses registers are essentially used for intensive calculus and are useless on common computer tasks.