Category: Software

std::chrono, a high-resolution timer?

#include <iostream>
#include <string>
#include <vector>
#include <functional>
#include <chrono>
#include <smmintrin.h>
#include <unistd.h>
#include <glm.hpp>
#include <gtx/simd_vec4.hpp>
#include <gtx/simd_mat4.hpp>
#include <gtc/type_ptr.hpp>
#include <immintrin.h>

namespace ch = std::chrono;

const int Iter = 1<<28;

void RunBench_GLM()
{
	glm::vec4 v(1.0f);
	glm::vec4 v2(0.0f);	// zero the accumulator explicitly; recent GLM versions do not zero-initialize by default
	glm::mat4 m(1.0f);
	for (int i = 0; i < Iter; i++)
		v2 += m * v;

	auto t = v2;
	std::cout << t.x << " " << t.y << " " << t.z << " " << t.w << std::endl;
}

void RunBench_GLM_SIMD()
{
	glm::detail::fvec4SIMD v(1.0f);
	glm::detail::fvec4SIMD v2(0.0f);
	glm::detail::fmat4x4SIMD m(1.0f);

	for (int i = 0; i < Iter; i++)
		v2 += v * m;

	auto t = glm::vec4_cast(v2);
	std::cout << t.x << " " << t.y << " " << t.z << " " << t.w << std::endl;
}

void RunBench_Double_GLM()
{
	glm::dvec4 v(1.0);
	glm::dvec4 v2(0.0);	// zero the accumulator explicitly; recent GLM versions do not zero-initialize by default
	glm::dmat4 m(1.0);

	for (int i = 0; i < Iter; i++)
		v2 += v * m;

	auto t = v2;
	std::cout << t.x << " " << t.y << " " << t.z << " " << t.w << std::endl;
}

void RunBench_Double_AVX()
{
	__m256d v = _mm256_set_pd(1, 1, 1, 1);
	__m256d s = _mm256_setzero_pd();
	__m256d m[4] =
	{
		_mm256_set_pd(1, 0, 0, 0),
		_mm256_set_pd(0, 1, 0, 0),
		_mm256_set_pd(0, 0, 1, 0),
		_mm256_set_pd(0, 0, 0, 1)
	};

	for (int i = 0; i < Iter; i++)
	{
		// Broadcast each component of v to all four lanes.
		// _mm256_shuffle_pd cannot cross the 128-bit halves, so duplicate each half first.
		__m256d lo = _mm256_permute2f128_pd(v, v, 0x00);	// [v0, v1, v0, v1]
		__m256d hi = _mm256_permute2f128_pd(v, v, 0x11);	// [v2, v3, v2, v3]
		__m256d v0 = _mm256_permute_pd(lo, 0x0);
		__m256d v1 = _mm256_permute_pd(lo, 0xF);
		__m256d v2 = _mm256_permute_pd(hi, 0x0);
		__m256d v3 = _mm256_permute_pd(hi, 0xF);

		__m256d m0 = _mm256_mul_pd(m[0], v0);
		__m256d m1 = _mm256_mul_pd(m[1], v1);
		__m256d m2 = _mm256_mul_pd(m[2], v2);
		__m256d m3 = _mm256_mul_pd(m[3], v3);

		__m256d a0 = _mm256_add_pd(m0, m1);
		__m256d a1 = _mm256_add_pd(m2, m3);
		__m256d a2 = _mm256_add_pd(a0, a1);

		s = _mm256_add_pd(s, a2);
	}

	double t[4];
	_mm256_storeu_pd(t, s);	// unaligned store: t is not guaranteed to be 32-byte aligned
	std::cout << t[0] << " " << t[1] << " " << t[2] << " " << t[3] << std::endl;
}

int main()
{
	std::vector<std::pair<std::string, std::function<void ()>>> benches;
	benches.push_back(std::make_pair("GLM", RunBench_GLM));
	benches.push_back(std::make_pair("GLM_SIMD", RunBench_GLM_SIMD));
	benches.push_back(std::make_pair("Double_GLM", RunBench_Double_GLM));
	benches.push_back(std::make_pair("Double_AVX", RunBench_Double_AVX));

	// Rough cost of the clock itself: time 500000 consecutive calls to now().
	auto startInitial = ch::high_resolution_clock::now();
	auto endInitial = startInitial;
	for (int i = 0; i < 500000; i++)
	{
		endInitial = ch::high_resolution_clock::now();
	}

	double elapsedInitial = (double)ch::duration_cast<ch::milliseconds>(endInitial - startInitial).count();
	std::cout << "resolution :" << elapsedInitial << std::endl;
	for (auto& bench : benches)
	{
		std::cout << "Begin [ " << bench.first << " ]" << std::endl;

		auto start = ch::high_resolution_clock::now();
		bench.second();	// run the benchmark between the two timestamps
		auto end = ch::high_resolution_clock::now();

		double elapsed = (double)ch::duration_cast<ch::milliseconds>(end - start).count() / 1000.0;
		std::cout << "End [ " << bench.first << " ] : " << elapsed << " seconds" << std::endl;
	}
	return 0;
}

Quest for the ultimate timer framework

Many posts on this blog deal with code optimization, sometimes with very small improvements whose individual impact on performance is minimal. When the code in question is called many times those small gains add up, but it becomes difficult to verify that an optimization is actually useful. Let’s take an example. Imagine you have a function that lasts one millisecond. While optimizing it, you find two possible solutions. If your timer has a resolution of only 0.5 milliseconds, you won’t be able to tell which of the two is better. The aim of this article is to help you to understand ???
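To make the resolution problem concrete, here is a minimal sketch that estimates the smallest step a clock can be observed to make. The helper name smallest_observable_tick_ns is hypothetical and not part of any library. It uses std::chrono::steady_clock because that clock is guaranteed to be monotonic; std::chrono::high_resolution_clock could be substituted.

```cpp
#include <chrono>

// Hypothetical helper (an assumption for illustration): estimates the smallest
// step that steady_clock can be observed to make, in nanoseconds.
inline long long smallest_observable_tick_ns(int samples = 1000)
{
    using clock = std::chrono::steady_clock;
    long long best = -1;
    for (int i = 0; i < samples; ++i)
    {
        const auto t0 = clock::now();
        auto t1 = clock::now();
        while (t1 == t0)          // spin until the clock value changes
            t1 = clock::now();
        const long long delta =
            std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count();
        if (best < 0 || delta < best)
            best = delta;         // keep the smallest step seen so far
    }
    return best;
}
```

Calling smallest_observable_tick_ns() from main and printing the result gives a rough lower bound on what the timer can distinguish. Any difference between two implementations that is smaller than this value cannot be measured with a single call and needs many repetitions.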

Throughput of the algorithm

In this case


Pros:

  • Easy to implement
  • Can cover a full range of service, or the lifetime

Cons:

  • Can be difficult to implement, as you invert the timing (i.e. how many cycles were completed in one minute, for instance)
  • Initial conditions can be impossible to reproduce
  • The program must maintain this feature

Example:

  • Our ethminer with the -m option.
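The inverted timing described above (counting how much work is done in a fixed time window) can be sketched as follows. The workload work_step and the helper names are hypothetical placeholders, not code from ethminer or any other project.

```cpp
#include <chrono>
#include <cstdint>

// Hypothetical workload standing in for the algorithm under test.
inline double work_step(double x)
{
    return x * 0.5 + 1.0;
}

// Counts how many work_step() calls complete within `window`.
// The clock is only checked once per batch of 1000 calls so that
// reading the time does not dominate the measurement.
inline std::uint64_t count_iterations(std::chrono::milliseconds window)
{
    using clock = std::chrono::steady_clock;
    const auto deadline = clock::now() + window;
    volatile double sink = 0.0;   // keeps the compiler from removing the work
    std::uint64_t done = 0;
    while (clock::now() < deadline)
    {
        for (int i = 0; i < 1000; ++i)
            sink = work_step(sink);
        done += 1000;
    }
    return done;
}

// Converts the count into a throughput figure (iterations per second).
inline double iterations_per_second(std::chrono::milliseconds window)
{
    return count_iterations(window) * 1000.0 / window.count();
}
```

For instance, iterations_per_second(std::chrono::milliseconds(500)) reports the throughput over half a second. The batch size of 1000 is an arbitrary choice and should be tuned to the cost of the real workload.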

Timing with the Linux time command

The time command is a standard Unix/Linux command-line tool. It uses the system timers to report the time taken by the command it runs.

time ls -l 

real	0m0.715s
user	0m0.000s
sys	0m0.004s
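The same distinction between elapsed time and CPU time can be reproduced inside a program. The sketch below is only an approximation of what time reports: wall_seconds is comparable to "real", and cpu_seconds, taken from std::clock, is comparable to "user" plus "sys" on Linux. The Timing struct and the time_call helper are hypothetical names introduced for illustration.

```cpp
#include <chrono>
#include <ctime>

// Hypothetical result type for this sketch.
struct Timing
{
    double wall_seconds; // elapsed time, comparable to "real"
    double cpu_seconds;  // processor time, comparable to "user" + "sys"
};

// Runs fn once and measures both wall-clock and processor time.
template <typename Fn>
Timing time_call(Fn&& fn)
{
    const auto wall_start = std::chrono::steady_clock::now();
    const std::clock_t cpu_start = std::clock();
    fn();
    const std::clock_t cpu_end = std::clock();
    const auto wall_end = std::chrono::steady_clock::now();

    Timing t;
    t.wall_seconds = std::chrono::duration<double>(wall_end - wall_start).count();
    t.cpu_seconds = static_cast<double>(cpu_end - cpu_start) / CLOCKS_PER_SEC;
    return t;
}
```

A large gap between the two values, as in the ls -l output above, usually means the process spent most of its time waiting (disk, network, terminal) rather than computing.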

As you know, I like to use examples on this blog. Imagine you have a 3D application that performs a lot of complex mathematical calculations (square roots, cosines, ...). You know that these functions are called very often (billions of times per second). As we have already seen, a small improvement in such a function can have a very strong impact on the whole program. Now suppose you know two ways to implement the calculation: in C using the standard math library, or in assembler, which is more complex to write but might achieve better performance. The method I am going to present can also be used whenever you have two implementations of the same feature and don’t know which one to choose. The question is how much faster the new version is, and whether it is worth the pain of writing it in assembler, given that the code becomes harder to maintain.

#include <math.h>
#include <immintrin.h>

// Reference version using the standard math library.
inline double Calc_c(double x, double y, double z) {
    double tmpx = sqrt(x) * cos(x) / sin(x);
    double tmpy = sqrt(y) * cos(y) / sin(y);
    double tmpz = sqrt(z) * cos(z) / sin(z);
    return (tmpx + tmpy + tmpz) * tmpx + tmpy + tmpz;
}

inline double Calc_as(double x, double y, double z) {

    __m512d a1 = _mm512_set4_pd(x, y, z, 0.0);

We expect the assembler version to be faster, but by how much?
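One way to answer that question is to time both versions over many calls and look at the ratio. The sketch below assumes a stand-in for the optimized version: calc_candidate simply replaces cos/sin by 1/tan and is not the assembler implementation. calc_reference follows the formula of Calc_c. All helper names are hypothetical.

```cpp
#include <chrono>
#include <cmath>

// Baseline: same formula as Calc_c.
inline double calc_reference(double x, double y, double z)
{
    const double tx = std::sqrt(x) * std::cos(x) / std::sin(x);
    const double ty = std::sqrt(y) * std::cos(y) / std::sin(y);
    const double tz = std::sqrt(z) * std::cos(z) / std::sin(z);
    return (tx + ty + tz) * tx + ty + tz;
}

// Hypothetical alternative used only as a stand-in for an optimized version.
inline double calc_candidate(double x, double y, double z)
{
    const double tx = std::sqrt(x) / std::tan(x);
    const double ty = std::sqrt(y) / std::tan(y);
    const double tz = std::sqrt(z) / std::tan(z);
    return (tx + ty + tz) * tx + ty + tz;
}

// Total time in seconds for `iterations` calls of fn.
template <typename Fn>
double seconds_for(Fn fn, int iterations)
{
    volatile double sink = 0.0;   // keeps the calls from being optimized away
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i)
    {
        const double x = 1.0 + (i % 7) * 0.1;   // vary the inputs a little
        sink = sink + fn(x, x + 0.1, x + 0.2);
    }
    const auto end = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(end - start).count();
}

// Ratio above 1 means the candidate is faster than the reference.
template <typename A, typename B>
double speedup(A reference, B candidate, int iterations)
{
    return seconds_for(reference, iterations) / seconds_for(candidate, iterations);
}
```

A call such as speedup(calc_reference, calc_candidate, 10000000) gives one sample of the ratio. Running it several times and comparing the samples gives an idea of the noise, which matters when the expected gain is small.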


Create your personal Web hosting 1

As some of my personal and professional web hosting plans are up for renewal, I was thinking of moving them to my Olimex A20.

Reasons why I think the A20 can be a solution:

  • Low power consumption, so I can leave it on
  • My sites have a small audience, so performance is not a bottleneck (I’m not sure shared hosting performs any better)
  • I can control exactly which services are started, which can be very instructive and challenging
  • I spend my money on something I can use later instead of renting something
  • For the price of one year of web hosting I can buy a new A20
  • It’s challenging and I might learn a lot of things

The other side of the coin

  • Very difficult to configure (you have to set up a lot of things: DNS, Apache, opening your internet connection)
  • You are now responsible for the security of your system (you’re opening your system to the Internet)

Starting from fresh installation

To decide, I first need an idea of the performance of an installed Apache, so I will remove everything that is unused (X Window, for instance).
Removing unnecessary modules and services
I started from a fresh installation and configured the network as I explained in a previous article.
I removed the unnecessary X Window libraries, so the box will start faster and I can keep data on the SD card.

Removing unnecessary modules

apt-get remove --auto-remove --purge libx11-.*
apt-get autoremove --purge


 /dev/root 3808912 2634612 980816 73% /

After that:

free -m
total used free shared buffers cached
Mem: 874 63 811 0 2 34
-/+ buffers/cache: 26 848
Swap: 0 0 0

I know I can free more memory by switching to single-user mode and allowing only one terminal, but I will do that in a later version.

You can also remove unnecessary modules:


Module Size Used by
cpufreq_powersave 1207 0
cpufreq_userspace 3318 0
cpufreq_conservative 6042 0
cpufreq_stats 3699 0
g_ether 55821 0
pwm_sunxi 9255 0
gt2005 13408 0
nand 114172 0
sun4i_keyboard 2150 0
ledtrig_heartbeat 1370 0
leds_sunxi 3733 0
led_class 3539 1 leds_sunxi
sunxi_emac 34009 0
sunxi_gmac 29505 0
8192cu 454131 0

free -m
total used free shared buffers cached
Mem: 874 63 811 0 2 34
-/+ buffers/cache: 26 848
Swap: 0 0 0

I can remove the modules that handle video memory:

rmmod sun4i_csi0 videobuf_dma_contig videobuf_core ump lcd sunxi_cedar_mod gt2005

After that:

free -m

total used free shared buffers cached
Mem: 874 62 812 0 2 34
-/+ buffers/cache: 25 849
Swap: 0 0 0

Raw Performance

Apache is installed by default. To test it, run

wget -O-

It will fetch index.html from /var/www.

As a first step, I want to know how many requests this Apache can handle. From my laptop I use a very useful command named ab, which sends concurrent requests and reports the time taken and the number of requests handled. From my console I type


ab -n 4000 -c 100 -g test_data_1.txt

Here 4000 is the number of requests to send, 100 is the number of concurrent clients, test_data_1.txt is the gnuplot output file, and the last argument is the URL you want to test.

Server Software:        Apache/2.2.22
Document Length:        999 bytes
Concurrency Level:      100
Time taken for tests:   3.426 seconds
Complete requests:      4000
Failed requests:        0
Total transferred:      5104000 bytes
HTML transferred:       3996000 bytes
Requests per second:    1167.60 [#/sec] (mean)
Time per request:       85.646 [ms] (mean)
Time per request:       0.856 [ms] (mean, across all concurrent requests)
Transfer rate:          1454.94 [Kbytes/sec] received
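The figures in this report are related by simple arithmetic, which makes a handy sanity check. The worked example below recomputes them from the rounded values printed above. It assumes ab works from unrounded timings internally, so the last digits differ slightly from the report.

```latex
\begin{align*}
\text{requests per second} &= \frac{4000}{3.426\ \text{s}} \approx 1167.5\ \text{requests/s} \\
\text{time per request (mean)} &= \frac{100 \times 3.426\ \text{s}}{4000} \approx 85.65\ \text{ms} \\
\text{time per request (across all concurrent requests)} &= \frac{3.426\ \text{s}}{4000} \approx 0.86\ \text{ms} \\
\text{transfer rate} &= \frac{5104000\ \text{bytes}}{1024 \times 3.426\ \text{s}} \approx 1454.9\ \text{Kbytes/s}
\end{align*}
```

The first time per request is the latency seen by one client when 100 requests are in flight at once. The second is the average time the server spends per request, which is simply the inverse of the requests per second.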

To test with a more realistic example, I ran the following on the box

cd /var/www
wget -r -O

From my laptop

ab -n 400 -c 100 -g test_data_1.txt


Document Path:          /
Document Length:        48352 bytes
Concurrency Level:      100
Time taken for tests:   1.756 seconds
Complete requests:      400
Failed requests:        0
Total transferred:      19452800 bytes
HTML transferred:       19340800 bytes
Requests per second:    227.79 [#/sec] (mean)
Time per request:       438.995 [ms] (mean)
Time per request:       4.390 [ms] (mean, across all concurrent requests)
Transfer rate:          10818.39 [Kbytes/sec] received
It looks like the A20 can support the number of requests I expect.
In the next article I will test with a running WordPress site, and I will explain how to configure and optimize Apache and put your website on the Internet.

Tricks to know

1) How to use X Window outside the Olinuxino

Connect to the Olinuxino with

ssh -X root@ip_box

After connecting, run

xhost +

Now you can launch any X application directly on the box

For instance you can do:


Olinuxino goes to hadoop

One thing I would like to test is how the Olinuxino behaves when it is used in a Hadoop environment.

At first sight it might not be a very good solution. Indeed, it doesn’t have much storage capacity, and you cannot think of using a SmartMedia card as a normal filesystem.

The solution I propose to test is to use a laptop (or another machine) as the master (and HDFS filesystem). I would like to know whether the bottleneck is the filesystem or the network.

Step 1: Install a Hadoop server

In my case it is my laptop, and I use a 30 GB partition on my sd disk. The IP I will use for my server is 

This step will install a "normal" Hadoop distribution on a PC and will be used as

To simplify, the best thing is to add aliases in /etc/hosts

Step 2: setting up the board

Download the standard image for Olinuxino. It can be found here, taken from the official olinuxino github: (

The first problem is that networking is not enabled by default. Edit the file /etc/network/interfaces and add

auto eth0
allow-hotplug eth0
iface eth0 inet dhcp

Then type:

sudo dhclient eth0
/etc/init.d/networking restart

Get the IP address of the board by typing

    eth0      Link encap:Ethernet  HWaddr 02:cf:07:01:5a:b7

    inet addr:  Bcast:  Mask:
    sudo apt-get update
    sudo apt-get upgrade
    sudo apt-get install ssh

Edit /etc/ssh/sshd_config

and ensure that the lines "PermitRootLogin yes" and "StrictModes no" have these values

/etc/init.d/ssh restart

To test that everything is set up correctly, go to your server computer and type (the default password for root is olimex)

ssh -l root

Also try to connect to your server (in my case the address is and I created an account named local)

ssh -l local

Step 1: Adding a User

sudo addgroup hadoop_group
sudo adduser --ingroup hadoop_group hduser1
sudo adduser hduser1 sudo
su - hduser1
vi ~/.bashrc
# add the following lines
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-armhf/jre/
export HADOOP_HOME=/home/hduser1/hadoop
export MAHOUT_HOME=/home/hduser1/hadoop/mahout

ssh-keygen -t rsa -P ""
cat $HOME/.ssh/ >> $HOME/.ssh/authorized_keys

Step 2: installing hadoop


sudo aptitude install openjdk-7-jre


tar zxvf hadoop-2.7.0.tar.gz

mv hadoop-2.7.0/hadoop hadoop

Now we have to modify the configuration files to access the master node and HDFS


set line export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-armhf/jre

edit the file core-site.xml
and add

    <description>A base for other temporary directories.</description>

    <description>The name of the default file system. A URI whose
    scheme and authority determine the FileSystem implementation. The
    uri's scheme determines the config property (fs.SCHEME.impl) naming
    the FileSystem implementation class. The uri's authority is used to
    determine the host, port, etc. for a filesystem.</description>

Now you can test if everything works:
hadoop fs -ls

Step 3: installing mahout

One easy way to test Hadoop is to install Mahout. This project includes several Hadoop jobs to classify data, so we will use it for testing purposes.

Download it from apache:

cd hadoop
tar zxvf mahout-distribution-0.10.0.tar.gz
mv mahout-distribution-0.10.0 mahout

Now everything should be correctly set.

Step 4: Benchmarking

Go into hadoop/mahout

do: sudo apt-get install curl examples/bin/

Now to bench do

time examples/bin/