#include <iostream>
#include <string>
#include <vector>
#include <functional>
#include <chrono>
#include <smmintrin.h>
#include <immintrin.h>
#include <unistd.h>
#include <glm.hpp>
#include <gtx/simd_vec4.hpp>
#include <gtx/simd_mat4.hpp>
#include <gtc/type_ptr.hpp>

namespace ch = std::chrono;

const int Iter = 1 << 28;

void RunBench_GLM()
{
    glm::vec4 v(1.0f);
    glm::vec4 v2;
    glm::mat4 m(1.0f);
    for (int i = 0; i < Iter; i++) {
        v2 += m * v;
    }
    auto t = v2;
    std::cout << t.x << " " << t.y << " " << t.z << " " << t.w << std::endl;
}

void RunBench_GLM_SIMD()
{
    glm::detail::fvec4SIMD v(1.0f);
    glm::detail::fvec4SIMD v2(0.0f);
    glm::detail::fmat4x4SIMD m(1.0f);
    for (int i = 0; i < Iter; i++) {
        v2 += v * m;
    }
    auto t = glm::vec4_cast(v2);
    std::cout << t.x << " " << t.y << " " << t.z << " " << t.w << std::endl;
}

void RunBench_Double_GLM()
{
    glm::dvec4 v(1.0);
    glm::dvec4 v2;
    glm::dmat4 m(1.0);
    for (int i = 0; i < Iter; i++) {
        v2 += v * m;
    }
    auto t = v2;
    std::cout << t.x << " " << t.y << " " << t.z << " " << t.w << std::endl;
}

void RunBench_Double_AVX()
{
    __m256d v = _mm256_set_pd(1, 1, 1, 1);
    __m256d s = _mm256_setzero_pd();
    // Columns of the identity matrix
    __m256d m[4] = {
        _mm256_set_pd(1, 0, 0, 0),
        _mm256_set_pd(0, 1, 0, 0),
        _mm256_set_pd(0, 0, 1, 0),
        _mm256_set_pd(0, 0, 0, 1)
    };
    for (int i = 0; i < Iter; i++) {
        // "Broadcast" each component of v; note that _mm256_shuffle_pd only
        // shuffles within 128-bit lanes, which is harmless here because all
        // components of v are equal.
        __m256d v0 = _mm256_shuffle_pd(v, v, _MM_SHUFFLE(0, 0, 0, 0));
        __m256d v1 = _mm256_shuffle_pd(v, v, _MM_SHUFFLE(1, 1, 1, 1));
        __m256d v2 = _mm256_shuffle_pd(v, v, _MM_SHUFFLE(2, 2, 2, 2));
        __m256d v3 = _mm256_shuffle_pd(v, v, _MM_SHUFFLE(3, 3, 3, 3));
        __m256d m0 = _mm256_mul_pd(m[0], v0);
        __m256d m1 = _mm256_mul_pd(m[1], v1);
        __m256d m2 = _mm256_mul_pd(m[2], v2);
        __m256d m3 = _mm256_mul_pd(m[3], v3);
        __m256d a0 = _mm256_add_pd(m0, m1);
        __m256d a1 = _mm256_add_pd(m2, m3);
        __m256d a2 = _mm256_add_pd(a0, a1);
        s = _mm256_add_pd(s, a2);
    }
    double t[4];
    _mm256_storeu_pd(t, s);  // unaligned store: t is not guaranteed 32-byte aligned
    std::cout << t[0] << " " << t[1] << " " << t[2] << " " << t[3] << std::endl;
}

int main()
{
    std::vector<std::pair<std::string, std::function<void ()>>> benches;
    benches.push_back(std::make_pair("GLM", RunBench_GLM));
    benches.push_back(std::make_pair("GLM_SIMD", RunBench_GLM_SIMD));
    benches.push_back(std::make_pair("Double_GLM", RunBench_Double_GLM));
    benches.push_back(std::make_pair("Double_AVX", RunBench_Double_AVX));

    auto startInitial = ch::high_resolution_clock::now();
    for (int i = 0; i < 500000; i++) {
        asm("nop");
    }
    auto endInitial = ch::high_resolution_clock::now();
    double elapsedInitial = (double)ch::duration_cast<ch::milliseconds>(endInitial - startInitial).count();
    std::cout << "resolution : " << elapsedInitial << std::endl;

    for (auto& bench : benches) {
        std::cout << "Begin [ " << bench.first << " ]" << std::endl;
        auto start = ch::high_resolution_clock::now();
        bench.second();
        auto end = ch::high_resolution_clock::now();
        double elapsed = (double)ch::duration_cast<ch::milliseconds>(end - start).count() / 1000.0;
        std::cout << "End [ " << bench.first << " ] : " << elapsed << " seconds" << std::endl;
    }
    std::cin.get();
    return 0;
}
Category: Software
Quest for the ultimate timer framework
A lot of the content on this blog is about code optimization, and sometimes about very small improvements whose individual performance impact is minimal; when such code is called many times, it becomes difficult to verify that the optimization is actually useful. Let's take an example. Imagine you have a function that lasts one millisecond. You optimize this function and end up with two candidate implementations. But if you are using a timer whose resolution is 0.5 milliseconds, you won't be able to tell one from the other. The aim of this article is to help you understand timer resolution and choose the right way to measure your code.
Throughput of the algorithm
In this case, instead of timing a single call, you measure how much work is completed in a fixed interval.
Pros:
- Easy to implement
- Can cover a full run of the service, or even its whole lifetime
Cons:
- Can be awkward to set up, as you invert the timing (i.e. how many cycles did I complete in one minute, for instance)
- Initial conditions can be impossible to reproduce
- The program has to maintain this feature
Example:
- Our ethminer with the -m option.
Timing with the Linux time command
The time command is a standard Unix/Linux command-line tool. It uses the system's internal timers to report the time taken by a command.
time ls -l
real    0m0.715s
user    0m0.000s
sys     0m0.004s
You know that on this blog I like to use examples. Imagine you have a 3D application that performs a lot of complex mathematical calculations (square roots, cosines, ...). These functions are called very often (billions of times per second). As we have already seen, a small improvement in such a function can have a very strong impact on the whole program. Now suppose you know two ways to implement this calculation: in C using the standard math library, or in assembly, which is harder to write but might achieve better performance. The method I'm going to present can also be used when you have two implementations of the same feature and don't know which one to choose. You have to measure how much faster the new version is, and decide whether it is worth the pain of doing it in assembly, since that code becomes harder to maintain.
#include <math.h>
#include <immintrin.h>

inline double Calc_c(double x, double y, double z)
{
    double tmpx = sqrt(x) * cos(x) / sin(x);
    double tmpy = sqrt(y) * cos(y) / sin(y);
    double tmpz = sqrt(z) * cos(z) / sin(z);
    return (tmpx + tmpy + tmpz) * tmpx + tmpy + tmpz;
}

inline double Calc_as(double x, double y, double z)
{
    __m512d a1 = _mm512_set4_pd(x, y, z, 0.0);
    /* ... rest of the vectorized implementation ... */
}

We know that the assembler version will be faster, but by how much?
Create your personal Web hosting 1
As some of my personal/professional web hosting contracts are up for renewal, I was thinking of moving them to my Olimex A20.
Reasons why I think the A20 can be a solution:
- low power consumption, so I can leave it on all the time
- my sites have a small audience, so performance is not a bottleneck (I'm not sure shared hosting performs better anyway)
- I can control exactly which services are started; it can be very formative and challenging
- I spend my money on something I can reuse later, instead of renting
- for the price of one year of web hosting I can buy a new A20
- it's challenging and I might learn a lot of things
The other side of the coin
- Very difficult to configure (you have to set up a lot of things: DNS, Apache, opening your Internet connection)
- You are now responsible for the security of your system (you're opening it to the Internet)
Starting from a fresh installation
To decide, I first need an idea of the performance of a stock Apache installation, and I will remove everything unused (X-Window, for instance).
Removing unnecessary modules and services
I started from a fresh installation and configured the network as I explained in a previous article.
I removed the unnecessary X-Window libraries: the box will start faster and I keep more room for data on the SD card.
Removing unnecessary modules
apt-get remove --auto-remove --purge libx11-.*
apt-get autoremove --purge
Before:
/dev/root 3808912 2634612 980816 73% /
After that:
free -m
             total       used       free     shared    buffers     cached
Mem:           874         63        811          0          2         34
-/+ buffers/cache:          26        848
Swap:            0          0          0
I know I can save more memory by switching to single-user mode and allowing only one terminal, but I will do that in a later version.
You can also remove unnecessary modules:
lsmod
Module                  Size  Used by
cpufreq_powersave       1207  0
cpufreq_userspace       3318  0
cpufreq_conservative    6042  0
cpufreq_stats           3699  0
g_ether                55821  0
pwm_sunxi               9255  0
gt2005                 13408  0
nand                  114172  0
sun4i_keyboard          2150  0
ledtrig_heartbeat       1370  0
leds_sunxi              3733  0
led_class               3539  1 leds_sunxi
sunxi_emac             34009  0
sunxi_gmac             29505  0
8192cu                454131  0
I can also remove the modules that handle video memory:
rmmod sun4i_csi0 videobuf_dma_contig videobuf_core ump lcd sunxi_cedar_mod gt2005
After that:
free -m
total used free shared buffers cached
Mem: 874 62 812 0 2 34
-/+ buffers/cache: 25 849
Swap: 0 0 0
Raw Performance
Apache is installed by default; to test it, run:
wget -O- http://192.168.1.171/
It will fetch index.html from /var/www.
As a first step, I try to find out how many requests this Apache can handle. From my laptop I use a very useful command named ab (Apache Bench) that issues concurrent requests and reports the time taken and the number of requests handled. From my console I type:
ab -n 4000 -c 100 -g test_data_1.txt http://192.168.1.71/
where 4000 is the number of requests, 100 the number of concurrent clients, and test_data_1.txt the gnuplot output file; the last argument is the URL you want to test.
Server Software:        Apache/2.2.22
Document Length:        999 bytes
Concurrency Level:      100
Time taken for tests:   3.426 seconds
Complete requests:      4000
Failed requests:        0
Total transferred:      5104000 bytes
HTML transferred:       3996000 bytes
Requests per second:    1167.60 [#/sec] (mean)
Time per request:       85.646 [ms] (mean)
Time per request:       0.856 [ms] (mean, across all concurrent requests)
Transfer rate:          1454.94 [Kbytes/sec] received
To test a more realistic example, I ran on the box:
cd /var/www
wget -r http://olinuxino.4pro-web.com/
From my laptop
ab -n 400 -c 100 -g test_data_1.txt http://192.168.1.71/olinuxino.4pro-web.com/index.html
Document Path:          /olinuxino.4pro-web.com/index.html
Document Length:        48352 bytes
Concurrency Level:      100
Time taken for tests:   1.756 seconds
Complete requests:      400
Failed requests:        0
Total transferred:      19452800 bytes
HTML transferred:       19340800 bytes
Requests per second:    227.79 [#/sec] (mean)
Time per request:       438.995 [ms] (mean)
Time per request:       4.390 [ms] (mean, across all concurrent requests)
Transfer rate:          10818.39 [Kbytes/sec] received

It looks like the A20 can support the number of requests I expect. In the next article I will test with a running WordPress site, and I will explain how to configure and optimize Apache and put your website on the Internet.
Tricks to know
1) How to use X-Window outside the Olinuxino
Connect to the Olinuxino with X forwarding enabled, for instance:
ssh -X root@192.168.1.254
Now you can launch any X application directly on the box and have it display on your machine.
For instance you can do:
synaptic
Olinuxino goes to hadoop
One thing I would like to test is how the Olinuxino behaves when you use it in a Hadoop environment.
At first sight it might not be a very good fit. Indeed, it does not have much storage capacity, and you cannot treat an SD card like a normal filesystem.
The solution I propose to test is to use a laptop (or another machine) as the master (and HDFS filesystem). I would like to know whether the filesystem or the network is the bottleneck.
Step 1: Install a Hadoop server
In my case it is my laptop, and I use a 30 GB partition on my disk. The IP I will use for my server is 192.168.1.7.
This step installs a « normal » Hadoop distribution on a PC that will be used as the master node.
To simplify things, the best approach is to add aliases in /etc/hosts.
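For example (the alias names below are assumptions; the addresses are the ones used elsewhere in this article, adapt them to your network):

```
192.168.1.7      master        # the laptop acting as Hadoop master
192.168.1.254    olinuxino     # the A20 board
```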
Step 2: setting up the board
Download the standard image for the Olinuxino. It can be found here: https://www.olimex.com/wiki/images/2/29/Debian_FS_34_90_camera_A20-olimex.torrent taken from the official Olinuxino github: (https://github.com/OLIMEX/OLINUXINO/tree/master/SOFTWARE/A20/A20-build).
The first problem is that the network is not enabled by default. Edit the file /etc/network/interfaces and add:
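A minimal DHCP stanza would look like this (an assumption on my part; the original post does not show the exact lines):

```
auto eth0
iface eth0 inet dhcp
```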
Then bring the interface up (for instance with ifup eth0).
Get the board's inet address by typing ifconfig.
Edit /etc/ssh/sshd_config and make sure the lines « PermitRootLogin yes » and « StrictModes no » are set to these values.
/etc/init.d/ssh restart
To test that everything is correctly set up, go to your server computer and type (the default password for root is olimex):
ssh 192.168.1.254 -l root
Also try to connect to your server (in my case the address is 192.168.1.75 and I created an account named local):
ssh 192.168.1.75 -l local
Step 1: Adding a User
Step 2: installing hadoop
sudo aptitude install openjdk-7-jre
wget http://apache.mirrors.ovh.net/ftp.apache.org/dist/hadoop/common/hadoop-2.7.0/hadoop-2.7.0.tar.gz
tar zxvf hadoop-2.7.0.tar.gz
mv hadoop-2.7.0 hadoop
Now we have to modify the configuration files to access the master node and HDFS.
Edit hadoop-env.sh
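At a minimum, point JAVA_HOME at the JRE installed above. The exact path is an assumption for an armhf Debian; check what exists under /usr/lib/jvm on your system:

```
# path for OpenJDK 7 on Debian armhf; adjust to your system
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-armhf
```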
Edit the file core-site.xml and add:
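A sketch of what goes inside core-site.xml; the master alias and port 9000 are assumptions, adapt them to your setup:

```xml
<configuration>
  <property>
    <!-- fs.defaultFS on Hadoop 2.x (fs.default.name is the deprecated name) -->
    <name>fs.defaultFS</name>
    <value>hdfs://master:9000</value>
  </property>
</configuration>
```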
Now you can test if everything works:
hadoop fs -ls
Step 3: installing mahout
One easy way to test Hadoop is to install Mahout. This project includes several Hadoop jobs to classify data, so we will use it for test purposes.
Download it from apache:
Now everything should be correctly set.
Step 4: Benchmarking
Go into hadoop/mahout
and run:
sudo apt-get install curl
examples/bin/classify-wikipedia.sh
Now, to benchmark, run:
time examples/bin/classify-wikipedia.sh