Monthly Archive: January 2018

SSE usage Tutorial

Hi all,

SSE usage is a bit tricky.

You have to see the registers as vectorial rather than linear, and the element size depends on the context.

For instance xmm0 (a 128-bit SSE register) can be seen as 2 x 64-bit values, 4 x 32-bit values, 8 x 16-bit values or 16 x 8-bit values.
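
To make this concrete, here is a small illustrative sketch (not part of the original tutorial) that reinterprets the same 16 bytes at the different element widths a 128-bit register supports:

#include <stdint.h>
#include <stdio.h>

/* Illustrative only: one 128-bit chunk of memory viewed at different
 * element widths, similar to how an XMM register is reinterpreted
 * depending on the instruction that reads it. */
typedef union {
    uint64_t u64[2];   /* 2 x 64-bit  */
    uint32_t u32[4];   /* 4 x 32-bit  */
    uint16_t u16[8];   /* 8 x 16-bit  */
    uint8_t  u8[16];   /* 16 x 8-bit  */
} vec128;

int main(void) {
    vec128 v = { .u32 = { 1, 2, 3, 4 } };
    printf("as 4 x u32: %u %u %u %u\n", v.u32[0], v.u32[1], v.u32[2], v.u32[3]);
    printf("as 2 x u64: %llx %llx\n",
           (unsigned long long)v.u64[0], (unsigned long long)v.u64[1]);
    return 0;
}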

So what is the aim of this?

If you want to add two arrays, the algorithm will be:

int a[4]; // sizeof(int) is 4 bytes, i.e. 32 bits
int b[4];
int c[4];

c[0]=a[0]+b[0];
c[1]=a[1]+b[1];
c[2]=a[2]+b[2];
c[3]=a[3]+b[3];

In assembly it gives:

       .loc 1 6 0
        mov     edx, DWORD PTR [rbp-48]
        mov     eax, DWORD PTR [rbp-32]
        add     eax, edx
        mov     DWORD PTR [rbp-16], eax
        .loc 1 7 0
        mov     edx, DWORD PTR [rbp-44]
        mov     eax, DWORD PTR [rbp-28]
        add     eax, edx
        mov     DWORD PTR [rbp-12], eax
        .loc 1 8 0
        mov     edx, DWORD PTR [rbp-40]
        mov     eax, DWORD PTR [rbp-24]
        add     eax, edx
        mov     DWORD PTR [rbp-8], eax
        .loc 1 9 0
        mov     edx, DWORD PTR [rbp-36]
        mov     eax, DWORD PTR [rbp-20]
        add     eax, edx
        mov     DWORD PTR [rbp-4], eax
        mov     eax, 0

For instance c[0] = a[0]+b[0] is generated like this:

       .loc 1 6 0
        mov     edx, DWORD PTR [rbp-48] ; a[0]
        mov     eax, DWORD PTR [rbp-32] ; b[0]
        add     eax, edx ;eax <- a[0]+b[0]
        mov     DWORD PTR [rbp-16], eax ; c[0] = eax

And we are doing that 4 times. But thanks to the SSE extensions we can do it with fewer instructions.

The Streaming SIMD Extensions enhance the x86 architecture in four ways:

  1. 8 new 128-bit SIMD floating-point registers that can be directly addressed;
  2. 50 new instructions that work on packed floating-point data;
  3. 8 new instructions designed to control cacheability of all MMX and 32-bit x86 data types, including the ability to stream data to memory without polluting the caches, and to prefetch data before it is actually used;
  4. 12 new instructions that extend the instruction set.

This set enables the programmer to develop algorithms that mix packed single-precision floating-point and integer data, using SSE and MMX instructions respectively.

Intel SSE provides eight 128-bit general-purpose registers, each of which can be directly addressed using the register names XMM0 to XMM7. Each register consists of four 32-bit single precision, floating-point numbers, numbered 0 through 3.

SSE instructions operate on either all or the least significant pairs of packed data operands in parallel. The packed instructions (with PS suffix) operate on a pair of operands, while scalar instructions (with SS suffix) always operate on the least significant pair of the two operands; for scalar operations, the three upper components from the first operand are passed through to the destination.

There are two ways to use SSE registers:

Scalar: the instruction operates only on the least significant element of the register.

Packed: the same instruction is applied to all 4 elements in parallel.

(thanks to Stefano Tommesani)
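
To see the difference, here is a small illustrative sketch (not from the original post) using the single-precision SSE intrinsics: _mm_add_ps is the packed form (ADDPS) and _mm_add_ss the scalar form (ADDSS).

#include <stdio.h>
#include <xmmintrin.h>

int main(void) {
    /* _mm_set_ps takes elements in reverse order, so element 0 is 1.0f */
    __m128 a = _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f);
    __m128 b = _mm_set_ps(40.0f, 30.0f, 20.0f, 10.0f);

    __m128 packed = _mm_add_ps(a, b);  /* ADDPS: all four lanes added in parallel */
    __m128 scalar = _mm_add_ss(a, b);  /* ADDSS: only lane 0 added, lanes 1..3 copied from a */

    float p[4], s[4];
    _mm_storeu_ps(p, packed);
    _mm_storeu_ps(s, scalar);
    printf("packed: %g %g %g %g\n", p[0], p[1], p[2], p[3]);  /* 11 22 33 44 */
    printf("scalar: %g %g %g %g\n", s[0], s[1], s[2], s[3]);  /* 11  2  3  4 */
    return 0;
}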

So let’s return to our code. I think you can see where I am going. We fill two registers with 4 values each, (a[0]..a[3]) in one register and (b[0]..b[3]) in the other, add them together and put the result in a third register. With this solution we perform only one addition instruction.

#include <stdio.h>
#include <stdint.h>
#include <emmintrin.h> // SSE2 intrinsics

void p128_hex_u8(__m128i in) {
    uint8_t v[16] __attribute__((aligned (16)));
    _mm_store_si128((__m128i*)v, in);
    printf("v16_u8: %x %x %x %x | %x %x %x %x | %x %x %x %x | %x %x %x %x\n",
           v[0], v[1],  v[2],  v[3],  v[4],  v[5],  v[6],  v[7],
           v[8], v[9], v[10], v[11], v[12], v[13], v[14], v[15]);
}
 
void p128_hex_u16(__m128i in) {
    uint16_t v[8] __attribute__((aligned (16)));
    _mm_store_si128((__m128i*)v, in);
    printf("v8_u16: %x %x %x %x,  %x %x %x %x\n", v[0], v[1], v[2], v[3], v[4], v[5], v[6], v[7]);
}
 
void p128_hex_u32(__m128i in) {
    uint32_t v[4] __attribute__((aligned (16)));
    _mm_store_si128((__m128i*)v, in);
    printf("v4_u32: %x %x %x %xn", v[0], v[1], v[2], v[3]);
}
 
void p128_dec_u32(__m128i in) {
    uint32_t v[4] __attribute__((aligned (16)));
    _mm_store_si128((__m128i*)v, in);
    printf("v4_u32: %d %d %d %dn",(uint32_t) v[0], (uint32_t) v[1], (uint32_t)v[2],(uint32_t) v[3]);
}

void p128_hex_u64(__m128i in) {
    long long v[2] __attribute__((aligned (16)));  // uint64_t may trigger format-string warnings with %llx; it is plain long in some ABIs
    _mm_store_si128((__m128i*)v, in);
    printf("v2_u64: %llx %llx\n", v[0], v[1]);
}
 
 
int main(){
uint32_t a[4] = {1,2,3,4}; // each element is 32 bits wide
uint32_t b[4] = {11,12,13,14};
uint32_t c[4];
 
c[0]=a[0]+b[0];
c[1]=a[1]+b[1];
c[2]=a[2]+b[2];
c[3]=a[3]+b[3];
printf("Result %u %u %u %u\n", c[0], c[1], c[2], c[3]);
 
     __m128i a1 = _mm_set_epi32(a[3], a[2], a[1], a[0]);
     __m128i b1 = _mm_set_epi32(b[3], b[2], b[1], b[0]);
     __m128i c1 = _mm_add_epi32(a1, b1);
     p128_dec_u32(a1);
     p128_dec_u32(b1);
     p128_dec_u32(c1);
}

This is a very simple example; in fact your compiler can often already auto-vectorize code like this for you.

Ethminer Optimization part 2

In the previous article we started optimizing the sha* functions; as we saw previously, a large performance bottleneck is in internal.c.

/*
  This file is part of ethash.

  ethash is free software: you can redistribute it and/or modify
  it under the terms of the GNU General Public License as published by
  the Free Software Foundation, either version 3 of the License, or
  (at your option) any later version.

  ethash is distributed in the hope that it will be useful,
  but WITHOUT ANY WARRANTY; without even the implied warranty of
  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.	See the
  GNU General Public License for more details.

  You should have received a copy of the GNU General Public License
  along with cpp-ethereum.	If not, see <http://www.gnu.org/licenses/>.
*/
/** @file internal.c
* @author Tim Hughes <tim@twistedfury.com>
* @author Matthew Wampler-Doty
* @date 2015
*/

#include <assert.h>
#include <inttypes.h>
#include <stddef.h>
#include <errno.h>
#include <math.h>
#include "mmap.h"
#include "ethash.h"
#include "fnv.h"
#include "endian.h"
#include "internal.h"
#include "data_sizes.h"
#include "io.h"

#ifdef WITH_CRYPTOPP

#include "sha3_cryptopp.h"

#else
#include "sha3.h"
#endif // WITH_CRYPTOPP

uint64_t ethash_get_datasize(uint64_t const block_number)
{
	assert(block_number / ETHASH_EPOCH_LENGTH < 2048);
	return dag_sizes[block_number / ETHASH_EPOCH_LENGTH];
}

uint64_t ethash_get_cachesize(uint64_t const block_number)
{
	assert(block_number / ETHASH_EPOCH_LENGTH < 2048);
	return cache_sizes[block_number / ETHASH_EPOCH_LENGTH];
}

// Follows Sergio's "STRICT MEMORY HARD HASHING FUNCTIONS" (2014)
// https://bitslog.files.wordpress.com/2013/12/memohash-v0-3.pdf
// SeqMemoHash(s, R, N)
static bool ethash_compute_cache_nodes(
	node* const nodes,
	uint64_t cache_size,
	ethash_h256_t const* seed
)
{
	if (cache_size % sizeof(node) != 0) {
		return false;
	}
	uint32_t const num_nodes = (uint32_t) (cache_size / sizeof(node));

	SHA3_512(nodes[0].bytes, (uint8_t*)seed, 32);

	for (uint32_t i = 1; i != num_nodes; ++i) {
		SHA3_512(nodes[i].bytes, nodes[i - 1].bytes, 64);
	}

	for (uint32_t j = 0; j != ETHASH_CACHE_ROUNDS; j++) {
		for (uint32_t i = 0; i != num_nodes; i++) {
			uint32_t const idx = nodes[i].words[0] % num_nodes;
			node data;
			data = nodes[(num_nodes - 1 + i) % num_nodes];
			for (uint32_t w = 0; w != NODE_WORDS; ++w) {
				data.words[w] ^= nodes[idx].words[w];
			}
			SHA3_512(nodes[i].bytes, data.bytes, sizeof(data));
		}
	}

	// now perform endian conversion
	fix_endian_arr32(nodes->words, num_nodes * NODE_WORDS);
	return true;
}

void ethash_calculate_dag_item(
	node* const ret,
	uint32_t node_index,
	ethash_light_t const light
)
{
	uint32_t num_parent_nodes = (uint32_t) (light->cache_size / sizeof(node));
	node const* cache_nodes = (node const *) light->cache;
	node const* init = &cache_nodes[node_index % num_parent_nodes];
	memcpy(ret, init, sizeof(node));
	ret->words[0] ^= node_index;
	SHA3_512(ret->bytes, ret->bytes, sizeof(node));
#if defined(_M_X64) && ENABLE_SSE
	__m128i const fnv_prime = _mm_set1_epi32(FNV_PRIME);
	__m128i xmm0 = ret->xmm[0];
	__m128i xmm1 = ret->xmm[1];
	__m128i xmm2 = ret->xmm[2];
	__m128i xmm3 = ret->xmm[3];
#elif defined(__MIC__)
	__m512i const fnv_prime = _mm512_set1_epi32(FNV_PRIME);
	__m512i zmm0 = ret->zmm[0];
#endif

	for (uint32_t i = 0; i != ETHASH_DATASET_PARENTS; ++i) {
		uint32_t parent_index = fnv_hash(node_index ^ i, ret->words[i % NODE_WORDS]) % num_parent_nodes;
		node const *parent = &cache_nodes[parent_index];

#if defined(_M_X64) && ENABLE_SSE
		{
			xmm0 = _mm_mullo_epi32(xmm0, fnv_prime);
			xmm1 = _mm_mullo_epi32(xmm1, fnv_prime);
			xmm2 = _mm_mullo_epi32(xmm2, fnv_prime);
			xmm3 = _mm_mullo_epi32(xmm3, fnv_prime);
			xmm0 = _mm_xor_si128(xmm0, parent->xmm[0]);
			xmm1 = _mm_xor_si128(xmm1, parent->xmm[1]);
			xmm2 = _mm_xor_si128(xmm2, parent->xmm[2]);
			xmm3 = _mm_xor_si128(xmm3, parent->xmm[3]);

			// have to write to ret as values are used to compute index
			ret->xmm[0] = xmm0;
			ret->xmm[1] = xmm1;
			ret->xmm[2] = xmm2;
			ret->xmm[3] = xmm3;
		}
		#elif defined(__MIC__)
		{
			zmm0 = _mm512_mullo_epi32(zmm0, fnv_prime);

			// have to write to ret as values are used to compute index
			zmm0 = _mm512_xor_si512(zmm0, parent->zmm[0]);
			ret->zmm[0] = zmm0;
		}
		#else
		{
			for (unsigned w = 0; w != NODE_WORDS; ++w) {
				ret->words[w] = fnv_hash(ret->words[w], parent->words[w]);
			}
		}
#endif
	}
	SHA3_512(ret->bytes, ret->bytes, sizeof(node));
}

bool ethash_compute_full_data(
	void* mem,
	uint64_t full_size,
	ethash_light_t const light,
	ethash_callback_t callback
)
{
	if (full_size % (sizeof(uint32_t) * MIX_WORDS) != 0 ||
		(full_size % sizeof(node)) != 0) {
		return false;
	}
	uint32_t const max_n = (uint32_t)(full_size / sizeof(node));
	node* full_nodes = mem;
	double const progress_change = 1.0f / max_n;
	double progress = 0.0f;
	// now compute full nodes
	for (uint32_t n = 0; n != max_n; ++n) {
		if (callback &&
			n % (max_n / 100) == 0 &&
			callback((unsigned int)(ceil(progress * 100.0f))) != 0) {

			return false;
		}
		progress += progress_change;
		ethash_calculate_dag_item(&(full_nodes[n]), n, light);
	}
	return true;
}

static bool ethash_hash(
	ethash_return_value_t* ret,
	node const* full_nodes,
	ethash_light_t const light,
	uint64_t full_size,
	ethash_h256_t const header_hash,
	uint64_t const nonce
)
{
	if (full_size % MIX_WORDS != 0) {
		return false;
	}

	// pack hash and nonce together into first 40 bytes of s_mix
	assert(sizeof(node) * 8 == 512);
	node s_mix[MIX_NODES + 1];
	memcpy(s_mix[0].bytes, &header_hash, 32);
	fix_endian64(s_mix[0].double_words[4], nonce);

	// compute sha3-512 hash and replicate across mix
	SHA3_512(s_mix->bytes, s_mix->bytes, 40);
	fix_endian_arr32(s_mix[0].words, 16);

	node* const mix = s_mix + 1;
	for (uint32_t w = 0; w != MIX_WORDS; ++w) {
		mix->words[w] = s_mix[0].words[w % NODE_WORDS];
	}

	unsigned const page_size = sizeof(uint32_t) * MIX_WORDS;
	unsigned const num_full_pages = (unsigned) (full_size / page_size);

	for (unsigned i = 0; i != ETHASH_ACCESSES; ++i) {
		uint32_t const index = fnv_hash(s_mix->words[0] ^ i, mix->words[i % MIX_WORDS]) % num_full_pages;

		for (unsigned n = 0; n != MIX_NODES; ++n) {
			node const* dag_node;
			node tmp_node;
			if (full_nodes) {
				dag_node = &full_nodes[MIX_NODES * index + n];
			} else {
				ethash_calculate_dag_item(&tmp_node, index * MIX_NODES + n, light);
				dag_node = &tmp_node;
			}

#if defined(_M_X64) && ENABLE_SSE
			{
				__m128i fnv_prime = _mm_set1_epi32(FNV_PRIME);
				__m128i xmm0 = _mm_mullo_epi32(fnv_prime, mix[n].xmm[0]);
				__m128i xmm1 = _mm_mullo_epi32(fnv_prime, mix[n].xmm[1]);
				__m128i xmm2 = _mm_mullo_epi32(fnv_prime, mix[n].xmm[2]);
				__m128i xmm3 = _mm_mullo_epi32(fnv_prime, mix[n].xmm[3]);
				mix[n].xmm[0] = _mm_xor_si128(xmm0, dag_node->xmm[0]);
				mix[n].xmm[1] = _mm_xor_si128(xmm1, dag_node->xmm[1]);
				mix[n].xmm[2] = _mm_xor_si128(xmm2, dag_node->xmm[2]);
				mix[n].xmm[3] = _mm_xor_si128(xmm3, dag_node->xmm[3]);
			}
			#elif defined(__MIC__)
			{
				// __m512i implementation via union
				//	Each vector register (zmm) can store sixteen 32-bit integer numbers
				__m512i fnv_prime = _mm512_set1_epi32(FNV_PRIME);
				__m512i zmm0 = _mm512_mullo_epi32(fnv_prime, mix[n].zmm[0]);
				mix[n].zmm[0] = _mm512_xor_si512(zmm0, dag_node->zmm[0]);
			}
			#else
			{
				for (unsigned w = 0; w != NODE_WORDS; ++w) {
					mix[n].words[w] = fnv_hash(mix[n].words[w], dag_node->words[w]);
				}
			}
#endif
		}

	}

// Workaround for a GCC regression which causes a bogus -Warray-bounds warning.
// The regression was introduced in GCC 4.8.4, fixed in GCC 5.0.0 and backported to GCC 4.9.3 but
// never to the GCC 4.8.x line.
//
// See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=56273
//
// This regression is affecting Debian Jesse (8.5) builds of cpp-ethereum (GCC 4.9.2) and also
// manifests in the doublethinkco armel v5 cross-builds, which use crosstool-ng and resulting
// in the use of GCC 4.8.4.  The Tizen runtime wants an even older GLIBC version - the one from
// GCC 4.6.0!

#if defined(__GNUC__) && (__GNUC__ < 5)
#pragma GCC diagnostic push
#pragma GCC diagnostic ignored "-Warray-bounds"
#endif // define (__GNUC__)

	// compress mix
	for (uint32_t w = 0; w != MIX_WORDS; w += 4) {
		uint32_t reduction = mix->words[w + 0];
		reduction = reduction * FNV_PRIME ^ mix->words[w + 1];
		reduction = reduction * FNV_PRIME ^ mix->words[w + 2];
		reduction = reduction * FNV_PRIME ^ mix->words[w + 3];
		mix->words[w / 4] = reduction;
	}

#if defined(__GNUC__) && (__GNUC__ < 5)
#pragma GCC diagnostic pop
#endif // define (__GNUC__)

	fix_endian_arr32(mix->words, MIX_WORDS / 4);
	memcpy(&ret->mix_hash, mix->bytes, 32);
	// final Keccak hash
	SHA3_256(&ret->result, s_mix->bytes, 64 + 32); // Keccak-256(s + compressed_mix)
	return true;
}

void ethash_quick_hash(
	ethash_h256_t* return_hash,
	ethash_h256_t const* header_hash,
	uint64_t const nonce,
	ethash_h256_t const* mix_hash
)
{
	uint8_t buf[64 + 32];
	memcpy(buf, header_hash, 32);
	fix_endian64_same(nonce);
	memcpy(&(buf[32]), &nonce, 8);
	SHA3_512(buf, buf, 40);
	memcpy(&(buf[64]), mix_hash, 32);
	SHA3_256(return_hash, buf, 64 + 32);
}

ethash_h256_t ethash_get_seedhash(uint64_t block_number)
{
	ethash_h256_t ret;
	ethash_h256_reset(&ret);
	uint64_t const epochs = block_number / ETHASH_EPOCH_LENGTH;
	for (uint32_t i = 0; i < epochs; ++i)
		SHA3_256(&ret, (uint8_t*)&ret, 32);
	return ret;
}

bool ethash_quick_check_difficulty(
	ethash_h256_t const* header_hash,
	uint64_t const nonce,
	ethash_h256_t const* mix_hash,
	ethash_h256_t const* boundary
)
{

	ethash_h256_t return_hash;
	ethash_quick_hash(&return_hash, header_hash, nonce, mix_hash);
	return ethash_check_difficulty(&return_hash, boundary);
}

ethash_light_t ethash_light_new_internal(uint64_t cache_size, ethash_h256_t const* seed)
{
	struct ethash_light *ret;
	ret = calloc(sizeof(*ret), 1);
	if (!ret) {
		return NULL;
	}
#if defined(__MIC__)
	ret->cache = _mm_malloc((size_t)cache_size, 64);
#else
	ret->cache = malloc((size_t)cache_size);
#endif
	if (!ret->cache) {
		goto fail_free_light;
	}
	node* nodes = (node*)ret->cache;
	if (!ethash_compute_cache_nodes(nodes, cache_size, seed)) {
		goto fail_free_cache_mem;
	}
	ret->cache_size = cache_size;
	return ret;

fail_free_cache_mem:
#if defined(__MIC__)
	_mm_free(ret->cache);
#else
	free(ret->cache);
#endif
fail_free_light:
	free(ret);
	return NULL;
}

ethash_light_t ethash_light_new(uint64_t block_number)
{
	ethash_h256_t seedhash = ethash_get_seedhash(block_number);
	ethash_light_t ret;
	ret = ethash_light_new_internal(ethash_get_cachesize(block_number), &seedhash);
	ret->block_number = block_number;
	return ret;
}

void ethash_light_delete(ethash_light_t light)
{
	if (light->cache) {
		free(light->cache);
	}
	free(light);
}

ethash_return_value_t ethash_light_compute_internal(
	ethash_light_t light,
	uint64_t full_size,
	ethash_h256_t const header_hash,
	uint64_t nonce
)
{
  	ethash_return_value_t ret;
	ret.success = true;
	if (!ethash_hash(&ret, NULL, light, full_size, header_hash, nonce)) {
		ret.success = false;
	}
	return ret;
}

ethash_return_value_t ethash_light_compute(
	ethash_light_t light,
	ethash_h256_t const header_hash,
	uint64_t nonce
)
{
	uint64_t full_size = ethash_get_datasize(light->block_number);
	return ethash_light_compute_internal(light, full_size, header_hash, nonce);
}

static bool ethash_mmap(struct ethash_full* ret, FILE* f)
{
	int fd;
	char* mmapped_data;
	errno = 0;
	ret->file = f;
	if ((fd = ethash_fileno(ret->file)) == -1) {
		return false;
	}
	mmapped_data = mmap(
		NULL,
		(size_t)ret->file_size + ETHASH_DAG_MAGIC_NUM_SIZE,
		PROT_READ | PROT_WRITE,
		MAP_SHARED,
		fd,
		0
	);
	if (mmapped_data == MAP_FAILED) {
		return false;
	}
	ret->data = (node*)(mmapped_data + ETHASH_DAG_MAGIC_NUM_SIZE);
	return true;
}

ethash_full_t ethash_full_new_internal(
	char const* dirname,
	ethash_h256_t const seed_hash,
	uint64_t full_size,
	ethash_light_t const light,
	ethash_callback_t callback
)
{
	struct ethash_full* ret;
	FILE *f = NULL;
	ret = calloc(sizeof(*ret), 1);
	if (!ret) {
		return NULL;
	}
	ret->file_size = (size_t)full_size;

	enum ethash_io_rc err = ethash_io_prepare(dirname, seed_hash, &f, (size_t)full_size, false);
	if (err == ETHASH_IO_FAIL)
		goto fail_free_full;

	if (err == ETHASH_IO_MEMO_SIZE_MISMATCH) {
		// if a DAG of same filename but unexpected size is found, silently force new file creation
		if (ethash_io_prepare(dirname, seed_hash, &f, (size_t)full_size, true) != ETHASH_IO_MEMO_MISMATCH) {
			ETHASH_CRITICAL("Could not recreate DAG file after finding existing DAG with unexpected size.");
			goto fail_free_full;
		}
		// we now need to go through the mismatch case, NOT the match case
		err = ETHASH_IO_MEMO_MISMATCH;
	}

	if (err == ETHASH_IO_MEMO_MISMATCH || err == ETHASH_IO_MEMO_MATCH) {
		if (!ethash_mmap(ret, f)) {
			ETHASH_CRITICAL("mmap failure()");
			goto fail_close_file;
		}

		if (err == ETHASH_IO_MEMO_MATCH) {
#if defined(__MIC__)
			node* tmp_nodes = _mm_malloc((size_t)full_size, 64);
			//copy all nodes from ret->data
			//mmapped_nodes are not aligned properly
			uint32_t const countnodes = (uint32_t) ((size_t)ret->file_size / sizeof(node));
			//fprintf(stderr,"ethash_full_new_internal:countnodes:%d",countnodes);
			for (uint32_t i = 1; i != countnodes; ++i) {
				tmp_nodes[i] = ret->data[i];
			}
			ret->data = tmp_nodes;
#endif
			return ret;
		}
	}

#if defined(__MIC__)
	ret->data = _mm_malloc((size_t)full_size, 64);
#endif
	if (!ethash_compute_full_data(ret->data, full_size, light, callback)) {
		ETHASH_CRITICAL("Failure at computing DAG data.");
		goto fail_free_full_data;
	}

	// after the DAG has been filled then we finalize it by writting the magic number at the beginning
	if (fseek(f, 0, SEEK_SET) != 0) {
		ETHASH_CRITICAL("Could not seek to DAG file start to write magic number.");
		goto fail_free_full_data;
	}
	uint64_t const magic_num = ETHASH_DAG_MAGIC_NUM;
	if (fwrite(&magic_num, ETHASH_DAG_MAGIC_NUM_SIZE, 1, f) != 1) {
		ETHASH_CRITICAL("Could not write magic number to DAG's beginning.");
		goto fail_free_full_data;
	}
	if (fflush(f) != 0) {// make sure the magic number IS there
		ETHASH_CRITICAL("Could not flush memory mapped data to DAG file. Insufficient space?");
		goto fail_free_full_data;
	}
	return ret;

fail_free_full_data:
	// could check that munmap(..) == 0 but even if it did not can't really do anything here
	munmap(ret->data, (size_t)full_size);
#if defined(__MIC__)
	_mm_free(ret->data);
#endif
fail_close_file:
	fclose(ret->file);
fail_free_full:
	free(ret);
	return NULL;
}

ethash_full_t ethash_full_new(ethash_light_t light, ethash_callback_t callback)
{
	char strbuf[256];
	if (!ethash_get_default_dirname(strbuf, 256)) {
		return NULL;
	}
	uint64_t full_size = ethash_get_datasize(light->block_number);
	ethash_h256_t seedhash = ethash_get_seedhash(light->block_number);
	return ethash_full_new_internal(strbuf, seedhash, full_size, light, callback);
}

void ethash_full_delete(ethash_full_t full)
{
	// could check that munmap(..) == 0 but even if it did not can't really do anything here
	munmap(full->data, (size_t)full->file_size);
	if (full->file) {
		fclose(full->file);
	}
	free(full);
}

ethash_return_value_t ethash_full_compute(
	ethash_full_t full,
	ethash_h256_t const header_hash,
	uint64_t nonce
)
{
	ethash_return_value_t ret;
	ret.success = true;
	if (!ethash_hash(
		&ret,
		(node const*)full->data,
		NULL,
		full->file_size,
		header_hash,
		nonce)) {
		ret.success = false;
	}
	return ret;
}

void const* ethash_full_dag(ethash_full_t full)
{
	return full->data;
}

uint64_t ethash_full_dag_size(ethash_full_t full)
{
	return full->file_size;
}

The code is now a bit more complex compared to the sha_256 functions: the functions are longer and they interleave C with assembler-level intrinsics.

Remove unnecessary tests and precompute data

Remember that a conditional test can flush the processor pipeline, because it implies a jump instruction and therefore breaks the sequence of instructions. So if we use constants or well-known values we can drop the guard tests at the start of a function. Note that it is not necessary to remove assert calls: these checks are only generated in debug builds, so a good practice is to use assert instead of if to validate parameter values.
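
As a hedged illustration of this point (the function and its names are made up for the example), the same guard can be written as a runtime test or as an assert; the assert version costs nothing once the code is compiled with -DNDEBUG:

#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Guard written as a runtime test: the branch executes on every call,
 * even in optimized release builds. */
static int copy_block_checked(uint8_t* dst, const uint8_t* src, size_t size) {
    if (size != 64) {
        return -1;
    }
    memcpy(dst, src, 64);
    return 0;
}

/* Same guard expressed with assert: it disappears entirely when compiled
 * with -DNDEBUG, so the release hot path contains no branch. */
static void copy_block(uint8_t* dst, const uint8_t* src, size_t size) {
    assert(size == 64);
    (void)size;               /* silence the unused-parameter warning in release builds */
    memcpy(dst, src, 64);
}

int main(void) {
    uint8_t a[64] = {0}, b[64];
    copy_block(b, a, sizeof(a));
    printf("%d\n", copy_block_checked(b, a, sizeof(a)));
    return 0;
}
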
We can also specialize functions (see the previous post for an example) to remove constant parameters. The aim is to create sha_256_32 when the size is 32, sha_256_64 when the size is 64, and keep a generic function with a parameter when we cannot decide what the size is. The counterpart of this method is that it increases the code size and we end up with three copies of the code, so maintenance will be harder. We can do the same with ethash_hash to remove the full_nodes parameter and then remove the if (full_nodes) test in the inner loop.
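
Below is a rough sketch of the specialization idea with hypothetical names (mix_32 and mix_64 are stand-ins, not the actual patch): once the size is a compile-time constant inside each specialized variant, the compiler can unroll the loop and drop the size checks, while a generic version remains for the other call sites.

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical sketch: a generic routine with a runtime size parameter,
 * plus two specialized variants where the size is baked in. */
static uint32_t mix_generic(const uint8_t* p, size_t size) {
    uint32_t h = 0x811c9dc5u;
    for (size_t i = 0; i < size; ++i)      /* bound only known at run time */
        h = (h ^ p[i]) * 0x01000193u;
    return h;
}

static uint32_t mix_32(const uint8_t* p) { /* size fixed to 32: fully unrollable */
    return mix_generic(p, 32);             /* after inlining, the constant propagates */
}

static uint32_t mix_64(const uint8_t* p) { /* size fixed to 64 */
    return mix_generic(p, 64);
}

int main(void) {
    uint8_t buf[64] = {0};
    printf("%08x %08x\n", mix_32(buf), mix_64(buf));
    return 0;
}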

The loop

		for (unsigned n = 0; n != MIX_NODES; ++n) {
			node const* dag_node;
			node tmp_node;
			ethash_calculate_dag_item(&tmp_node, index * MIX_NODES + n, light);
			….
		}

can be changed to

        unsigned preindex = MIX_NODES * index ;
		for (unsigned n = 0; n != MIX_NODES; ++n) {
			node const* dag_node;
			dag_node = &full_nodes[preindex++];

The difference is slight, but we replace a lea (load effective address) computation of the index on every iteration with a simple register increment; this code is called 64 times, so we save a little bit of time.

Let’s do a performance test:

./eth -M -t 1 --benchmark-trial 15
cpp-ethereum, a C++ Ethereum client
   04:43:41 PM.112|eth  #00004000…
Benchmarking on platform: 8-thread CPU
Preparing DAG...
Warming up...
    04:43:41 PM.112|miner0  Loading full DAG of seedhash: #00000000…
    04:43:42 PM.008|miner0  Full DAG loaded
Trial 1... 99273
Trial 2... 101280
Trial 3... 102040
Trial 4... 100733
Trial 5... 101026
min/mean/max: 99273/100870/102040 H/s
inner mean: 101013 H/s

Not a bad result: it is the first time we exceed 100 kH/s.

If you want to test it yourself, check out the V1.2 tag.

Specific optimization

Right now the code is written in “pure” C, which means it can be compiled without too much effort for anything from a Raspberry Pi to the latest Intel processor.

The last optimization step is specific to the target. In this phase we optimize the code for a specific platform; in our example we will use the 128-bit registers provided by SSE2, and as a counterpart the code will only work on 64-bit Intel/AMD processors.

In this project an SSE implementation already exists, but it is not enabled by default. To enable this code generation you have to do two things:

  • Tell the compiler that you want to use SSE instructions
  • Set the ENABLE_SSE define to 1 (line 7 of the internal.h file)

The question you might ask is: why is this option not always enabled, and why don’t modern processors use SSE instructions by default? The answer is not only historical; there is also a performance reason which at first sight seems counter-intuitive. If we enable SSE we have more registers available, but task switching becomes more expensive because the processor has to save more registers, and these registers are mostly useful for intensive computation and useless for common computing tasks.


Etherminer Optimization

People often ask me what the best way to optimize code is. The best way to understand how to do it is to work through an example, so I’m going to show you how to optimize the implementation of the Ethereum mining algorithm. This miner also has a very useful command to measure the hashrate, which will help us track the performance improvement. To help you follow the process I added a git tag for each of the steps described below.

git clone --recursive https://github.com/fflayol/cpp-ethereum.git
cd cpp-ethereum
mkdir build
cd build
cmake ..; make -j3
cd eth
make
./eth -M -t 1 --benchmark-trial 15

It gives

~/Perso/mod/cpp-ethereum/build/eth$ ./eth -M -t 1 --benchmark-trial 15
cpp-ethereum, a C++ Ethereum client
    03:11:20 PM.445|eth  #00004000…
Benchmarking on platform: 8-thread CPU
Preparing DAG...
Warming up...
    03:11:20 PM.445|miner0  Loading full DAG of seedhash: #00000000…
    03:11:21 PM.438|miner0  Full DAG loaded
Trial 1... 86326
Trial 2... 90166
Trial 3... 91300
Trial 4... 97646
Trial 5... 95880
min/mean/max: 86326/92263/97646 H/s
inner mean: 92448 H/s

The last command gives us a performance baseline against which to measure our improvements.

What to optimize

To start optimizing we have to know which functions take the most time. For this purpose we can use valgrind’s callgrind tool.

valgrind --tool=callgrind  ./eth -M -t 1 --benchmark-trial 15

After execution, callgrind saves a file that you can open with kcachegrind.


If we order by execution time, two files stand out. If we focus on sha3.c, two functions are very time consuming: sha3_512 and sha3_256. If we optimize these two functions a bit, the whole program will be faster. I will now show you the different steps used to make them as fast as possible.

Be careful: this kind of optimization has several drawbacks:

  • Code becomes harder to maintain and to understand, so only do these optimizations on well-tested, well-covered code.
  • To maximize the gain you have to stay as close as possible to the target hardware, so porting an optimization from one target to another can be very difficult.

Ensure that function calls are optimal

Let’s start with the sha3.c file:

/** libkeccak-tiny
*
* A single-file implementation of SHA-3 and SHAKE.
*
* Implementor: David Leon Gil
* License: CC0, attribution kindly requested. Blame taken too,
* but not liability.
*/
#include "sha3.h"

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/******** The Keccak-f[1600] permutation ********/

/*** Constants. ***/
static const uint8_t rho[24] = \
	{ 1,  3,   6, 10, 15, 21,
	  28, 36, 45, 55,  2, 14,
	  27, 41, 56,  8, 25, 43,
	  62, 18, 39, 61, 20, 44};
static const uint8_t pi[24] = \
	{10,  7, 11, 17, 18, 3,
	 5, 16,  8, 21, 24, 4,
	 15, 23, 19, 13, 12, 2,
	 20, 14, 22,  9, 6,  1};
static const uint64_t RC[24] = \
	{1ULL, 0x8082ULL, 0x800000000000808aULL, 0x8000000080008000ULL,
	 0x808bULL, 0x80000001ULL, 0x8000000080008081ULL, 0x8000000000008009ULL,
	 0x8aULL, 0x88ULL, 0x80008009ULL, 0x8000000aULL,
	 0x8000808bULL, 0x800000000000008bULL, 0x8000000000008089ULL, 0x8000000000008003ULL,
	 0x8000000000008002ULL, 0x8000000000000080ULL, 0x800aULL, 0x800000008000000aULL,
	 0x8000000080008081ULL, 0x8000000000008080ULL, 0x80000001ULL, 0x8000000080008008ULL};

/*** Helper macros to unroll the permutation. ***/
#define rol(x, s) (((x) << s) | ((x) >> (64 - s)))
#define REPEAT6(e) e e e e e e
#define REPEAT24(e) REPEAT6(e e e e)
#define REPEAT5(e) e e e e e
#define FOR5(v, s, e)							\
	v = 0;										\
	REPEAT5(e; v += s;)

/*** Keccak-f[1600] ***/
static inline void keccakf(void* state) {
	uint64_t* a = (uint64_t*)state;
	uint64_t b[5] = {0};
	uint64_t t = 0;
	uint8_t x, y;

	for (int i = 0; i < 24; i++) {
		// Theta
		FOR5(x, 1,
				b[x] = 0;
				FOR5(y, 5,
						b[x] ^= a[x + y]; ))
		FOR5(x, 1,
				FOR5(y, 5,
						a[y + x] ^= b[(x + 4) % 5] ^ rol(b[(x + 1) % 5], 1); ))
		// Rho and pi
		t = a[1];
		x = 0;
		REPEAT24(b[0] = a[pi[x]];
				a[pi[x]] = rol(t, rho[x]);
				t = b[0];
				x++; )
		// Chi
		FOR5(y,
				5,
				FOR5(x, 1,
						b[x] = a[y + x];)
				FOR5(x, 1,
				a[y + x] = b[x] ^ ((~b[(x + 1) % 5]) & b[(x + 2) % 5]); ))
		// Iota
		a[0] ^= RC[i];
	}
}

/******** The FIPS202-defined functions. ********/

/*** Some helper macros. ***/

#define _(S) do { S } while (0)
#define FOR(i, ST, L, S)							\
	_(for (size_t i = 0; i < L; i += ST) { S; })
#define mkapply_ds(NAME, S)						\
	static inline void NAME(uint8_t* dst,			\
		const uint8_t* src,						\
		size_t len) {								\
		FOR(i, 1, len, S);							\
	}
#define mkapply_sd(NAME, S)						\
	static inline void NAME(const uint8_t* src,	\
		uint8_t* dst,								\
		size_t len) {								\
		FOR(i, 1, len, S);							\
	}

mkapply_ds(xorin, dst[i] ^= src[i])  // xorin
mkapply_sd(setout, dst[i] = src[i])  // setout

#define P keccakf
#define Plen 200

// Fold P*F over the full blocks of an input.
#define foldP(I, L, F)								\
	while (L >= rate) {							\
		F(a, I, rate);								\
		P(a);										\
		I += rate;									\
		L -= rate;									\
	}

/** The sponge-based hash construction. **/
static inline int hash(uint8_t* out, size_t outlen,
		const uint8_t* in, size_t inlen,
		size_t rate, uint8_t delim) {
	if ((out == NULL) || ((in == NULL) && inlen != 0) || (rate >= Plen)) {
		return -1;
	}
	uint8_t a[Plen] = {0};
	// Absorb input.
	foldP(in, inlen, xorin);
	// Xor in the DS and pad frame.
	a[inlen] ^= delim;
	a[rate - 1] ^= 0x80;
	// Xor in the last block.
	xorin(a, in, inlen);
	// Apply P
	P(a);
	// Squeeze output.
	foldP(out, outlen, setout);
	setout(a, out, outlen);
	memset(a, 0, 200);
	return 0;
}

#define defsha3(bits)													\
	int sha3_##bits(uint8_t* out, size_t outlen,						\
		const uint8_t* in, size_t inlen) {								\
		if (outlen > (bits/8)) {										\
			return -1;                                                  \
		}																\
		return hash(out, outlen, in, inlen, 200 - (bits / 4), 0x01);	\
	}

/*** FIPS202 SHA3 FOFs ***/
defsha3(256)
defsha3(512)

sha3_256 and sha3_512 are generated by the defsha3 macro with a parameter, so the first step here is to “specialize” them into plain functions and inline them. The code becomes the following:

inline    int sha3_256(uint8_t* out, size_t outlen, const uint8_t* in, size_t inlen)
{
    if (outlen > 32)
    {
        return -1;
    }
    return hash(out, outlen, in, inlen, 136, 0x01);
}

inline    int sha3_512(uint8_t* out, size_t outlen, const uint8_t* in, size_t inlen)
{
    if (outlen > 64)
    {
        return -1;
    }
    return hash(out, outlen, in, inlen, 72, 0x01);
}

The performance results will be strictly the same, so what is the aim of this change? It shows that sha3_256 and sha3_512 are just wrappers around the hash function.
This hash function is static, so it is only called in this file, and what is interesting here is that it is called with one parameter always set to 0x01 and another that takes only two different values.
So as a first step we can remove the delim parameter from the hash function. Why is it important? If we call functions with constant arguments, the compiler can easily optimize our code by pre-calculating values, simplifying allocations and removing tests.
For instance:
#include <iostream>

int foo(int size){
  if (size == 0){
     return 0;
  }
  return size + 1;
}

int main(){
  std::cout << foo(10) << std::endl;
}

In the code above, the test (size == 0) is useless for this call, so the compiler can remove the call to foo entirely and replace it with 11.
Now, for our hash function, we can remove the delim parameter and the test on the rate value, which gives:

   /** The sponge-based hash construction. **/
    static inline int hash(
        uint8_t* out, size_t outlen, const uint8_t* in, size_t inlen, size_t rate)
{
    if ((out == NULL) || ((in == NULL) && inlen != 0))
    {
        return -1;
    }
    uint8_t a[Plen] = {0};
    // Absorb input.
    foldP(in, inlen, xorin);
    // Xor in the DS and pad frame.
    a[inlen] ^= 0x01;
    a[rate - 1] ^= 0x80;
    // Xor in the last block.
    xorin(a, in, inlen);
    // Apply P
    P(a);
    // Squeeze output.
    foldP(out, outlen, setout);
    setout(a, out, outlen);
    memset(a, 0, 200);
    return 0;
}


inline    int sha3_256(uint8_t* out, size_t outlen, const uint8_t* in, size_t inlen)
{
    if (outlen > 32)
    {
        return -1;
    }
    return hash(out, outlen, in, inlen, 136);
}

inline    int sha3_512(uint8_t* out, size_t outlen, const uint8_t* in, size_t inlen)
{
    if (outlen > 64)
    {
        return -1;
    }
    return hash(out, outlen, in, inlen, 72);
}

Surprisingly it is still possible to optimize sha3_512 and sha3_256 further. If you search for where these functions are used, you will find that sha3_256 is always called with outlen set to 32 and sha3_512 with outlen set to 64. So we can remove this parameter from both functions.

   
    static inline int hash(
        uint8_t* out, size_t outlen, const uint8_t* in, size_t inlen, size_t rate)
{
    if ((out == NULL) || ((in == NULL) && inlen != 0) )
    {
        return -1;
    }
    uint8_t a[Plen] = {0};
    // Absorb input.
    foldP(in, inlen, xorin);
    // Xor in the DS and pad frame.
    a[inlen] ^= 0x01;
    a[rate - 1] ^= 0x80;
    // Xor in the last block.
    xorin(a, in, inlen);
    // Apply P
    P(a);
    // Squeeze output.
    foldP(out, outlen, setout);
    setout(a, out, outlen);
    memset(a, 0, 200);
    return 0;
}


inline    int sha3_256(uint8_t* out, const uint8_t* in, size_t inlen)
{    
    return hash(out, 32, in, inlen, 136);
}

inline    int sha3_512(uint8_t* out, const uint8_t* in, size_t inlen)
{
    return hash(out, 64, in, inlen, 72);
}

You also have to change sha3.h accordingly. We have reached a milestone; I added a git tag for this first part. To get this version:

git checkout V1.1

Now it is time to see the results:

~/Perso/mod/cpp-ethereum/build/eth$ ./eth -M -t 1 --benchmark-trial 15
cpp-ethereum, a C++ Ethereum client
    03:02:13 PM.558|eth  #00004000…
Benchmarking on platform: 8-thread CPU
Preparing DAG...
Warming up...
    03:02:13 PM.558|miner0  Loading full DAG of seedhash: #00000000…
    03:02:14 PM.476|miner0  Full DAG loaded
Trial 1... 98380
Trial 2... 98653
Trial 3... 96666
Trial 4... 97993
Trial 5... 97900
min/mean/max: 96666/97918/98653 H/s
inner mean: 98091 H/s

The results are quite good: 98091 vs 92448 H/s, about 106% of the baseline, i.e. roughly 6% faster. Honestly, since we are not hashing exactly the same input, I think the real improvement is more like 4%.

So why do we get such a gain just by modifying and simplifying function calls? The reason is that modern processors do not like function calls; they reach their best performance when instructions are sequential, because that lets the processor re-arrange instructions and execute several of them in parallel.
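
As a hedged illustration (hypothetical code, not taken from the miner), a static inline wrapper lets the compiler remove the call completely, so the body is stitched into the caller’s instruction stream where the CPU can reorder and overlap it with neighbouring iterations:

#include <stdint.h>
#include <stdio.h>

/* Hypothetical example: with `static inline` and optimization enabled, the
 * call below disappears and the body is merged into the caller's loop. */
static inline uint32_t rotl_xor(uint32_t x, uint32_t y) {
    return ((x << 7) | (x >> 25)) ^ y;
}

int main(void) {
    uint32_t acc = 1;
    for (uint32_t i = 0; i < 16; ++i)
        acc = rotl_xor(acc, i);     /* no call/return overhead once inlined */
    printf("%08x\n", acc);
    return 0;
}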

Validation

After all these modifications you must run the test suite to ensure you did not break anything. If the tests cover a good part of the code, this gives reasonable confidence that your modifications are safe.

cd build
make test

Conclusion

We showed that with two hours of work, even with the latest compiler optimizations, there is still a way to speed up code without too much effort and, in this case, without compromising its readability. In the next post that will no longer be the case 🙂

Create your personal Web hosting 3

Up to now we have an inexpensive system that can answer HTTP requests.

But the IP we use is a private one (meaning it starts with 192.168.*), and I’m quite sure thousands of people worldwide use the same private address. That is not a problem as long as you use it inside your private network, but if you want everybody on the Internet to be able to reach your Olimex box, you have to find a solution.

The only device that has a public address is your ISP box: when any of your devices communicates with the outside, the box forwards the traffic and translates between your device’s private address and its own public one.

So you are facing several problems, and the solution may depend on your Internet provider.

But you must go through these steps:

ensure that your web server always has the same (private) IP address. It will help to always have the same rule to


Create your personal Web hosting 2

Now that we know Apache can handle a sufficient number of requests, the next step is to test with a real running server. For this purpose I will use WordPress.
WordPress needs a database, but the box uses an SD card as mass storage, which is not recommended and should be avoided. To handle this problem I will use a USB port to plug in a USB key, or later a hard disk if needed. To switch easily I will use a symbolic link in the system to change the target of the mass storage. Make sure that you install a RAID solution (see http://olinuxino.4pro-web.com/how-to-speedup-access/)

apt-get install mysql-server
mysql_secure_installation
apt-get install php5  php5-mysql
apachectl restart

Now download wordpress

cd /mnt/raid
wget https://wordpress.org/latest.zip
unzip latest.zip
chmod  -R 755 wordpress
chown -R www-data wordpress

Now it’s time to install wordpress

Edit the file /etc/apache2/sites-enabled/000-default and change the DocumentRoot value to point to your external drive (/mnt/raid/wordpress in my case).
Then go to your laptop and type your box’s IP address in your web browser (192.168.1.71 in my case).

Be careful: if you get errors like “cannot insert or create databases”, ensure that the mass storage has 755 permissions.

If the WordPress installation succeeded, you now have a site reachable from your laptop at 192.168.1.71.

Now we are going to tweak Apache. As I have already said, SD cards are not reliable, so we are going to put the Apache logs in memory. The logs will be accessed like a normal filesystem, but they will be stored in RAM and deleted if you reboot your box. Nevertheless you can add a script to persist them at specific times.
Debian has a filesystem for this named tmpfs, so my /etc/fstab looks like:

# /etc/fstab: static file system information.
#
# <file system> <mount point>   <type>  <options>       <dump>  <pass>
/dev/root      /               ext4    noatime,errors=remount-ro 0 1
tmpfs    /run/shm    tmpfs    defaults    0 0
tmpfs    /tmp        tmpfs    defaults    0 0
tmpfs    /var/tmp    tmpfs    defaults    0 0
tmpfs    /var/log/apache2    tmpfs    defaults,size=50M    0 0
/dev/nanda   /media/nand auto    rw,user,noauto,exec   0   0
#/dev/sda1   /media/hdd auto    rw,user,noauto,exec   0   0
/dev/sda1       /mnt/sdb1           ext4    defaults        0       0 
/dev/sdb1       /mnt/sdb2           ext4    defaults        0       0
/dev/md0 	/mnt/raid	ext2	defaults 	0	1

Now we have the same problem with the MySQL database, so I will move it onto the USB key (the RAID 1 array in our case).
To do so:

service mysql stop 
mv /var/lib/mysql /var/lib/mysql-old
mkdir /mnt/raid/mysql
ln -s /mnt/raid/mysql /var/lib/mysql
cp -pr /var/lib/mysql-old/* /mnt/raid/mysql 
/etc/init.d/mysqld start
Now let’s benchmark the WordPress site with ab:
ab -n 400 -c 50 -g test_data_1.txt http://192.168.1.71/
Document Path:          /
Document Length:        251 bytes
 
Concurrency Level:      50
Time taken for tests:   30.235 seconds
Complete requests:      400
Failed requests:        0
Non-2xx responses:      400
Total transferred:      242800 bytes
HTML transferred:       100400 bytes
Requests per second:    13.23 [#/sec] (mean)
Time per request:       3779.398 [ms] (mean)
Time per request:       75.588 [ms] (mean, across all concurrent requests)
Transfer rate:          7.84 [Kbytes/sec] received

The results are not very good: we can serve only 13 requests/s. So let’s try to optimize:

sudo apt-get install php-apc 
vi /etc/php5/apache2/php.ini
add extension=apc.so at the end of the file

Add WP Super Cache to your WordPress installation and then we have a fully functional site.

In conclusion, we can support around 50 requests/s.

Our environment can support a large number of requests, so in the last part of this series I will show you how to put your website on the web.

ab -n 400 -c 50 -g test_data_1.txt http://5.49.77.158/
Server Software:        Apache/2.2.22
Server Hostname:        5.49.77.158
Server Port:            80
 
Document Path:          /
Document Length:        9287 bytes
 
Concurrency Level:      50
Time taken for tests:   4.262 seconds
Complete requests:      400
Failed requests:        0
Total transferred:      3836400 bytes
HTML transferred:       3714800 bytes
Requests per second:    93.84 [#/sec] (mean)
Time per request:       532.798 [ms] (mean)
Time per request:       10.656 [ms] (mean, across all concurrent requests)
Transfer rate:          878.96 [Kbytes/sec] received

How to set up a raid 1 filesystem

As I previously wrote, I’m trying to use the Olimex A20 as a web server with a USB key as mass storage. But you must not forget that a USB device is not very reliable, so I will show you how to set up a RAID system to improve reliability.
A RAID system uses at least two devices and (at least for RAID 1) copies data on both of them. It means that if one of them breaks, you have a backup.
But I have no idea how it impacts performance; that is the subject of this article.

A friend of mine gave me two inexpensive USB keys, so I will test them.

To have a reference, I tested both USB keys.

For my first key

For small blocks (1 KB):

dd bs=1K count=512000 if=/dev/zero of=test conv=fdatasync 
512000+0 records in 
512000+0 records out 524288000 bytes (524 MB) copied, 82.303 s, 6.4 MB/s

For big blocks (1 MB):

dd bs=1M count=4120 if=/dev/zero of=test conv=fdatasync
3720+0 records in 
3719+0 records out 3900125184 bytes (3.9 GB) copied, 578.207 s, 6.7 MB/s

For my second key

root@a20-olimex:/mnt/sdb1# dd bs=1M count=3900 if=/dev/zero of=test conv=fdatasync 
3720+0 records in 3719+0 records out 3900588032 bytes (3.9 GB) copied, 647.667 s, 6.0 MB/s 
root@a20-olimex:/mnt/sdb1# dd bs=1K count=812000 if=/dev/zero of=test conv=fdatasync 
812000+0 records in 
812000+0 records out 831488000 bytes (831 MB) copied, 140.967 s, 5.9 MB/s

My second key seems a bit slower.
Now we can create the RAID 1 array and see how it affects speed.
There is a special command to create a RAID array: mdadm.

apt-get install mdadm
umount /dev/sda1
umount /dev/sdb1
sudo mdadm --create /dev/md0 --level=1 --assume-clean --raid-devices=2 /dev/sda /dev/sdb

If you do

Now /dev/md0 is seen as a normal block device; we just have to create a filesystem on it and mount it.

Edit /etc/fstab and add the following line:

/dev/md0 /mnt/raid ext2 defaults 0 1
mount -a

And normally you have a new partition mounted under /mnt/raid, which mirrors sda and sdb; its capacity is the capacity of the smaller of the two.

Now check the speed of the raid:

cd /mnt/raid
dd bs=1K count=512000 if=/dev/zero of=test conv=fdatasync
512000+0 records in
512000+0 records out
524288000 bytes (524 MB) copied, 171.296 s, 3.1 MB/s
root@a20-olimex:/mnt/raid# dd bs=1M count=2900 if=/dev/zero of=test conv=fdatasync
2900+0 records in
2900+0 records out
3040870400 bytes (3.0 GB) copied, 977.195 s, 3.1 MB/s

This result is normal: in RAID 1 data is written to both USB keys, so roughly halved write performance is what we should expect.
Now we have a choice: keep good write performance but have something less reliable, or accept the RAID overhead for reliability. You also have to know that if you plan to keep your A20 running 24 hours a day, your hardware will stay warm for a long time. From my point of view, if you are not interested in raw performance, a USB RAID can be a solution; you can also use hard disks instead and get better performance, but in practice the setup is exactly the same.

As a result you have something like that:

Create your personal Web hosting 1

As some of my personal and professional web hosting contracts are up for renewal, I was thinking of moving them to my Olimex A20.

Reasons why I think the A20 can be a solution:

  • low power consumption , so I can leave it on
  • my sites have a small audience, so performance is not a bottleneck (I’m not sure shared hosting would perform better)
  • I can control exactly which services are started, which is very instructive and challenging
  • I spend my money on something I can reuse later instead of renting it
  • For the price of one year of web hosting I can buy a new A20
  • It’s challenging and I might learn a lot of things

The other side of the coin

  • Very difficult to configure (you have to configure a lot of things: DNS, Apache, opening your Internet connection)
  • You are now responsible for the security of your system (you are opening your system to the Internet)

Starting from fresh installation

To decide, I first need an idea of the performance of a stock Apache installation, and I will remove everything unused (X Window for instance).

Removing unnecessary modules and services

I started from a fresh installation and configured the network as I explained in a previous article.
I removed the unnecessary X Window libraries; the box will start faster and I keep more space on the SD card.

Removing unnecessary modules

apt-get remove --auto-remove --purge libx11-.*
apt-get autoremove --purge

Before:

 /dev/root 3808912 2634612 980816 73% /

After that:

free -m
total used free shared buffers cached
Mem: 874 63 811 0 2 34
-/+ buffers/cache: 26 848
Swap: 0 0 0

I know I can gain more memory by switching to single-user mode and allowing only one terminal, but I will do that in a later version.

You can also remove unnecessary modules:

lsmod

Module Size Used by
cpufreq_powersave 1207 0
cpufreq_userspace 3318 0
cpufreq_conservative 6042 0
cpufreq_stats 3699 0
g_ether 55821 0
pwm_sunxi 9255 0
gt2005 13408 0
nand 114172 0
sun4i_keyboard 2150 0
ledtrig_heartbeat 1370 0
leds_sunxi 3733 0
led_class 3539 1 leds_sunxi
sunxi_emac 34009 0
sunxi_gmac 29505 0
8192cu 454131 0

free -m
total used free shared buffers cached
Mem: 874 63 811 0 2 34
-/+ buffers/cache: 26 848
Swap: 0 0 0

I can remove the modules that handle video memory:

rmmod sun4i_csi0 videobuf_dma_contig videobuf_core ump lcd sunxi_cedar_mod gt2005

After that, free -m gives:

total used free shared buffers cached
Mem: 874 62 812 0 2 34
-/+ buffers/cache: 25 849
Swap: 0 0 0

Raw Performance
By default Apache is installed; to test it, do:

wget -O- http://192.168.1.171/

It will fetch index.html from /var/www.

As a first step I want to know how many requests this Apache can handle. From my laptop I use a very useful command named ab, which issues concurrent requests and reports the timings and how many requests were handled. From my console I type:


ab -n 4000 -c 100 -g test_data_1.txt http://192.168.1.71/

where 4000 is the total number of requests, 100 is the number of concurrent clients, test_data_1.txt is the gnuplot output file, and the last argument is the URL you want to test.

Server Software:        Apache/2.2.22
Document Length:        999 bytes
Concurrency Level:      100
Time taken for tests:   3.426 seconds
Complete requests:      4000
Failed requests:        0
Total transferred:      5104000 bytes
HTML transferred:       3996000 bytes
Requests per second:    1167.60 [#/sec] (mean)
Time per request:       85.646 [ms] (mean)
Time per request:       0.856 [ms] (mean, across all concurrent requests)
Transfer rate:          1454.94 [Kbytes/sec] received

To test a more realistic example, I did the following on the box:

cd /var/www
wget -r -O  http://olinuxino.4pro-web.com/

From my laptop

ab -n 400 -c 100 -g test_data_1.txt http://192.168.1.71/olinuxino.4pro-web.com/index.html


Document Path:          /olinuxino.4pro-web.com/index.html
Document Length:        48352 bytes
 
Concurrency Level:      100
Time taken for tests:   1.756 seconds
Complete requests:      400
Failed requests:        0
Total transferred:      19452800 bytes
HTML transferred:       19340800 bytes
Requests per second:    227.79 [#/sec] (mean)
Time per request:       438.995 [ms] (mean)
Time per request:       4.390 [ms] (mean, across all concurrent requests)
Transfer rate:          10818.39 [Kbytes/sec] received
 
It looks like the A20 can support the number of requests I expect.
In the next article I will test with a running WordPress site, and I will explain how to configure and optimize Apache and put your website on the Internet.

Tricks to know

1) How to use X Window outside the Olinuxino

Connect to the Olinuxino with:

ssh -X root@ip_box

After connecting, do:

xhost +

Now you can launch any X application directly on the box

For instance you can do:

synaptic

Olinuxino goes to hadoop

One thing I would like to test is how the Olinuxino behaves when used in a Hadoop environment.

At first sight it might not be a very good fit. Indeed, it does not have much storage capacity and you cannot treat an SD card as a normal filesystem.

The solution I propose to test is to use a laptop (or another machine) as the master (and HDFS filesystem). I would like to know whether the filesystem or the network is the bottleneck.

Step 1: Install a Hadoop server

In my case it is my laptop and I use a 30 GB partition on my disk. The IP I will use for my server is 192.168.1.7.

This step installs a “normal” Hadoop distribution on a PC, which will be used as the master node.

To simplify things, the best approach is to add aliases in /etc/hosts.

Step 2: setting up the board

Download the standard image for the Olinuxino. It can be found here: https://www.olimex.com/wiki/images/2/29/Debian_FS_34_90_camera_A20-olimex.torrent, taken from the official Olinuxino GitHub (https://github.com/OLIMEX/OLINUXINO/tree/master/SOFTWARE/A20/A20-build).

The first problem is that the network is not enabled by default. Change the file /etc/network/interfaces and add:

auto eth0
allow-hotplug eth0
iface eth0 inet dhcp

Then type:

sudo dhclient eth0
/etc/init.d/networking restart

Get the board’s IP address by typing:

ifconfig
    eth0      Link encap:Ethernet  HWaddr 02:cf:07:01:5a:b7
    inet addr:192.168.1.254  Bcast:192.168.1.255  Mask:255.255.255.248

Then update the system and install ssh:

    sudo apt-get update
    sudo apt-get upgrade
    sudo apt-get install ssh

Edit /etc/ssh/sshd_config and ensure that the lines “PermitRootLogin yes” and “StrictModes no” are present with these values, then restart ssh:

/etc/init.d/ssh restart

To test that everything is correctly set up, go to your server computer and type (the default password for root is olimex):

ssh 192.168.1.254 -l root

Also try to connect to your server (in my case the address is 192.168.1.75 and I created an account named local):

ssh 192.168.1.75 -l local

Step 1: Adding a User

sudo addgroup hadoop_group
sudo adduser --ingroup hadoop_group hduser1
sudo adduser hduser1 sudo
su - hduser1
vi ~/.bashrc
# add the following lines
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-armhf/jre/
export HADOOP_HOME=/home/hduser1/hadoop
export MAHOUT_HOME=/home/hduser1/hadoop/mahout

ssh-keygen -t rsa -P ""
cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys

Step 2: installing hadoop

 

sudo aptitude install openjdk-7-jre

wget http://apache.mirrors.ovh.net/ftp.apache.org/dist/hadoop/common/hadoop-2.7.0/hadoop-2.7.0.tar.gz

tar zxvf hadoop-2.7.0.tar.gz

mv hadoop-2.7.0 hadoop

Now we have to modify the configuration files to access the master node and HDFS.

Edit hadoop-env.sh

set the line export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-armhf/jre

Edit the file core-site.xml and add:

    <property>
    <name>hadoop.tmp.dir</name>
    <value>/tmp</value>
    <description>A base for other temporary directories.</description>
    </property>

    <property>
    <name>fs.default.name</name>
    <value>hdfs://192.168.1.75:54310</value>
    <description>The name of the default file system. A URI whose
    scheme and authority determine the FileSystem implementation. The
    uri's scheme determines the config property (fs.SCHEME.impl) naming
    the FileSystem implementation class. The uri's authority is used to
    determine the host, port, etc. for a filesystem.</description>
    </property>

Now you can test if everything works:

hadoop fs -ls

Step 3: installing mahout

One easy way to test Hadoop is to install Mahout. This project includes several Hadoop jobs to classify data, so we will use it for testing purposes.

Download it from apache:

cd hadoop
wget http://apache.mirrors.ovh.net/ftp.apache.org/dist/mahout/0.10.0/mahout-distribution-0.10.0.tar.gz
tar zxvf mahout-distribution-0.10.0.tar.gz
mv mahout-distribution-0.10.0 mahout

Now everything should be correctly set up.

Step 4: Benchmarking

Go into hadoop/mahout, install curl, and run the example once:

sudo apt-get install curl
examples/bin/classify-wikipedia.sh

Now, to benchmark it, do:

time examples/bin/classify-wikipedia.sh

How to correctly set up a toolchain

Creating a toolchain on your PC to compile for ARM can be a good idea, for several reasons:

  • Your PC is far faster than your box, so you can compile the whole system in a few minutes where the board would need more than half an hour.
  • You can easily automate a nightly build from the latest sources for testing.

But it is not so easy:

You have to compile for a processor that is not necessarily yours (from Intel to ARM, or a GPU for instance), and you also need the native libraries. In this case you need both your host’s libraries and those built for ARM, and you cannot use the package manager to install the ARM ones: you have to do it “by hand”.

To set up a toolchain easily you have to:

Install a gcc compiler for ARM. There is a good version packaged in Ubuntu, and this is the easy part:

sudo apt-get update

sudo apt-get install gcc-arm-linux-gnueabi
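
Once the compiler is installed, a quick sanity check (a minimal sketch, assuming the Ubuntu toolchain provides the arm-linux-gnueabi-gcc binary) is to cross-compile a trivial C program and confirm the result is an ARM binary:

/* hello.c - cross-compilation sanity check.
 * Build it with:   arm-linux-gnueabi-gcc -static -o hello hello.c
 * Then `file hello` should report an ARM executable; -static avoids the
 * cross-library problem discussed below, since libc is linked in.
 */
#include <stdio.h>

int main(void) {
    printf("Hello from an ARM cross-build\n");
    return 0;
}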

For cross libraries like libc (which is the most important library for compilation)