SSE Usage Tutorial

Hi all,

SSE usage is a bit tricky.

You have to see SSE registers as vectors, not as single linear values, and the size of their elements depends on the context.

For instance, xmm0 (a 128-bit SSE register) can be seen as 2 x 64-bit values, 4 x 32-bit values, 8 x 16-bit values or 16 x 8-bit values.
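To make the "same bits, different views" idea concrete, here is a minimal sketch in plain C, without any intrinsics. The `reg128` type and the `as_u64`/`as_u32`/`as_u16` helpers are hypothetical names invented for this illustration; they only mimic how 16 bytes can be read as lanes of different widths.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for one XMM register: 16 bytes = 128 bits. */
typedef struct {
    uint8_t bytes[16];
} reg128;

/* Each helper copies the same 128 bits out as lanes of a different width. */
void as_u64(const reg128 *r, uint64_t out[2]) { memcpy(out, r->bytes, sizeof r->bytes); }
void as_u32(const reg128 *r, uint32_t out[4]) { memcpy(out, r->bytes, sizeof r->bytes); }
void as_u16(const reg128 *r, uint16_t out[8]) { memcpy(out, r->bytes, sizeof r->bytes); }

/* Print the 4 x 32-bit view of the register. */
void print_u32_view(const reg128 *r)
{
    uint32_t v[4];
    as_u32(r, v);
    printf("4 x 32 bits: %08x %08x %08x %08x\n",
           (unsigned)v[0], (unsigned)v[1], (unsigned)v[2], (unsigned)v[3]);
}
```

On a little-endian machine such as x86, the bytes 00 01 02 03 read as one 32-bit lane give 0x03020100, and the same four bytes read as two 16-bit lanes give 0x0100 and 0x0302.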

So what is the point of this?

If you want to add two arrays, the algorithm will be:

int a[4]; // let's say that an int is 32 bits wide
int b[4];
int c[4];

c[0]=a[0]+b[0];
c[1]=a[1]+b[1];
c[2]=a[2]+b[2];
c[3]=a[3]+b[3];

In assembly, it gives:

       .loc 1 6 0
        mov     edx, DWORD PTR [rbp-48]
        mov     eax, DWORD PTR [rbp-32]
        add     eax, edx
        mov     DWORD PTR [rbp-16], eax
        .loc 1 7 0
        mov     edx, DWORD PTR [rbp-44]
        mov     eax, DWORD PTR [rbp-28]
        add     eax, edx
        mov     DWORD PTR [rbp-12], eax
        .loc 1 8 0
        mov     edx, DWORD PTR [rbp-40]
        mov     eax, DWORD PTR [rbp-24]
        add     eax, edx
        mov     DWORD PTR [rbp-8], eax
        .loc 1 9 0
        mov     edx, DWORD PTR [rbp-36]
        mov     eax, DWORD PTR [rbp-20]
        add     eax, edx
        mov     DWORD PTR [rbp-4], eax
        mov     eax, 0

For instance, c[0] = a[0]+b[0] is generated like this:

       .loc 1 6 0
        mov     edx, DWORD PTR [rbp-48] ; a[0]
        mov     eax, DWORD PTR [rbp-32] ; b[0]
        add     eax, edx ; eax <- a[0]+b[0]
        mov     DWORD PTR [rbp-16], eax ; c[0] = eax

And we are doing that 4 times. But thanks to the SSE extension, we can do it with fewer instructions.

The Streaming SIMD Extensions enhance the x86 architecture in four ways:

  1. 8 new 128-bit SIMD floating-point registers that can be directly addressed;
  2. 50 new instructions that work on packed floating-point data;
  3. 8 new instructions designed to control cacheability of all MMX and 32-bit x86 data types, including the ability to stream data to memory without polluting the caches, and to prefetch data before it is actually used;
  4. 12 new instructions that extend the instruction set.
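The cacheability instructions from point 3 are also reachable from C through intrinsics. Below is a rough sketch, not a tuned routine: `stream_copy` is a hypothetical helper name, the prefetch distance of 16 floats is an arbitrary choice for illustration, and the function assumes that `dst` is 16-byte aligned and that `n` is a multiple of 4.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE: _mm_prefetch, _mm_loadu_ps, _mm_stream_ps, _mm_sfence */

/* Copy n floats from src to dst using streaming (non-temporal) stores.
 * Assumptions: dst is 16-byte aligned and n is a multiple of 4. */
void stream_copy(float *dst, const float *src, size_t n)
{
    for (size_t i = 0; i < n; i += 4) {
        /* Ask the CPU to start fetching data we will read a bit later.
         * The distance (16 floats ahead) is an arbitrary value for this sketch. */
        if (i + 16 < n)
            _mm_prefetch((const char *)(src + i + 16), _MM_HINT_T0);

        __m128 v = _mm_loadu_ps(src + i);  /* load 4 floats, no alignment required */
        _mm_stream_ps(dst + i, v);         /* store 4 floats without polluting the cache */
    }
    _mm_sfence();  /* order the streaming stores before any later stores */
}
```

Whether streaming stores actually help depends on the data size and the access pattern, so this is something to measure rather than assume.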

This set enables the programmer to develop algorithms that mix packed single-precision floating-point data and packed integer data, using SSE and MMX instructions respectively.

Intel SSE provides eight 128-bit registers, each of which can be directly addressed using the register names XMM0 to XMM7 (64-bit mode adds XMM8 to XMM15). Each register holds four 32-bit single-precision floating-point numbers, numbered 0 through 3.

SSE instructions operate in parallel either on all pairs of packed data elements or only on the least significant pair. The packed instructions (with PS suffix) operate on all four pairs of elements of the two operands, while scalar instructions (with SS suffix) always operate on the least significant pair only; for scalar operations, the three upper components from the first operand are passed through to the destination.

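There are two ways to use SSE registers:

Scalar: the operation is applied to the lowest element only, and the three other elements are copied from the first operand.

Packed: the same operation is applied to all 4 elements at once.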

(thanks to Stefano Tommesani)
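The difference between packed and scalar operations is easy to see with two intrinsics, `_mm_add_ps` (ADDPS) and `_mm_add_ss` (ADDSS). The wrapper names `add_packed` and `add_scalar` below are made up for this sketch.

```c
#include <xmmintrin.h>  /* SSE: __m128, _mm_loadu_ps, _mm_add_ps, _mm_add_ss, _mm_storeu_ps */

/* Packed add: out[i] = a[i] + b[i] for all four elements. */
void add_packed(const float a[4], const float b[4], float out[4])
{
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_add_ps(va, vb));
}

/* Scalar add: out[0] = a[0] + b[0]; out[1..3] are copied from a. */
void add_scalar(const float a[4], const float b[4], float out[4])
{
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_add_ss(va, vb));
}
```

With a = {1, 2, 3, 4} and b = {10, 20, 30, 40}, the packed version gives {11, 22, 33, 44}, while the scalar version gives {11, 2, 3, 4}.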

So let's return to our code. I think you can see where I am going. If we fill one register with 4 values (a[0]..a[3]) and a second register with (b[0]..b[3]), we can add them together and put the result in a third register. With this solution we do only one addition.

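#include <immintrin.h>  // SSE/SSE2 intrinsics: __m128i, _mm_add_epi32, ...
#include <stdint.h>     // uint8_t, uint16_t, uint32_t
#include <stdio.h>      // printf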

void p128_hex_u8(__m128i in) {
    uint8_t v[16] __attribute__((aligned (16)));  // _mm_store_si128 needs a 16-byte aligned destination
    _mm_store_si128((__m128i*)v, in);
    printf("v16_u8: %x %x %x %x | %x %x %x %x | %x %x %x %x | %x %x %x %x\n",
           v[0], v[1],  v[2],  v[3],  v[4],  v[5],  v[6],  v[7],
           v[8], v[9], v[10], v[11], v[12], v[13], v[14], v[15]);
}
 
void p128_hex_u16(__m128i in) {
    uint16_t v[8] __attribute__((aligned (16)));  // _mm_store_si128 needs a 16-byte aligned destination
    _mm_store_si128((__m128i*)v, in);
    printf("v8_u16: %x %x %x %x,  %x %x %x %x\n", v[0], v[1], v[2], v[3], v[4], v[5], v[6], v[7]);
}
 
void p128_hex_u32(__m128i in) {
    uint32_t v[4] __attribute__((aligned (16)));
    _mm_store_si128((__m128i*)v, in);
    printf("v4_u32: %x %x %x %x\n", v[0], v[1], v[2], v[3]);
}
 
void p128_dec_u32(__m128i in) {
    uint32_t v[4] __attribute__((aligned (16)));
    _mm_store_si128((__m128i*)v, in);
    printf("v4_u32: %u %u %u %u\n", v[0], v[1], v[2], v[3]);  // %u because the values are unsigned
}

void p128_hex_u64(__m128i in) {
    unsigned long long v[2] __attribute__((aligned (16)));  // uint64_t might give format-string warnings with %llx; it's just long in some ABIs
    _mm_store_si128((__m128i*)v, in);
    printf("v2_u64: %llx %llx\n", v[0], v[1]);
}
 
 
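int main(void) {
    uint32_t a[4] = {1, 2, 3, 4};      // each element is exactly 32 bits wide
    uint32_t b[4] = {11, 12, 13, 14};
    uint32_t c[4];

    // Scalar version: four separate additions
    c[0] = a[0] + b[0];
    c[1] = a[1] + b[1];
    c[2] = a[2] + b[2];
    c[3] = a[3] + b[3];
    printf("Result %u %u %u %u\n", c[0], c[1], c[2], c[3]);

    // SSE version: _mm_set_epi32 takes its arguments from the highest element to the lowest
    __m128i a1 = _mm_set_epi32(a[3], a[2], a[1], a[0]);
    __m128i b1 = _mm_set_epi32(b[3], b[2], b[1], b[0]);
    __m128i c1 = _mm_add_epi32(a1, b1);  // one instruction adds all four pairs
    p128_dec_u32(a1);
    p128_dec_u32(b1);
    p128_dec_u32(c1);
    return 0;
}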
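For arrays longer than four elements, the same idea is usually written as a loop that loads, adds and stores four values per iteration. Here is a possible sketch using SSE2 intrinsics; `add_u32_sse2` is a hypothetical name, and the unaligned load/store variants are used so that the arrays need no special alignment.

```c
#include <emmintrin.h>  /* SSE2: __m128i, _mm_loadu_si128, _mm_add_epi32, _mm_storeu_si128 */
#include <stddef.h>
#include <stdint.h>

/* c[i] = a[i] + b[i] for i in [0, n), four elements at a time. */
void add_u32_sse2(const uint32_t *a, const uint32_t *b, uint32_t *c, size_t n)
{
    size_t i = 0;

    /* Main loop: one 128-bit addition handles 4 x 32-bit elements. */
    for (; i + 4 <= n; i += 4) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        _mm_storeu_si128((__m128i *)(c + i), _mm_add_epi32(va, vb));
    }

    /* Tail: the remaining 0 to 3 elements are added one by one. */
    for (; i < n; i++)
        c[i] = a[i] + b[i];
}
```

Loading directly from memory also avoids rebuilding each register element by element with `_mm_set_epi32`.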

This is a very simple example; with optimization enabled, your compiler can already generate this kind of SSE code for you.
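As a sketch of what the compiler can work with, here is a plain C loop with no intrinsics at all. The function name `add_arrays` is made up for this example, and the `restrict` qualifiers are my assumption that the three arrays never overlap, which makes the compiler's job easier.

```c
#include <stddef.h>

/* Plain C addition loop. The restrict qualifiers promise the compiler
 * that a, b and c do not overlap (an assumption of this sketch). */
void add_arrays(const int *restrict a, const int *restrict b,
                int *restrict c, size_t n)
{
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}
```

With GCC, compiling with something like `gcc -O3 -S` and looking for `paddd` (the instruction behind `_mm_add_epi32`) in the generated assembly is one way to check whether the loop was vectorized.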
