Friday, December 9, 2016

the horizontal sum - float

We would like to do everything in parallel, but sometimes we need to combine the elements of a single SSE2 register.

The horizontal sum is one of the most fundamental of these combinations.

Unfortunately, the instruction set works against us.  Until SSE3 there wasn't an instruction to combine the elements of a single __m128 register, so with SSE2 we have to build the sum out of shuffles and adds:

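#include <xmmintrin.h>  // __m128 and the _mm_*_ps intrinsics

// Lane 0 holds A, lane 3 holds D; the sum ends up in lane 0.
float horizontalSum_SSE2(const __m128 &mABCD)
{
    // Copy the upper half (C, D) over the lower half: (C, D, C, D).
    __m128 mCDCD = _mm_movehl_ps(mABCD, mABCD);
    // (A+C, B+D, 2C, 2D); only the two low lanes matter from here on.
    __m128 mApCBpD = _mm_add_ps(mABCD, mCDCD);
    // Broadcast lane 1 (B+D) to every lane.
    __m128 mBpD = _mm_shuffle_ps(mApCBpD, mApCBpD, 0x55);
    // Lane 0 is now (A+C) + (B+D).
    __m128 mApBpCpD = _mm_add_ps(mApCBpD, mBpD);
    return _mm_cvtss_f32(mApBpCpD);
}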
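
To see what each step leaves in the register, here is a small sketch that runs the same four intrinsics and stores every intermediate back to memory. The struct HorizontalSumTrace and the function traceHorizontalSum are hypothetical names made up for this illustration, and it assumes lane 0 holds the first argument.

```cpp
#include <xmmintrin.h>  // __m128, _mm_setr_ps, _mm_movehl_ps, _mm_shuffle_ps, ...

// Hypothetical helper: the four lanes of each intermediate register.
struct HorizontalSumTrace
{
    float cdcd[4];      // after _mm_movehl_ps
    float pairSums[4];  // after the first add
    float total[4];     // after the second add
};

// Hypothetical tracing variant of the horizontal sum (illustration only).
HorizontalSumTrace traceHorizontalSum(float a, float b, float c, float d)
{
    HorizontalSumTrace t;
    __m128 mABCD = _mm_setr_ps(a, b, c, d);  // lane 0 = a ... lane 3 = d
    __m128 mCDCD = _mm_movehl_ps(mABCD, mABCD);
    __m128 mApCBpD = _mm_add_ps(mABCD, mCDCD);
    __m128 mBpD = _mm_shuffle_ps(mApCBpD, mApCBpD, 0x55);
    __m128 mApBpCpD = _mm_add_ps(mApCBpD, mBpD);
    _mm_storeu_ps(t.cdcd, mCDCD);
    _mm_storeu_ps(t.pairSums, mApCBpD);
    _mm_storeu_ps(t.total, mApBpCpD);
    return t;
}
```

For the input (1, 2, 3, 4) the intermediates are (3, 4, 3, 4), then (4, 6, 6, 8), then (10, 12, 12, 14). Only lane 0 of the last register is the full sum; the upper lanes hold leftovers. horizontalSum_SSE2 keeps none of these intermediates and returns only lane 0.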


horizontalSum_SSE2 compiles to:

movaps      xmm1,xmm0 
movhlps     xmm1,xmm0 
addps       xmm1,xmm0 
movaps      xmm0,xmm1 
shufps      xmm0,xmm1,55h 
addps       xmm0,xmm1 

This is pretty good: six instructions, two of them plain register copies (movaps), and no trips through memory.
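
For comparison, here is a sketch of the same sum using the SSE3 horizontal add mentioned earlier (_mm_hadd_ps, which maps to haddps). The name horizontalSum_SSE3 is hypothetical, and the target attribute is an assumption for GCC and Clang builds that do not pass -msse3; it is not needed on MSVC.

```cpp
#include <pmmintrin.h>  // SSE3: _mm_hadd_ps (also pulls in the SSE/SSE2 headers)

// Hypothetical SSE3 variant (illustration only).
#if defined(__GNUC__) || defined(__clang__)
__attribute__((target("sse3")))  // assumption: lets this compile without -msse3
#endif
float horizontalSum_SSE3(__m128 mABCD)
{
    // (A+B, C+D, A+B, C+D)
    __m128 mPairs = _mm_hadd_ps(mABCD, mABCD);
    // Every lane now holds A+B+C+D.
    __m128 mTotal = _mm_hadd_ps(mPairs, mPairs);
    return _mm_cvtss_f32(mTotal);
}
```

The source is shorter, but that does not by itself make it faster than the shuffle-and-add version; haddps is not a cheap instruction on every CPU, so it is worth measuring both. Note also that it adds the lanes in a different order, (A+B) + (C+D) instead of (A+C) + (B+D), so the two versions can differ in the last bits of the result.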
