MicroPerf: the horizontal sum

Friday, December 9, 2016

the horizontal sum - float

We would like to do everything in parallel, but sometimes we need to combine the elements of a single SSE2 register.

The horizontal sum is one of the most fundamental of these combinations.

Unfortunately, the instruction set works against us. Until SSE3 added haddps, there was no instruction that adds together elements within a single __m128 register.
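For comparison, here is a rough sketch of what the SSE3 route could look like, using two `_mm_hadd_ps` calls. The names `horizontalSum_SSE3`, `checkHorizontalSum_SSE3` and the `HSUM_SSE3_TARGET` macro are made up for this illustration. The macro is only there so that GCC and Clang accept the SSE3 intrinsic when the file is not built with `-msse3`. This makes no claim that it is faster than the SSE2 version.

```cpp
#include <cassert>
#include <pmmintrin.h>  // SSE3: _mm_hadd_ps (also pulls in the SSE/SSE2 headers)

// Assumption: with GCC/Clang and no -msse3 flag, the function needs a target
// attribute before it may call SSE3 intrinsics. Elsewhere the macro is empty.
#if defined(__GNUC__) && !defined(__SSE3__)
#define HSUM_SSE3_TARGET __attribute__((target("sse3")))
#else
#define HSUM_SSE3_TARGET
#endif

// Lanes are written low to high: (lane0, lane1, lane2, lane3) = (A, B, C, D).
HSUM_SSE3_TARGET
float horizontalSum_SSE3(const __m128 &mABCD)
{
    __m128 mApBCpD = _mm_hadd_ps(mABCD, mABCD);       // (A+B, C+D, A+B, C+D)
    __m128 mApBpCpD = _mm_hadd_ps(mApBCpD, mApBCpD);  // every lane = (A+B)+(C+D)
    return _mm_cvtss_f32(mApBpCpD);                   // extract lane 0
}

// Hypothetical spot check: 1 + 2 + 3 + 4 is exactly 10 in float.
void checkHorizontalSum_SSE3()
{
    assert(horizontalSum_SSE3(_mm_setr_ps(1.0f, 2.0f, 3.0f, 4.0f)) == 10.0f);
}
```

On plain SSE2 the sum has to be assembled from shuffles and adds instead: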

float horizontalSum_SSE2(const __m128 &mABCD)
{
    __m128 mCDCD = _mm_movehl_ps(mABCD, mABCD);
    __m128 mApCBpD = _mm_add_ps(mABCD, mCDCD);
    __m128 mBpD = _mm_shuffle_ps(mApCBpD, mApCBpD, 0x55);
    __m128 mApBpCpD = _mm_add_ps(mApCBpD, mBpD);
    return _mm_cvtss_f32 (mApBpCpD);
}

This is compiled to:

movaps      xmm1,xmm0
movhlps     xmm1,xmm0
addps       xmm1,xmm0
movaps      xmm0,xmm1
shufps      xmm0,xmm1,55h
addps       xmm0,xmm1

This is pretty good: six instructions, two of which are plain register-to-register copies.
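To make the lane movement in horizontalSum_SSE2 easier to follow, here is the same sequence with each intermediate annotated, plus a small check. The names `horizontalSum_traced` and `checkHorizontalSum` are invented for this sketch. The test values are small integers so the float sum is exact.

```cpp
#include <cassert>
#include <xmmintrin.h>  // __m128, _mm_movehl_ps, _mm_shuffle_ps, _mm_cvtss_f32

// Same steps as horizontalSum_SSE2. Lanes are written low to high:
// (lane0, lane1, lane2, lane3) = (A, B, C, D).
float horizontalSum_traced(const __m128 &mABCD)
{
    __m128 mCDCD = _mm_movehl_ps(mABCD, mABCD);            // (C, D, C, D)
    __m128 mApCBpD = _mm_add_ps(mABCD, mCDCD);             // (A+C, B+D, 2C, 2D)
    __m128 mBpD = _mm_shuffle_ps(mApCBpD, mApCBpD, 0x55);  // lane 1 broadcast: (B+D, B+D, B+D, B+D)
    __m128 mApBpCpD = _mm_add_ps(mApCBpD, mBpD);           // lane 0 = (A+C)+(B+D)
    return _mm_cvtss_f32(mApBpCpD);                        // extract lane 0
}

// Hypothetical spot check with A=1, B=2, C=3, D=4.
void checkHorizontalSum()
{
    __m128 v = _mm_setr_ps(1.0f, 2.0f, 3.0f, 4.0f);  // _mm_setr_ps lists lanes low to high
    assert(horizontalSum_traced(v) == 10.0f);
}
```

Note that the additions happen as (A+C)+(B+D), not left to right, so for general inputs the result can differ in the last bit from a scalar loop that computes ((A+B)+C)+D.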
