We would like to do everything in parallel, but sometimes we need to combine the elements of a single SSE2 register.
The horizontal sum is one of the most fundamental of these combinations.
Unfortunately the instruction set works against us. Until SSE3 there was no instruction that adds up the elements within a single __m128 register, so with SSE2 we have to build the sum out of shuffles and vertical adds:
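float horizontalSum_SSE2(const __m128 &mABCD)
{
    // Copy the upper pair (C, D) down into the lower two lanes
    __m128 mCDCD = _mm_movehl_ps(mABCD, mABCD);
    // Lanes 0 and 1 now hold A+C and B+D
    __m128 mApCBpD = _mm_add_ps(mABCD, mCDCD);
    // Broadcast lane 1 (B+D) to every lane
    __m128 mBpD = _mm_shuffle_ps(mApCBpD, mApCBpD, 0x55);
    // Lane 0 now holds (A+C) + (B+D)
    __m128 mApBpCpD = _mm_add_ps(mApCBpD, mBpD);
    return _mm_cvtss_f32(mApBpCpD);
}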
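A quick way to convince yourself that the lane bookkeeping is right is to feed in four distinct values and compare against a plain scalar sum. The sketch below repeats the function so that it stands on its own; scalarSum and lanesAgree are names made up for this illustration, and it assumes A sits in lane 0, which is what _mm_setr_ps gives you.

```cpp
#include <emmintrin.h>  // SSE2 header; also pulls in the SSE intrinsics used here

// Same function as above, repeated so this snippet compiles on its own.
float horizontalSum_SSE2(const __m128 &mABCD)
{
    __m128 mCDCD    = _mm_movehl_ps(mABCD, mABCD);             // (C, D, C, D)
    __m128 mApCBpD  = _mm_add_ps(mABCD, mCDCD);                // (A+C, B+D, 2C, 2D)
    __m128 mBpD     = _mm_shuffle_ps(mApCBpD, mApCBpD, 0x55);  // B+D in every lane
    __m128 mApBpCpD = _mm_add_ps(mApCBpD, mBpD);               // lane 0 = A+B+C+D
    return _mm_cvtss_f32(mApBpCpD);
}

// Hypothetical helper: scalar reference using the same pairing, (A+C) + (B+D).
float scalarSum(float a, float b, float c, float d)
{
    return (a + c) + (b + d);
}

// Hypothetical helper: true when the SIMD and scalar results match.
bool lanesAgree(float a, float b, float c, float d)
{
    // _mm_setr_ps places its first argument in lane 0.
    __m128 v = _mm_setr_ps(a, b, c, d);
    return horizontalSum_SSE2(v) == scalarSum(a, b, c, d);
}
```

The scalar reference deliberately adds in the same order as the SIMD code. Floating-point addition is not associative, so a reference written as ((a + b) + c) + d could differ in the last bit for some inputs.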
horizontalSum_SSE2 is compiled to:
movaps xmm1,xmm0
movhlps xmm1,xmm0
addps xmm1,xmm0
movaps xmm0,xmm1
shufps xmm0,xmm1,55h
addps xmm0,xmm1
This is pretty good: six instructions, two of which are plain register copies.
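For comparison, here is roughly what the same sum could look like once SSE3's horizontal add is available through the _mm_hadd_ps intrinsic. Treat it as a sketch: the name horizontalSum_SSE3 is made up here, and the target("sse3") attribute is a GCC/Clang-specific assumption that lets this one function use SSE3 without enabling it for the whole file.

```cpp
#include <pmmintrin.h>  // SSE3 intrinsics, including _mm_hadd_ps

// Assumption: GCC or Clang. Other compilers may need no attribute at all,
// or a command-line switch instead.
#if defined(__GNUC__) || defined(__clang__)
__attribute__((target("sse3")))
#endif
float horizontalSum_SSE3(const __m128 &mABCD)
{
    // First pass adds adjacent pairs: (A+B, C+D, A+B, C+D)
    __m128 mApBCpD = _mm_hadd_ps(mABCD, mABCD);
    // Second pass adds those pairs: every lane = (A+B) + (C+D)
    __m128 mApBpCpD = _mm_hadd_ps(mApBCpD, mApBCpD);
    return _mm_cvtss_f32(mApBpCpD);
}
```

Shorter source does not automatically mean faster code, so it is worth measuring both versions on the CPUs you care about. Note also that this version pairs the elements as (A+B) + (C+D) rather than (A+C) + (B+D), so the two can differ in the last bit for some inputs.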