Horizontal processing of any kind is awkward in SSE. Two later instruction-set updates, confusingly named SSE3 and SSSE3, made it easier: SSE3 added horizontal adds for floating point, and SSSE3 added them for integers.
These instructions allow for more compact versions of the horizontal sums needed at the end of dot products and similar reductions.
Each of the horizontal add instructions, including _mm_hadd_epi32 and _mm_hadd_ps, adds adjacent pairs of elements within each of its two parameters. The pair sums from the first parameter fill the lower half of the result, and the pair sums from the second parameter fill the upper half. Since we are looking for the sum of all four elements, this requires two calls to the instruction. We only read the lowest element of the final result, so the second parameter's contribution does not matter to us. Rather than involve an extra register, we pass the same value in twice in each of the calls.
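To make the lane ordering concrete, here is a scalar sketch of what the floating-point horizontal add computes. It is only a model for illustration, not the intrinsic itself: the names haddModel and horizontalSumModel are made up for this example, and index 0 stands for the lowest lane.

```cpp
#include <array>

// Scalar model of _mm_hadd_ps(a, b). Index 0 is the lowest lane.
// Result lanes: { a0+a1, a2+a3, b0+b1, b2+b3 }.
std::array<float, 4> haddModel(const std::array<float, 4> &a,
                               const std::array<float, 4> &b)
{
    std::array<float, 4> r = {{ a[0] + a[1], a[2] + a[3],
                                b[0] + b[1], b[2] + b[3] }};
    return r;
}

// Two passes, each with the same value on both sides,
// leave the full sum in every lane; we read lane 0.
float horizontalSumModel(const std::array<float, 4> &v)
{
    const std::array<float, 4> pairs = haddModel(v, v);          // {A+B, C+D, A+B, C+D}
    const std::array<float, 4> total = haddModel(pairs, pairs);  // every lane holds (A+B)+(C+D)
    return total[0];
}
```

Passing two different values shows why the second parameter lands only in the upper half: its pair sums never reach lane 0 after a single call.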
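#include <pmmintrin.h>  // SSE3: _mm_hadd_ps
#include <tmmintrin.h>  // SSSE3: _mm_hadd_epi32

__forceinline float horizontalSum_SSE3(const __m128 &mABCD)
{
    // First pass: lanes become {A+B, C+D, A+B, C+D}.
    __m128 mApB_CpD = _mm_hadd_ps(mABCD, mABCD);
    // Second pass: every lane becomes (A+B)+(C+D).
    __m128 mApBpCpD = _mm_hadd_ps(mApB_CpD, mApB_CpD);
    // Extract the lowest lane.
    return _mm_cvtss_f32(mApBpCpD);
}

__forceinline int horizontalSum_SSSE3(const __m128i &mABCD)
{
    // Same two passes for four 32-bit integers.
    __m128i mApB_CpD = _mm_hadd_epi32(mABCD, mABCD);
    __m128i mApBpCpD = _mm_hadd_epi32(mApB_CpD, mApB_CpD);
    // Extract the lowest lane.
    return _mm_cvtsi128_si32(mApBpCpD);
}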
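As a rough sketch of where this fits, here is a dot product that accumulates four partial sums vertically and only does the horizontal sum once at the end. The function name dotProduct_SSE3 and the DOT_SSE3_TARGET macro are made up for this example. It assumes the element count is a multiple of 4, and the macro is only there so that GCC and Clang will accept SSE3 intrinsics when SSE3 is not enabled on the command line.

```cpp
#include <cstddef>
#include <pmmintrin.h>  // SSE3: _mm_hadd_ps

// GCC and Clang need SSE3 enabled for this function unless -msse3 is given.
#if defined(__GNUC__) && !defined(__SSE3__)
#define DOT_SSE3_TARGET __attribute__((target("sse3")))
#else
#define DOT_SSE3_TARGET
#endif

// Hypothetical helper: dot product of two float arrays.
// Assumes count is a multiple of 4. The pointers may be unaligned.
DOT_SSE3_TARGET
float dotProduct_SSE3(const float *a, const float *b, std::size_t count)
{
    // Four running partial sums, one per lane.
    __m128 mSum = _mm_setzero_ps();
    for (std::size_t i = 0; i < count; i += 4)
    {
        const __m128 mA = _mm_loadu_ps(a + i);
        const __m128 mB = _mm_loadu_ps(b + i);
        mSum = _mm_add_ps(mSum, _mm_mul_ps(mA, mB));
    }

    // The horizontal sum happens once, after the loop.
    const __m128 mApB_CpD = _mm_hadd_ps(mSum, mSum);
    const __m128 mApBpCpD = _mm_hadd_ps(mApB_CpD, mApB_CpD);
    return _mm_cvtss_f32(mApBpCpD);
}
```

Keeping the horizontal work outside the loop is the point: the loop body stays purely vertical, and the two horizontal adds are paid only once per dot product.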
The number of different instruction-set combinations is getting large. We will have to address this difficulty in an upcoming post.