Summing all of the elements of an array is pretty simple with SSE, and the compiler does a pretty good job of vectorizing a simple sum automatically.
A cumulative sum takes a bit more work. I've had some cases where the input is the distance between adjacent points, and we'd like to create an array with the cumulative distance up to each point. The scalar version is straightforward:
// __in_ecount / __out_ecount are MSVC SAL annotations giving the buffer sizes.
// Returns the final running sum so that a later call can continue from it.
float CumulativeSum(__in_ecount(cElements) const float * const aIn,
                    __out_ecount(cElements) float * const aOut,
                    size_t cElements,
                    const float fRunningSumIn=0)
{
    float fRunningSum = fRunningSumIn;
    for (size_t iOnElement = 0; iOnElement < cElements; iOnElement++)
    {
        aOut[iOnElement] = (fRunningSum += aIn[iOnElement]);
    }
    return fRunningSum;
}
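As a rough sketch of how the return value and the fRunningSumIn parameter fit together, the input can be processed in pieces, with each call picking up where the previous one stopped. CumulativeSumPortable is just the routine above restated without the SAL annotations so the snippet compiles on its own. CumulativeSumInTwoChunks and ExampleTwoChunks are hypothetical helpers made up for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Restatement of the scalar routine above without the MSVC SAL annotations.
float CumulativeSumPortable(const float* aIn, float* aOut, std::size_t cElements,
                            float fRunningSumIn = 0)
{
    float fRunningSum = fRunningSumIn;
    for (std::size_t i = 0; i < cElements; i++)
    {
        aOut[i] = (fRunningSum += aIn[i]);
    }
    return fRunningSum;
}

// Hypothetical helper: handles the first `split` elements in one call and the
// rest in a second call, feeding the first call's result into the second.
std::vector<float> CumulativeSumInTwoChunks(const std::vector<float>& in, std::size_t split)
{
    assert(split <= in.size());
    std::vector<float> out(in.size());
    const float fCarried = CumulativeSumPortable(in.data(), out.data(), split);
    CumulativeSumPortable(in.data() + split, out.data() + split, in.size() - split, fCarried);
    return out;
}

// Usage: distances between adjacent points become distances along the path.
inline void ExampleTwoChunks()
{
    const std::vector<float> distances = {1.0f, 2.0f, 0.5f, 4.0f};
    const std::vector<float> cumulative = CumulativeSumInTwoChunks(distances, 1);
    assert((cumulative == std::vector<float>{1.0f, 3.0f, 3.5f, 7.5f}));
}
```

Wherever the split falls, the output matches a single pass over the whole array, because the scalar loop adds the elements in the same order either way.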
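The SSE version does require a horizontal running (prefix) sum within each register, which takes a couple of shift-and-add steps rather than a single instruction. But that's okay.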
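#include <emmintrin.h> // SSE2: _mm_slli_si128 and the cast intrinsics

// Not an Intel intrinsic, just a helper named in the same style.
// Shifts the floats up by nElementsToShift lanes, filling the low lanes with zero.
template<const unsigned char nElementsToShift>
__m128 _mm_slli_ps(__m128 m)
{
    return _mm_castsi128_ps(_mm_slli_si128(_mm_castps_si128(m), 4 * nElementsToShift));
}

// For lanes [a, b, c, d], returns [a, a+b, a+b+c, a+b+c+d].
__m128 SumRegister(__m128 mIn)
{
    __m128 mAddShift = _mm_add_ps(mIn, _mm_slli_ps<1>(mIn)); // [a, a+b, b+c, c+d]
    return _mm_add_ps(mAddShift, _mm_slli_ps<2>(mAddShift));
}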
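To make the two shift-and-add steps in SumRegister easier to follow, here is a small standalone trace of what each lane holds along the way. The helper is renamed to ShiftLanesUp for this sketch only. PrefixSumOfFour and ExamplePrefixSumOfFour are hypothetical names for the illustration.

```cpp
#include <emmintrin.h> // SSE2 intrinsics
#include <array>
#include <cassert>

// Same byte shift as the helper above: moves each float up by nLanes lanes
// and fills the vacated low lanes with zero.
template <int nLanes>
__m128 ShiftLanesUp(__m128 m)
{
    return _mm_castsi128_ps(_mm_slli_si128(_mm_castps_si128(m), 4 * nLanes));
}

// Hypothetical wrapper that runs the two steps on four values and returns the lanes.
std::array<float, 4> PrefixSumOfFour(float a, float b, float c, float d)
{
    const __m128 mIn = _mm_setr_ps(a, b, c, d);                         // [a, b, c, d]
    // Adding [0, a, b, c] gives each lane the sum of itself and its left neighbor.
    const __m128 mStep1 = _mm_add_ps(mIn, ShiftLanesUp<1>(mIn));        // [a, a+b, b+c, c+d]
    // Adding [0, 0, a, a+b] completes the running sum in the upper two lanes.
    const __m128 mStep2 = _mm_add_ps(mStep1, ShiftLanesUp<2>(mStep1));  // [a, a+b, a+b+c, a+b+c+d]
    std::array<float, 4> out;
    _mm_storeu_ps(out.data(), mStep2);
    return out;
}

// Usage: the running totals of 1, 2, 3, 4 are 1, 3, 6, 10.
inline void ExamplePrefixSumOfFour()
{
    const std::array<float, 4> sums = PrefixSumOfFour(1.0f, 2.0f, 3.0f, 4.0f);
    assert(sums[0] == 1.0f && sums[1] == 3.0f && sums[2] == 6.0f && sums[3] == 10.0f);
}
```

Note that the top lane is computed as (c+d)+(a+b) rather than ((a+b)+c)+d, so for values that do not add exactly, the result can differ from the scalar loop in the last bits.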
float CumulativeSum_SSE(const float * aIn, float * aOut, size_t cElements, float fRunningSum = 0)
{
    // Handle the leading 0-3 elements with the scalar loop so the rest is a multiple of four.
    size_t cOnElement = cElements & 3;
    fRunningSum = CumulativeSum(aIn, aOut, cOnElement, fRunningSum);
    __m128 mRunningSum = _mm_set1_ps(fRunningSum);
    while (cOnElement<cElements)
    {
        __m128 mIn = _mm_loadu_ps(aIn + cOnElement);
        // Prefix sum within the register, offset by the total so far.
        __m128 mOut = _mm_add_ps(mRunningSum, SumRegister(mIn));
        _mm_storeu_ps(aOut + cOnElement, mOut);
        // Broadcast the top lane (the new total) to all four lanes.
        mRunningSum = _mm_shuffle_ps(mOut, mOut, _MM_SHUFFLE(3, 3, 3, 3));
        cOnElement += 4;
    }
    return _mm_cvtss_f32(mRunningSum);
}
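One way to sanity-check the SSE routine is to compare it against the scalar loop for every length up to some bound, so that all four possible remainders of the length modulo 4 get exercised. The sketch below restates both routines compactly (SAL annotations dropped, shift helper renamed) so it compiles on its own. SseMatchesScalar and ExampleSseMatchesScalar are hypothetical names, and the test data is deliberately limited to small integers so that every partial sum is exact and the two versions can be compared with ==.

```cpp
#include <emmintrin.h> // SSE2 intrinsics
#include <cassert>
#include <cstddef>
#include <vector>

float ScalarCumulativeSum(const float* aIn, float* aOut, std::size_t cElements, float fRunningSum = 0)
{
    for (std::size_t i = 0; i < cElements; i++)
        aOut[i] = (fRunningSum += aIn[i]);
    return fRunningSum;
}

template <int nLanes>
__m128 ShiftLanesUp(__m128 m)
{
    return _mm_castsi128_ps(_mm_slli_si128(_mm_castps_si128(m), 4 * nLanes));
}

float SseCumulativeSum(const float* aIn, float* aOut, std::size_t cElements, float fRunningSum = 0)
{
    std::size_t cOnElement = cElements & 3; // leading 0-3 elements use the scalar loop
    fRunningSum = ScalarCumulativeSum(aIn, aOut, cOnElement, fRunningSum);
    __m128 mRunningSum = _mm_set1_ps(fRunningSum);
    while (cOnElement < cElements)
    {
        const __m128 mIn = _mm_loadu_ps(aIn + cOnElement);
        const __m128 mStep1 = _mm_add_ps(mIn, ShiftLanesUp<1>(mIn));
        const __m128 mPrefix = _mm_add_ps(mStep1, ShiftLanesUp<2>(mStep1));
        const __m128 mOut = _mm_add_ps(mRunningSum, mPrefix);
        _mm_storeu_ps(aOut + cOnElement, mOut);
        mRunningSum = _mm_shuffle_ps(mOut, mOut, _MM_SHUFFLE(3, 3, 3, 3));
        cOnElement += 4;
    }
    return _mm_cvtss_f32(mRunningSum);
}

// Hypothetical check: true when both routines agree exactly for every length
// from 0 through cMaxElements, starting from a nonzero running sum.
bool SseMatchesScalar(std::size_t cMaxElements)
{
    for (std::size_t cElements = 0; cElements <= cMaxElements; cElements++)
    {
        std::vector<float> in(cElements), expected(cElements), actual(cElements);
        for (std::size_t i = 0; i < cElements; i++)
            in[i] = static_cast<float>(i % 7); // small integers keep every sum exact
        const float fExpected = ScalarCumulativeSum(in.data(), expected.data(), cElements, 100.0f);
        const float fActual = SseCumulativeSum(in.data(), actual.data(), cElements, 100.0f);
        if (fExpected != fActual || expected != actual)
            return false;
    }
    return true;
}

// Usage: lengths 0 through 33 cover every remainder modulo 4 several times.
inline void ExampleSseMatchesScalar()
{
    assert(SseMatchesScalar(33));
}
```

With arbitrary float data an exact comparison would be too strict, since the SSE version adds the elements in a different order. A tolerance-based comparison would be the natural substitute there.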
The loop could be unrolled a bit, but the benefits would be small, since each iteration still has to wait for the running sum from the previous one.
We will be using this in a little bit.