Saturday, April 22, 2017

Cumulative sum of the float elements in an array using SSE

Summing all of the elements of an array is pretty simple with SSE, and the compiler does a pretty good job of vectorizing a simple sum automatically. A cumulative sum takes a bit more work, because every output depends on the one before it.
I've had some cases where the input is the distance between adjacent points, and we'd like to create an array with the cumulative distance up to each point. The scalar version is straightforward:

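#include <stddef.h> // size_t
#include <sal.h>    // __in_ecount / __out_ecount (MSVC source annotations)

// Scalar reference: aOut[i] = fRunningSumIn + aIn[0] + ... + aIn[i].
// Returns the final running sum so a later call can continue from it.
float CumulativeSum(__in_ecount(cElements) const float * const aIn, __out_ecount(cElements) float * const aOut, size_t cElements, const float fRunningSumIn=0)
{
 float fRunningSum = fRunningSumIn;
 for (size_t iOnElement = 0; iOnElement < cElements; iOnElement++)
 {
  aOut[iOnElement] = (fRunningSum += aIn[iOnElement]);
 }
 return fRunningSum;
}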


The SSE version needs a running sum inside each register as well, which is a horizontal operation and takes a couple of shift-and-add steps.  But that's okay.
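To see why two steps are enough for four floats, here is a minimal scalar sketch of the same shift-and-add idea. `PrefixSum4InPlace` and `DemoPrefixSum4` are made-up names for illustration only, not part of the SSE code.

```cpp
#include <cassert>

// Hypothetical helper: in-place inclusive prefix sum of four floats,
// using two shift-and-add steps on a plain array.
inline void PrefixSum4InPlace(float a[4])
{
    // Step 1: add the neighbor one slot to the left (like shifting by one float).
    // Walking from high to low index keeps a[i - 1] at its original value.
    for (int i = 3; i >= 1; i--)
        a[i] += a[i - 1];

    // Step 2: add the partial sum two slots to the left (like shifting by two floats).
    for (int i = 3; i >= 2; i--)
        a[i] += a[i - 2];
}

// Worked example: {1, 2, 3, 4} -> {1, 3, 5, 7} -> {1, 3, 6, 10}.
inline void DemoPrefixSum4()
{
    float a[4] = { 1, 2, 3, 4 };
    PrefixSum4InPlace(a);
    assert(a[0] == 1 && a[1] == 3 && a[2] == 6 && a[3] == 10);
}
```

After step 1 each slot holds the sum of itself and one neighbor; after step 2 each slot holds the sum of everything up to and including itself.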

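#include <emmintrin.h> // SSE2: _mm_slli_si128 and the cast intrinsics

// Shift the register by nElementsToShift floats toward the higher indices, filling with zeros.
template<const unsigned char nElementsToShift>
__m128 _mm_slli_ps(__m128 m)
{
 return _mm_castsi128_ps(_mm_slli_si128(_mm_castps_si128(m), 4 * nElementsToShift));
}

// Inclusive prefix sum within one register: {a, b, c, d} -> {a, a+b, a+b+c, a+b+c+d}.
__m128 SumRegister(__m128 mIn)
{
 __m128 mAddShift = _mm_add_ps(mIn, _mm_slli_ps<1>(mIn)); // {a, a+b, b+c, c+d}
 return _mm_add_ps(mAddShift, _mm_slli_ps<2>(mAddShift));
}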

float CumulativeSum_SSE(const float * aIn, float * aOut, size_t cElements, float fRunningSum = 0)
{
 // Handle the first (cElements mod 4) elements with the scalar loop,
 // so that what remains is a whole number of 4-float registers.
 size_t cOnElement = cElements & 3;
 fRunningSum = CumulativeSum(aIn, aOut, cOnElement, fRunningSum);

 // All four lanes hold the running sum so far.
 __m128 mRunningSum = _mm_set1_ps(fRunningSum);
 while (cOnElement<cElements)
 {
  __m128 mIn = _mm_loadu_ps(aIn + cOnElement);
  __m128 mOut = _mm_add_ps(mRunningSum, SumRegister(mIn));
  _mm_storeu_ps(aOut + cOnElement, mOut);
  // Broadcast the last lane: it is the running sum for the next four elements.
  mRunningSum = _mm_shuffle_ps(mOut, mOut, _MM_SHUFFLE(3, 3, 3, 3));
  cOnElement += 4;
 }
 return _mm_cvtss_f32(mRunningSum);
}
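
A quick way to gain some confidence in this is to compare it against the scalar loop. The sketch below restates both versions (SAL annotations dropped, names changed) so that it compiles on its own; `SseMatchesScalar` is a hypothetical helper. It uses small integer-valued floats on purpose: the SSE version adds in a different order, so with arbitrary inputs the two results may differ in the last bits.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>
#include <emmintrin.h>

// Scalar reference, same logic as CumulativeSum above.
static float CumulativeSumRef(const float* aIn, float* aOut, size_t cElements, float fRunningSum = 0)
{
    for (size_t i = 0; i < cElements; i++)
        aOut[i] = (fRunningSum += aIn[i]);
    return fRunningSum;
}

// Same in-register prefix sum as SumRegister above.
static __m128 PrefixSumRegister(__m128 m)
{
    __m128 s = _mm_add_ps(m, _mm_castsi128_ps(_mm_slli_si128(_mm_castps_si128(m), 4)));
    return _mm_add_ps(s, _mm_castsi128_ps(_mm_slli_si128(_mm_castps_si128(s), 8)));
}

// Same structure as CumulativeSum_SSE above.
static float CumulativeSumSse(const float* aIn, float* aOut, size_t cElements, float fRunningSum = 0)
{
    size_t i = cElements & 3;
    fRunningSum = CumulativeSumRef(aIn, aOut, i, fRunningSum);
    __m128 mRunningSum = _mm_set1_ps(fRunningSum);
    for (; i < cElements; i += 4)
    {
        __m128 mOut = _mm_add_ps(mRunningSum, PrefixSumRegister(_mm_loadu_ps(aIn + i)));
        _mm_storeu_ps(aOut + i, mOut);
        mRunningSum = _mm_shuffle_ps(mOut, mOut, _MM_SHUFFLE(3, 3, 3, 3));
    }
    return _mm_cvtss_f32(mRunningSum);
}

// Hypothetical check: both versions agree exactly on small integer-valued inputs.
static bool SseMatchesScalar(size_t cElements)
{
    std::vector<float> in(cElements), ref(cElements), sse(cElements);
    for (size_t i = 0; i < cElements; i++)
        in[i] = static_cast<float>(i % 5 + 1); // values 1..5, all sums stay exact in float
    const float fRef = CumulativeSumRef(in.data(), ref.data(), cElements, 10.0f);
    const float fSse = CumulativeSumSse(in.data(), sse.data(), cElements, 10.0f);
    return fRef == fSse && ref == sse;
}
```

Calling `SseMatchesScalar` with lengths such as 0, 3, 4 and 19 exercises the empty case, the scalar-only case, the register-only case and the mixed case.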


The loop could be unrolled a bit, but the benefits would be small, since each iteration still has to wait for the running sum from the previous one.
We will be using this in a little bit.
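
For what it's worth, here is a rough sketch of what unrolling by two might look like. It is untuned and unmeasured, and the names `PrefixSum4` and `CumulativeSum_SSE_Unrolled2` are made up for illustration. The two in-register prefix sums are independent of each other, but the running sum still passes through an add and a shuffle for every four elements.

```cpp
#include <cassert>
#include <cstddef>
#include <emmintrin.h>

// Inclusive prefix sum within one register: {a, b, c, d} -> {a, a+b, a+b+c, a+b+c+d}.
static inline __m128 PrefixSum4(__m128 m)
{
    __m128 s = _mm_add_ps(m, _mm_castsi128_ps(_mm_slli_si128(_mm_castps_si128(m), 4)));
    return _mm_add_ps(s, _mm_castsi128_ps(_mm_slli_si128(_mm_castps_si128(s), 8)));
}

// Hypothetical two-way unrolled variant: eight floats per loop iteration.
static float CumulativeSum_SSE_Unrolled2(const float* aIn, float* aOut, size_t cElements, float fRunningSum = 0)
{
    // Scalar head: the first (cElements mod 8) elements.
    size_t i = 0;
    const size_t cHead = cElements & 7;
    for (; i < cHead; i++)
        aOut[i] = (fRunningSum += aIn[i]);

    __m128 mRunningSum = _mm_set1_ps(fRunningSum);
    for (; i < cElements; i += 8)
    {
        // These two prefix sums do not depend on the running sum or on each other.
        __m128 mSum0 = PrefixSum4(_mm_loadu_ps(aIn + i));
        __m128 mSum1 = PrefixSum4(_mm_loadu_ps(aIn + i + 4));

        // The running sum is still carried serially through both halves.
        __m128 mOut0 = _mm_add_ps(mRunningSum, mSum0);
        __m128 mCarry = _mm_shuffle_ps(mOut0, mOut0, _MM_SHUFFLE(3, 3, 3, 3));
        __m128 mOut1 = _mm_add_ps(mCarry, mSum1);

        _mm_storeu_ps(aOut + i, mOut0);
        _mm_storeu_ps(aOut + i + 4, mOut1);
        mRunningSum = _mm_shuffle_ps(mOut1, mOut1, _MM_SHUFFLE(3, 3, 3, 3));
    }
    return _mm_cvtss_f32(mRunningSum);
}
```

The additions happen in the same order as in the four-at-a-time loop, so for inputs whose length is a multiple of four the outputs should be bit-identical; only the number of elements handled by the scalar head differs.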
