MicroPerf: Finding the index of the maximum value in an array

A bunch of linear algorithms can be sped with SSE. In general SSE allows operations to happen in groups in the time it would normally take a single operation. But the more complicated loop structures will generally mean that the SSE version will be only twice as fast as the Ansi version.

A common problem that comes up is to find the maximum or minimum value in an array.
Here's the Ansi version:

const int * findLargestInt_Ansi(const int * const pA, const UINT cAOrig)
{
    if (cAOrig<=0)
    {
        return nullptr;
    }
    UINT cARemaining = cAOrig;
    cARemaining--;
    const int *pBest = pA + cARemaining;
    int iBestVal = *pBest;
    while (cARemaining > 0)
    {
        cARemaining--;
        int iOnVal = pA[cARemaining];
        if (iOnVal > iBestVal)
        {
            pBest = pA + cARemaining;
            iBestVal = iOnVal;
        }
    }
    return pBest;
}
This can be made faster.

We often want to simulate conditionals in SSE. This will generally be implemented with masking, also known as blending. This generally takes the form of a bitwise

Result = A * C + B * !C

In SSE4.1 this is directly supported with the blend instructions. For SSE2 we have to implement from more basic instructions.
We implement this bitwise blend with
__forceinline __m128i _mm_blend_si128_SSE2(const __m128i &mA, const __m128i &mB, const __m128i &mControl)
{
    return _mm_or_si128(_mm_and_si128(mControl, mA), _mm_andnot_si128( mControl, mB));
}

This as a side effect allows us to change specific elements inside of an XMM register.

This technique is used below:

const int * findLargestInt_SSE2(const int *pA, UINT cAOrig)
{
    if (cAOrig >= 4)
    {
        UINT uiA = cAOrig;
        __m128i mBestValues = _mm_loadu_si128((const __m128i *)pA);
        __m128i mBestIndexBase = _mm_setzero_si128();
        while (uiA > 4)
        {
            uiA-=4;
            __m128i mValues = _mm_loadu_si128((const __m128i *)&pA[uiA]);
            __m128i mCompare = _mm_cmpgt_epi32(mValues, mBestValues);
            if (_mm_movemask_epi8(mCompare))
            {
                __m128i mIndexBase = _mm_set1_epi32(uiA);
                mBestValues = _mm_blend_si128_SSE2(mValues, mBestValues, mCompare);
                mBestIndexBase = _mm_blend_si128_SSE2(mIndexBase, mBestIndexBase, mCompare);
            }
        }
        __m128i mBestIndex = _mm_add_epi32(mBestIndexBase, _mm_setr_epi32(0, 1, 2, 3));
        __m128i mValues, mCompare;
        mValues = _mm_shuffle_epi32(mBestValues, _MM_SHUFFLE(2,3,2,3));
        mCompare = _mm_cmpgt_epi32(mValues, mBestValues);
        mBestValues = _mm_blend_si128_SSE2(mValues, mBestValues, mCompare);
        mBestIndex = _mm_blend_si128_SSE2(_mm_shuffle_epi32(mBestIndex, _MM_SHUFFLE(2, 3, 2, 3)), mBestIndex, mCompare);
        mValues = _mm_shuffle_epi32(mBestValues, _MM_SHUFFLE(1, 1, 1, 1));
        mCompare = _mm_cmpgt_epi32(mValues, mBestValues);
        // mBestValues = _mm_blend_si128_SSE2(mValues, mBestValues, mCompare);
        mBestIndex = _mm_blend_si128_SSE2(_mm_shuffle_epi32(mBestIndex, _MM_SHUFFLE(1,1,1,1)), mBestIndex, mCompare);
        return pA + _mm_cvtsi128_si32(mBestIndex);
    }
    return findLargestInt_Ansi(pA,cAOrig);
}

As we are iterating through the array, we maintain four maximums. Then as a final step the maximum of these four is determined. With the value we store the index of the first of the four elements loaded into the register. The offset from that base is its location in the XMM register.

Once again we have to address the part of the input array that isn't a multiple of the size of the XMM register. In this case we deal with these elements first. This does mean that we will be reading a couple elements twice. But since the maximum operation is idempotent, everything is okay.

This example shows a couple things. The first is how difficult it is to do horizontal operations. In the above example the horizontal operation is finding the largest value in a single register. To perform this we end up comparing the register against itself shifted right two elements, then that result is compared against itself shifted right one element. The bookkeeping is performed so that the index of the maximum value follows with the value.

The calculation of the actual maximum value is commented out in the above code since it is not actually returned from the function.

This example code can trivially be adjusted for other comparison operators on 32bit types including _mm_cmplt_epi32 to find a minimum int, or _mm_cmpgt_ps to find the maximum float, or _mm_cmplt_ps to find the minimum float.

The SSE2 version of the code takes a little under half the time as the Ansi version on longer arrays, and is about the same speed even for small arrays.

MicroPerf

Sunday, December 11, 2016

Finding the index of the maximum value in an array

No comments:

Post a Comment