Thursday, November 17, 2016

SSE in a loop - finding the first instance of an integer in an unsorted list.

Eventually we will need to control program flow from SSE2. 

Generally getting data into or out of the XMM registers is a little difficult.  Controlling program flow will generally require data in one of the general purpose registers.

_mm_cvtsi128_si32 will happily copy a single 32bit number out of a register.

More useful from a performance perspective are the operations that extract data from a larger portion of the XMM register.

_mm_movemask_epi8 and _mm_movemask_ps collect the most significant bits of the contained numbers.  The most significant bit of a float or double is the sign bit.  The XMM comparison routines will set numbers to all zeros or all ones, so the most significant bit will be the simple comparison result.

Some problems come up over and over again.  These are worth looking at further.

Finding a particular number in an unsorted list happens a lot...

const int * FindInt_ANSI(const int iToSearchFor, const int * const plistOrig, const int cListOrig)
{
    int cListRemaining = cListOrig;
    const int *pi = plistOrig;
    while (cListRemaining > 0)
    {
        if (iToSearchFor == *pi)
        {
            return pi;
        }
        pi++;
        cListRemaining--;
    }
    return nullptr;
}

This is just a simple linear search.  We can make a better simple linear search.

__forceinline __m128i m128i_from_ints(const int *pInts)
{
    return _mm_loadu_si128(reinterpret_cast<const __m128i *>(pInts));
}
__forceinline int _mm_movemask_epi32(__m128i a)
{
    return _mm_movemask_ps(_mm_castsi128_ps(a));
}

const int * __cdecl FindInt_SSE2(const int iToSearchFor, const int * const plistOrig, const int cListOrig)
{
#if DBG
    const int *dbgAnsiResult = FindInt_ANSI(iToSearchFor, plistOrig, cListOrig);
#endif // DBG
    const int * pi = plistOrig;
    int cListRemaining = cListOrig;
    if (cListRemaining >= 4)
    {
        __m128i miToSearchFor = _mm_set1_epi32(iToSearchFor);
        do
        {
            __m128i mFourInts = m128i_from_ints(pi);
            int iMoveMaskRet = _mm_movemask_epi32(_mm_cmpeq_epi32(mFourInts, miToSearchFor));
            if (iMoveMaskRet)
            {
                ULONG ulIndexBit;
                _BitScanForward(&ulIndexBit, iMoveMaskRet);
                const int * piRet = pi + ulIndexBit;
                ASSERT(dbgAnsiResult == piRet);
                return piRet;
            }
            pi += 4;
            cListRemaining -= 4;
        } while (cListRemaining >= 4);
    }
    const int * piRet = FindInt_ANSI(iToSearchFor, pi, cListRemaining);
    ASSERT(dbgAnsiResult == piRet);
    return piRet;
}

It's worth mentioning a couple things about this.  The SSE2 instruction set does not have every possible variant of the instructions.  We often need to be clever. 

Often this means a lot of casting.  It turns out that a float is the same size as an int is the same size as a DWORD.  Small __forceinline helper functions that perform small functions like these casting will become important.

Also it's worth noting that we can't do everything with SSE2.  For the edge cases we will end up falling back on the ANSI version of the code.

In this case the bulk of the work is done in SSE2.  In particular the 4*floor(n/4) elements will be handled.  The n%4 elements at the end will be handled by simple ANSI code.

The use of _BitScanForward is fun.  This finds the index of the lowest 1 bit in the input number.  Most professional software developers that I have worked with are surprised to find that this is a single instruction on the intel family of processors.

This code is better than the ANSI version, but is not yet complete.  All of the memory loads are performed unaligned with _mm_loadu_si128.  These memory calls are fastest when they are aligned to 16 byte boundaries.  If we are willing to guarantee 16 byte boundaries we could use a more efficient call, _mm_load_si128.  This instruction will crash if called with a memory location not divisible by 16. 

A memory aligned version of this same function will follow in another post.

No comments:

Post a Comment