Thursday, November 17, 2016

SSE in a loop - finding the first instance of an integer in an unsorted list.

Eventually we will need to control program flow from SSE2. 

Getting data into or out of the XMM registers is generally a little difficult.  Controlling program flow will require data in one of the general purpose registers.

_mm_cvtsi128_si32 will happily copy a single 32bit number out of a register.

More useful from a performance perspective are the operations that extract data from a larger portion of the XMM register.

_mm_movemask_epi8 and _mm_movemask_ps collect the most significant bits of the contained numbers.  The most significant bit of a float or double is the sign bit.  The XMM comparison routines set each element to all zeros or all ones, so the most significant bit is simply the comparison result.
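A minimal sketch of both (the values here are arbitrary, just for illustration):

#include <stdio.h>
#include <emmintrin.h>

int main()
{
    __m128i a  = _mm_set_epi32(4, 3, 2, 1);   // elements 0..3 = 1, 2, 3, 4
    __m128i b  = _mm_set_epi32(9, 3, 9, 1);   // elements 0..3 = 1, 9, 3, 9
    __m128i eq = _mm_cmpeq_epi32(a, b);       // each element becomes all ones or all zeros

    printf("%d\n", _mm_cvtsi128_si32(eq));                    // element 0 only: prints -1
    printf("%d\n", _mm_movemask_ps(_mm_castsi128_ps(eq)));    // one bit per element: prints 5 (b0101)
    return 0;
}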

Some problems come up over and over again.  These are worth looking at further.

Finding a particular number in an unsorted list happens a lot...

const int * FindInt_ANSI(const int iToSearchFor, const int * const plistOrig, const int cListOrig)
{
    int cListRemaining = cListOrig;
    const int *pi = plistOrig;
    while (cListRemaining > 0)
    {
        if (iToSearchFor == *pi)
        {
            return pi;
        }
        pi++;
        cListRemaining--;
    }
    return nullptr;
}

This is just a simple linear search.  We can make a better simple linear search.

__forceinline __m128i m128i_from_ints(const int *pInts)
{
    return _mm_loadu_si128(reinterpret_cast<const __m128i *>(pInts));
}
__forceinline int _mm_movemask_epi32(__m128i a)
{
    return _mm_movemask_ps(_mm_castsi128_ps(a));
}

const int * __cdecl FindInt_SSE2(const int iToSearchFor, const int * const plistOrig, const int cListOrig)
{
#if DBG
    const int *dbgAnsiResult = FindInt_ANSI(iToSearchFor, plistOrig, cListOrig);
#endif // DBG
    const int * pi = plistOrig;
    int cListRemaining = cListOrig;
    if (cListRemaining >= 4)
    {
        __m128i miToSearchFor = _mm_set1_epi32(iToSearchFor);
        do
        {
            __m128i mFourInts = m128i_from_ints(pi);
            int iMoveMaskRet = _mm_movemask_epi32(_mm_cmpeq_epi32(mFourInts, miToSearchFor));
            if (iMoveMaskRet)
            {
                ULONG ulIndexBit;
                _BitScanForward(&ulIndexBit, iMoveMaskRet);
                const int * piRet = pi + ulIndexBit;
                ASSERT(dbgAnsiResult == piRet);
                return piRet;
            }
            pi += 4;
            cListRemaining -= 4;
        } while (cListRemaining >= 4);
    }
    const int * piRet = FindInt_ANSI(iToSearchFor, pi, cListRemaining);
    ASSERT(dbgAnsiResult == piRet);
    return piRet;
}

It's worth mentioning a couple of things about this.  The SSE2 instruction set does not have every possible variant of the instructions, so we often need to be clever.

Often this means a lot of casting.  It turns out that a float is the same size as an int is the same size as a DWORD.  Small __forceinline helper functions that perform tiny tasks like these casts will become important.

Also it's worth noting that we can't do everything with SSE2.  For the edge cases we will end up falling back on the ANSI version of the code.

In this case the bulk of the work is done in SSE2.  In particular the first 4*floor(n/4) elements will be handled there.  The remaining n%4 elements at the end are handled by the simple ANSI code.

The use of _BitScanForward is fun.  This finds the index of the lowest set bit in the input number.  Most professional software developers I have worked with are surprised to find that this is a single instruction on the Intel family of processors.
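As a quick illustration (the input value is arbitrary):

#include <intrin.h>

unsigned long ulIndexBit;
unsigned char fFound = _BitScanForward(&ulIndexBit, 12);   // 12 = b1100
// fFound is nonzero and ulIndexBit == 2, because bit 2 is the lowest set bit.
// With an input of 0 nothing is found and fFound is 0.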

This code is better than the ANSI version, but it is not yet complete.  All of the memory loads are performed unaligned with _mm_loadu_si128.  These loads are fastest when they are aligned to 16 byte boundaries.  If we are willing to guarantee 16 byte alignment we could use a more efficient call, _mm_load_si128.  That instruction will crash if called with a memory address not divisible by 16.
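A small sketch of what the tradeoff looks like (the helper name and the alignment test here are mine, not the final approach):

#include <stdint.h>
#include <emmintrin.h>

// Sketch only: pick the aligned load when the address allows it.
__forceinline __m128i m128i_from_ints_checked(const int *pInts)
{
    // _mm_load_si128 faults unless the address is a multiple of 16;
    // _mm_loadu_si128 accepts any address but can be slower.
    if ((reinterpret_cast<uintptr_t>(pInts) & 15) == 0)
    {
        return _mm_load_si128(reinterpret_cast<const __m128i *>(pInts));
    }
    return _mm_loadu_si128(reinterpret_cast<const __m128i *>(pInts));
}

Testing the pointer on every load would defeat much of the purpose; a real aligned version would want to establish alignment once, up front.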

A memory aligned version of this same function will follow in another post.

Sunday, November 13, 2016

Let's do something interesting with a RECT

While comparing two rectangles for equality is a nice place to start, it doesn't really provide much benefit over a standard memcmp.

But let's look at a problem that can benefit from the parallel nature of SSE2.

IsRectEmpty is a simple but useful function.

In standard ANSI code it looks something like:

BOOL IsRectEmpty_Ansi(const RECT *pThis)
{
    return (pThis->top >= pThis->bottom) || (pThis->left >= pThis->right);
}


But we can do a little better with SSE2:

__forceinline __m128i m128i_from_RECT_SSE2(__in_ecount(1) const RECT *prc)
{
    return _mm_loadu_si128((__m128i*)prc);
}


BOOL IsRectEmpty_SSE2(const RECT *pThis)
{
    __m128i rthis = m128i_from_RECT_SSE2(pThis);
    // rthis=[pThis->left,pThis->top,pThis->right,pThis->bottom]
    __m128i mTltl = _mm_shuffle_epi32(rthis, _MM_SHUFFLE(1,0,1,0));
    // mTltl=[pThis->left,pThis->top,pThis->left,pThis->top]
    __m128i mCompare = _mm_cmpgt_epi32(rthis, mTltl);
    // mCompare=[0,0,(pThis->right>pThis->left),(pThis->bottom>pThis->top)]
    int iCompare = _mm_movemask_epi32(mCompare);
    // iCompare = 4*(pThis->right>pThis->left) + 8*(pThis->bottom>pThis->top)
    return 12 ^ iCompare;
}


This has a single large read from memory.  It is branchless.
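A quick sanity check against the ANSI version (values arbitrary).  Note that the SSE2 version returns a nonzero BOOL rather than exactly 1 when the rectangle is empty, hence the !! below:

RECT rcEmpty    = { 10, 10, 10, 20 };   // left == right, so empty
RECT rcNonEmpty = { 10, 10, 20, 20 };

// IsRectEmpty_Ansi(&rcEmpty)    == 1    !!IsRectEmpty_SSE2(&rcEmpty)    == 1
// IsRectEmpty_Ansi(&rcNonEmpty) == 0    !!IsRectEmpty_SSE2(&rcNonEmpty) == 0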

This same technique can be used for another useful rectangle function, PtInRect.

__forceinline __m128i m128i_twice_from_point_SSE2(const POINT *ppt)
{
    return _mm_castpd_si128(_mm_load1_pd((double*)ppt));
    // return [ppt->x,ppt->y,ppt->x,ppt->y]
}

BOOL PtInRect_SSE2(const RECT *pThis, const POINT *ppt)
{
    __m128i rthis = m128i_from_RECT_SSE2(pThis);
    // rthis=[pThis->left,pThis->top,pThis->right,pThis->bottom]

    __m128i ptpt = m128i_twice_from_point_SSE2(ppt);
    // ptpt=[ppt->x,ppt->y,ppt->x,ppt->y]

    __m128i allComparisons = _mm_cmplt_epi32(ptpt, rthis);
    // looking for b1100
    bool bRet = (12 == _mm_movemask_ps(_mm_castsi128_ps(allComparisons)));
    // _mm_movemask_ps takes the top bit (sign) of each of the four 32bit numbers
    // (it doesn't matter that the intrinsic is designed for floats).
    // This returns a 4 bit number where the four bits will be:
    //   bit 0: ppt->x < pThis->left   i.e. !(ppt->x >= pThis->left)
    //   bit 1: ppt->y < pThis->top    i.e. !(ppt->y >= pThis->top)
    //   bit 2: ppt->x < pThis->right
    //   bit 3: ppt->y < pThis->bottom
    // So we want the bottom two bits to be zero and the top two bits to be one,
    // meaning that the four bit number should be b1100 = 12.

    return bRet;
}
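Like the Win32 PtInRect, the left and top edges are inclusive and the right and bottom edges are exclusive.  A quick check (values arbitrary):

RECT  rc         = { 10, 10, 20, 20 };
POINT ptTopLeft  = { 10, 10 };   // on the left/top edge: inside
POINT ptBotRight = { 20, 20 };   // on the right/bottom edge: outside

// PtInRect_SSE2(&rc, &ptTopLeft)  != 0
// PtInRect_SSE2(&rc, &ptBotRight) == 0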

Viewing the generated assembly for this makes me happy:

movdqu      xmm0,xmmword ptr [rcx] 
xor         eax,eax 
movsd       xmm1,mmword ptr [rdx] 
shufpd      xmm1,xmm1,0 
pcmpgtd     xmm0,xmm1 
movmskps    ecx,xmm0 
cmp         ecx,0Ch 
sete        al 
ret 

Friday, November 11, 2016

__m128i registers are exactly the same size as a RECT

SSE2 uses 128bit registers.  These can often be seen as collections of 8bit, 16bit, 32bit, 64bit, or in a few cases as 128bit numbers.  In the integer case this is done with the __m128i data type.

typedef union __declspec(intrin_type) __declspec(align(16)) __m128i {
    __int8              m128i_i8[16];
    __int16             m128i_i16[8];
    __int32             m128i_i32[4];
    __int64             m128i_i64[2];
    unsigned __int8     m128i_u8[16];
    unsigned __int16    m128i_u16[8];
    unsigned __int32    m128i_u32[4];
    unsigned __int64    m128i_u64[2];
} __m128i;

On Windows rectangles are a collection of 4 32bit numbers.

typedef struct tagRECT
{
    LONG    left;
    LONG    top;
    LONG    right;
    LONG    bottom;
} RECT, *PRECT, NEAR *NPRECT, FAR *LPRECT;


This happens to be exactly the size of a __m128i register on SSE2.
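That claim is cheap to check at compile time (static_assert here; the Windows C_ASSERT macro would work as well):

#include <windows.h>
#include <emmintrin.h>

static_assert(sizeof(RECT) == sizeof(__m128i), "a RECT fills an XMM register exactly");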

So let's test for equality.

SSE programming tends to be a process of building up small functions to perform small tasks.  These are basically used as macros.  To prevent function call overhead from dominating these tiny functions, inlining is forced.  Optimizers have gotten good at exactly these types of coding patterns and the __forceinline directive has gone out of style, but this is a case where it is absolutely essential.

__forceinline __m128i m128i_from_RECT_SSE2(const RECT *prc)
{
    return _mm_loadu_si128((__m128i*)prc);
}


Testing for equality of two rects is a common task... In particular this is performed by many library functions, including "Rect.Equality Operator".
There are comparison operations.  And there's a special fun operation that takes an SSE register and returns a value in a normal register.

The special fun operator is _mm_movemask_ps and its cousins including _mm_movemask_epi8.

These take the high bit of each of the elements in the large register, and merge them together into a single number. 

_mm_movemask_ps is intended for float elements; it is described as collecting the sign bits.  The sign bit happens to just be the high bit of the 32bit value, so it is equally useful for checking the results of integer comparisons.
It's often convenient to name helper functions using the same pattern as the intrinsics, so we create the helper function:

__forceinline int _mm_movemask_epi32(__m128i a)
{
    return _mm_movemask_ps(_mm_castsi128_ps(a));
}

bool RectEquality_SSE2(const RECT *prcA, const RECT *prcB)
{
    __m128i mRectA = m128i_from_RECT_SSE2(prcA);
    __m128i mRectB = m128i_from_RECT_SSE2(prcB);
    __m128i mRectsEqualAB = _mm_cmpeq_epi32(mRectA, mRectB);
    int iWhichValuesEqual = _mm_movemask_epi32(mRectsEqualAB);
    bool bRet = (15 == iWhichValuesEqual);
    return bRet;
}

And this seems useful.
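A quick check (values arbitrary):

RECT rcA = { 1, 2, 3, 4 };
RECT rcB = { 1, 2, 3, 4 };
RECT rcC = { 1, 2, 3, 5 };

// RectEquality_SSE2(&rcA, &rcB) == true     (movemask is b1111 = 15)
// RectEquality_SSE2(&rcA, &rcC) == false    (bottom differs, so the movemask is b0111 = 7)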

Let's look at the generated assembly language for this function (including its inlined helpers):


movdqu      xmm1,xmmword ptr [rcx] 
movdqu      xmm0,xmmword ptr [rdx] 
pcmpeqd     xmm1,xmm0 
movmskps    eax,xmm1 
cmp         eax,0Fh 
sete        al 
ret
 

This is pretty tight.

Micro perf

I used to do performance work professionally.  Some of it was really satisfying. 
The interconnecting pieces of a well optimized piece of code fit together like a sliding block puzzle.

Chips got faster and lots of the old styles of optimization became less important.
But then the Intel SSE instruction set and its follow-ups were introduced.  The instruction set has a limited set of instructions that operate on very large (128bit) registers, fairly quickly.

But everything old is new again.  And a lot of the stuff that was on the bleeding performance edge two decades ago is now relevant and exciting again.

In particular I'm a big fan of the book "Hacker's Delight".  This is one of the only books mentioned in the text of Knuth volume 4.

But this stuff is only really important for small scale perf.  This is not algorithmic optimization.  Sometimes this is called peephole optimization.  I call it micro perf.

I will be referencing a lot of instructions with deeper explanations in https://software.intel.com/sites/landingpage/IntrinsicsGuide/