Friday, November 11, 2016

__m128i registers are exactly the same size as a RECT

SSE2 uses 128-bit registers.  Depending on the instruction, a register is treated as a collection of 8-bit, 16-bit, 32-bit, or 64-bit elements, or in a few cases as a single 128-bit number.  For integer data, the register is represented by the __m128i data type.

typedef union __declspec(intrin_type) __declspec(align(16)) __m128i {
    __int8              m128i_i8[16];
    __int16             m128i_i16[8];
    __int32             m128i_i32[4];
    __int64             m128i_i64[2];
    unsigned __int8     m128i_u8[16];
    unsigned __int16    m128i_u16[8];
    unsigned __int32    m128i_u32[4];
    unsigned __int64    m128i_u64[2];
} __m128i;

On Windows, a rectangle is a collection of four 32-bit numbers (a LONG is a 32-bit signed integer).

typedef struct tagRECT
{
    LONG    left;
    LONG    top;
    LONG    right;
    LONG    bottom;
} RECT, *PRECT, NEAR *NPRECT, FAR *LPRECT;


Four 32-bit fields add up to 128 bits, which happens to be exactly the size of an SSE2 __m128i register.
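The size claim can be checked at compile time. The following is a minimal sketch, assuming a stand-in struct named RectLike (a hypothetical name, not part of the Windows headers) with the same four 32-bit fields as RECT, so that it builds without windows.h.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics; defines __m128i
#include <cstddef>      // offsetof
#include <cstdint>

// Hypothetical stand-in for the Windows RECT: four 32-bit signed fields.
struct RectLike {
    std::int32_t left;
    std::int32_t top;
    std::int32_t right;
    std::int32_t bottom;
};

// Four 32-bit fields with no padding make 16 bytes, the size of one register.
static_assert(sizeof(RectLike) == 16, "four 32-bit fields, no padding");
static_assert(sizeof(RectLike) == sizeof(__m128i), "a rect fills one register");

// The fields sit at consecutive 4-byte offsets, matching 32-bit elements 0..3.
static_assert(offsetof(RectLike, left) == 0, "left is element 0");
static_assert(offsetof(RectLike, top) == 4, "top is element 1");
static_assert(offsetof(RectLike, right) == 8, "right is element 2");
static_assert(offsetof(RectLike, bottom) == 12, "bottom is element 3");

// The sizes match, but the alignment does not: the struct is only 4-byte
// aligned while __m128i is 16-byte aligned, so loads must tolerate that.
constexpr bool kSameSize = sizeof(RectLike) == sizeof(__m128i);
constexpr bool kSameAlignment = alignof(RectLike) == alignof(__m128i);
```

Only the size matches; the alignment difference is why an unaligned load is the safe choice when reading a rect into a register.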

So let's test for equality.

SSE programming tends to be a process of building up small functions to perform small tasks.  These are basically used as macros.  To prevent function-call overhead from dominating these tiny functions, inlining is forced.  While the __forceinline directive has gone out of style, this is a case where it is absolutely essential.  Optimizers have gotten good at exactly these kinds of coding patterns.

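// Load the four LONGs of a RECT into one SSE2 register.
// A RECT is only guaranteed 4-byte alignment, so use the unaligned load.
__forceinline __m128i m128i_from_RECT_SSE2(const RECT *prc)
{
    return _mm_loadu_si128((const __m128i*)prc);
}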
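To see which field ends up in which element, here is a small round-trip sketch. RectLike, m128i_from_rect, rect_from_m128i and first_element are hypothetical names introduced for illustration; the struct stands in for RECT so the code builds without windows.h, and plain `static inline` stands in for __forceinline.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cstdint>

// Hypothetical stand-in for the Windows RECT: four 32-bit signed fields.
struct RectLike {
    std::int32_t left;
    std::int32_t top;
    std::int32_t right;
    std::int32_t bottom;
};

// Same idea as the loader above: one unaligned 128-bit load.
static inline __m128i m128i_from_rect(const RectLike *prc)
{
    return _mm_loadu_si128(reinterpret_cast<const __m128i *>(prc));
}

// Hypothetical inverse helper: one unaligned 128-bit store.
static inline void rect_from_m128i(RectLike *prc, __m128i v)
{
    _mm_storeu_si128(reinterpret_cast<__m128i *>(prc), v);
}

// Element 0 is the field at the lowest address, which is left.
static inline std::int32_t first_element(__m128i v)
{
    return _mm_cvtsi128_si32(v);
}
```

With this layout, left, top, right and bottom map to 32-bit elements 0, 1, 2 and 3 of the register.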


Testing two rects for equality is a common task.  In particular, it is performed by many library functions, including "Rect.Equality Operator".

SSE2 has comparison operators.  And there's a special fun operator that takes an SSE register and returns a normal register.

The special fun operator is _mm_movemask_ps and its cousins including _mm_movemask_epi8.

These take the high bit of each of the elements in the large register, and merge them together into a single number. 
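As a small illustration of this high-bit gathering, the sketch below builds a register in which 32-bit elements 0 and 2 are all ones and the others are zero, then collects the high bits at byte and at 32-bit granularity. The helper names are hypothetical. The _mm_castsi128_ps call only reinterprets the register for the type system and does not change any bits.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics (also pulls in the SSE float ones)

// 16-bit result: bit i is the high bit of byte i.
static inline int high_bits_of_bytes(__m128i v)
{
    return _mm_movemask_epi8(v);
}

// 4-bit result: bit i is the high bit of 32-bit element i.
static inline int high_bits_of_dwords(__m128i v)
{
    return _mm_movemask_ps(_mm_castsi128_ps(v));
}

// _mm_set_epi32 takes its arguments from element 3 down to element 0,
// so this sets elements 0 and 2 to all ones and elements 1 and 3 to zero.
static inline __m128i sample_vector()
{
    return _mm_set_epi32(0, -1, 0, -1);
}
```

For this sample, high_bits_of_dwords returns 5 (binary 0101) and high_bits_of_bytes returns 0x0F0F, since bytes 0 to 3 and 8 to 11 have their high bit set.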

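_mm_movemask_ps is intended for float elements; it is described as checking the sign bit.  This happens to be just the high bit of the 32-bit value, so it can be useful for checking the results of integer comparisons.  An integer comparison such as _mm_cmpeq_epi32 sets every bit of an element that compares equal and clears every bit of one that does not, so the high bit alone carries the result.

It's often convenient to name helper functions using the same pattern as the intrinsics, so we create the helper function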

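// Gather the high bit of each 32-bit element into bits 0..3 of the result.
// The cast only reinterprets the register; it generates no instruction.
__forceinline int _mm_movemask_epi32(__m128i a)
{
    return _mm_movemask_ps(_mm_castsi128_ps(a));
}

bool RectEquality_SSE2(const RECT *prcA, const RECT *prcB)
{
    __m128i mRectA = m128i_from_RECT_SSE2(prcA);
    __m128i mRectB = m128i_from_RECT_SSE2(prcB);
    // Each 32-bit element becomes all ones where the fields match, else zero.
    __m128i mRectsEqualAB = _mm_cmpeq_epi32(mRectA, mRectB);
    int iWhichValuesEqual = _mm_movemask_epi32(mRectsEqualAB);
    // 15 is binary 1111: left, top, right, and bottom all compared equal.
    bool bRet = (15 == iWhichValuesEqual);
    return bRet;
}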
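The intermediate mask is informative on its own: bit 0 corresponds to left, bit 1 to top, bit 2 to right, and bit 3 to bottom. Below is a self-contained sketch of the same technique that exposes the mask and checks it against a plain scalar comparison. RectLike and the function names are hypothetical stand-ins chosen so the code builds without windows.h, and `static inline` replaces __forceinline.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cstdint>

// Hypothetical stand-in for the Windows RECT: four 32-bit signed fields.
struct RectLike {
    std::int32_t left;
    std::int32_t top;
    std::int32_t right;
    std::int32_t bottom;
};

// Returns a 4-bit mask; bit i is set when 32-bit field i is equal in a and b.
static inline int equal_fields_mask(const RectLike *a, const RectLike *b)
{
    __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i *>(a));
    __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i *>(b));
    __m128i eq = _mm_cmpeq_epi32(va, vb);  // all ones per equal element
    return _mm_movemask_ps(_mm_castsi128_ps(eq));
}

// The rects are equal when all four bits are set.
static inline bool rects_equal_sse2(const RectLike *a, const RectLike *b)
{
    return equal_fields_mask(a, b) == 0xF;
}

// Scalar reference for comparison.
static inline bool rects_equal_scalar(const RectLike *a, const RectLike *b)
{
    return a->left == b->left && a->top == b->top &&
           a->right == b->right && a->bottom == b->bottom;
}
```

For example, two rects that differ only in top produce the mask 13 (binary 1101), so the equality test fails while the mask still shows that the other three fields match.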

And this seems useful.

Let's look at the generated assembly language for this function (including its inlined helpers):


movdqu      xmm1,xmmword ptr [rcx] 
movdqu      xmm0,xmmword ptr [rdx] 
pcmpeqd     xmm1,xmm0 
movmskps    eax,xmm1 
cmp         eax,0Fh 
sete        al 
ret
 

This is pretty tight.
