typedef union __declspec(intrin_type) __declspec(align(16)) __m128i {
__int8 m128i_i8[16];
__int16 m128i_i16[8];
__int32 m128i_i32[4];
__int64 m128i_i64[2];
unsigned __int8 m128i_u8[16];
unsigned __int16 m128i_u16[8];
unsigned __int32 m128i_u32[4];
unsigned __int64 m128i_u64[2];
} __m128i;
On Windows, a rectangle is a collection of four 32-bit numbers.
typedef struct tagRECT
{
LONG left;
LONG top;
LONG right;
LONG bottom;
} RECT, *PRECT, NEAR *NPRECT, FAR *LPRECT;
At 16 bytes, this happens to be exactly the size of an SSE2 __m128i value.
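The size match can be checked at compile time. This is only a sketch: RectLike is a hypothetical stand-in for the Windows RECT (where LONG is a 32-bit signed integer), used so the snippet does not depend on the Windows headers.

```cpp
#include <cstdint>
#include <emmintrin.h>  // SSE2: __m128i

// Hypothetical stand-in for the Windows RECT; LONG is 32 bits there.
struct RectLike {
    std::int32_t left;
    std::int32_t top;
    std::int32_t right;
    std::int32_t bottom;
};

// Compilation fails if the struct ever stops filling exactly one __m128i.
static_assert(sizeof(RectLike) == 16, "RectLike must be 16 bytes");
static_assert(sizeof(RectLike) == sizeof(__m128i),
              "RectLike must be the same size as __m128i");
```

Note that the sizes match but the alignment does not: the struct only needs 4-byte alignment, which is why an unaligned load is the safe choice.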
So let's test for equality.
SSE programming tends to be a process of building up small functions to perform small tasks. These are basically used as macros. To prevent function call overhead from dominating these tiny functions, inlining is forced. While it has gone out of style to use the directive __forceinline, this is a case where it is absolutely essential. Optimizers have gotten good at exactly these types of coding patterns.
__forceinline __m128i m128i_from_RECT_SSE2(const RECT *prc)
{
return _mm_loadu_si128((const __m128i*)prc);
}
Testing two rects for equality is a common task. In particular, many library functions perform it, including the "Rect.Equality Operator".
There are comparison operators. And there's a special fun operator that takes an SSE register and returns its result in a normal register.
The special fun operator is _mm_movemask_ps and its cousins, including _mm_movemask_epi8.
These take the high bit of each element in the large register and merge them together into a single number.
_mm_movemask_ps is intended for float elements and is described as checking the sign bit. The sign bit happens to be just the high bit of the 32-bit value, so the intrinsic is also useful for checking the results of integer comparisons.
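To make the bit-gathering concrete, here is a minimal sketch. The names equal_lanes_mask and check_equal_lanes_mask are hypothetical helpers invented for this illustration. Bit i of the result is set when 32-bit lane i of the two inputs matches.

```cpp
#include <cassert>
#include <cstdint>
#include <emmintrin.h>  // SSE2 intrinsics

// Hypothetical helper: 4-bit mask, bit i set when a[i] == b[i].
static inline int equal_lanes_mask(const std::int32_t *a, const std::int32_t *b)
{
    __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i *>(a));
    __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i *>(b));
    // Each 32-bit lane becomes 0xFFFFFFFF when equal and 0 otherwise.
    __m128i eq = _mm_cmpeq_epi32(va, vb);
    // Gather the high bit of each 32-bit lane into bits 0..3 of an int.
    return _mm_movemask_ps(_mm_castsi128_ps(eq));
}

// Small self-check showing the masks to expect.
static inline void check_equal_lanes_mask()
{
    const std::int32_t a[4] = {1, 2, 3, 4};
    const std::int32_t b[4] = {1, 9, 3, 4};
    assert(equal_lanes_mask(a, a) == 0xF);  // all four lanes match
    assert(equal_lanes_mask(a, b) == 0xD);  // lanes 0, 2, 3 match: binary 1101
}
```

Because an equal lane is all ones and an unequal lane is all zeros, the high bit alone is enough to tell them apart.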
It's often convenient to name helper functions using the same pattern as the intrinsics, so we create this helper function:
__forceinline int _mm_movemask_epi32(__m128i a)
{
return _mm_movemask_ps(_mm_castsi128_ps(a));
}
bool RectEquality_SSE2(const RECT *prcA, const RECT *prcB)
{
__m128i mRectA = m128i_from_RECT_SSE2(prcA);
__m128i mRectB = m128i_from_RECT_SSE2(prcB);
__m128i mRectsEqualAB = _mm_cmpeq_epi32(mRectA, mRectB);
int iWhichValuesEqual = _mm_movemask_epi32(mRectsEqualAB);
bool bRet = (15 == iWhichValuesEqual);
return bRet;
}
And this seems useful.
Let's look at the generated assembly language for this function (including its inlined helpers):
movdqu xmm1,xmmword ptr [rcx]
movdqu xmm0,xmmword ptr [rdx]
pcmpeqd xmm1,xmm0
movmskps eax,xmm1
cmp eax,0Fh
sete al
ret
This is pretty tight.
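For comparison, the _mm_movemask_epi8 cousin can do the same job without the cast helper. The following is a sketch rather than the version measured above: RectLike is a hypothetical stand-in for RECT, and RectLikeEquality_epi8 is a name made up for this example. It compares byte by byte, so the mask has 16 bits instead of 4.

```cpp
#include <cassert>
#include <cstdint>
#include <emmintrin.h>  // SSE2 intrinsics

// Hypothetical stand-in for the Windows RECT (four 32-bit signed integers).
struct RectLike {
    std::int32_t left, top, right, bottom;
};

// Equality test using a byte-wise compare and _mm_movemask_epi8,
// which yields one bit per byte (16 bits in total).
static inline bool RectLikeEquality_epi8(const RectLike *a, const RectLike *b)
{
    __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i *>(a));
    __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i *>(b));
    __m128i eq = _mm_cmpeq_epi8(va, vb);     // 0xFF for every matching byte
    return _mm_movemask_epi8(eq) == 0xFFFF;  // true only if all 16 bytes match
}

// Small self-check of the variant.
static inline void check_RectLikeEquality_epi8()
{
    const RectLike r1 = {1, 2, 3, 4};
    const RectLike r2 = {1, 2, 3, 4};
    const RectLike r3 = {1, 2, 3, 5};
    assert(RectLikeEquality_epi8(&r1, &r2));
    assert(!RectLikeEquality_epi8(&r1, &r3));
}
```

For a pure equality test the two approaches give the same answer. The 32-bit version is the better fit when you also want to know which of the four fields differ.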