Monday, December 12, 2016

string processing using SSE2 - making sure we don't crash

Much of the time we know the size of the array to be examined.  But there is a common case when we don't know the size ahead of time, and that case is strings.  String length in C++ is determined at runtime by the location of a terminating null character.

Ordinarily it is very bad to read before or past the end of a buffer; doing so can crash. There is one narrow case where a buffer overrun or underrun is guaranteed not to crash: when the out-of-bounds access stays within the same cache line.

Acquiring the cache line size is not portable between AMD and Intel. The cache line size has generally been 64 bytes in recent years, but on every CPU that supports SSE2 it is at least 16 bytes. This means that if we are able to access a memory location p, we are able to access the entire 16-byte block starting at (p & ~0xf).

Loading a single byte of memory from a cache line takes the same amount of time as loading all of the bytes from a cache line. 

size_t __cdecl strlen_SSE2(const char *szOrig)
{
    const char * sz = szOrig;
    __m128i mmZero = _mm_setzero_si128();
    ULONG_PTR byteOffsetFromAligned = (15 & (ULONG_PTR)sz);
    if (byteOffsetFromAligned)
    {
        // Back up to the start of the 16-byte block containing sz.
        ULONG_PTR pAlignedsz = ((ULONG_PTR)sz) - byteOffsetFromAligned;
        ULONG iMoveMaskRet;
        // Compare the block against zero, then mask off the bytes
        // that precede sz within the block.
        iMoveMaskRet = _mm_movemask_epi8(_mm_cmpeq_epi8(_mm_load_si128((__m128i *)pAlignedsz), mmZero)) & ((1 << 16) - (1 << byteOffsetFromAligned));
        if (iMoveMaskRet)
        {
            // The terminator is in this first, partial block.
            ULONG ulIndexBit;
            _BitScanForward(&ulIndexBit, iMoveMaskRet);
            const size_t retLength = (ulIndexBit - byteOffsetFromAligned);
            return retLength;
        }
        sz = 16 + (const char *)pAlignedsz;
    }
    // Scan aligned 16-byte blocks until one contains a null byte.
    ULONG iMoveMaskRet;
    while (!(iMoveMaskRet = _mm_movemask_epi8(_mm_cmpeq_epi8(_mm_load_si128((__m128i *)sz), mmZero))))
    {
        sz += 16;
    }

    // The index of the lowest set bit is the terminator's offset
    // within the final block.
    ULONG ulIndexBit;
    _BitScanForward(&ulIndexBit, iMoveMaskRet);
    const size_t retLength = (sz - szOrig) + ulIndexBit;
    return retLength;
}

If the function is called on a memory location that is not 16-byte aligned, it backs up to the start of the enclosing 16-byte block. In that case we must ignore any null characters that fall before the pointer the caller passed in. This is done by masking the movemask result, in particular
& ((1 << 16) - (1 << byteOffsetFromAligned))

We are looking for the first null character.  The core of this function is the loop
while (!(iMoveMaskRet = _mm_movemask_epi8(_mm_cmpeq_epi8(_mm_load_si128((__m128i *)sz), mmZero))))
{
    sz += 16;
}
This loads a 16 byte block from aligned memory.  It compares each of the bytes to null.  It exits the loop if any of the bytes is null.  If none of the 16 bytes are null, it advances to the next aligned 16 byte block.

To find the first null byte we need the index of the lowest set bit in the result of _mm_movemask_epi8. This bit-scan functionality is exposed by the _BitScanForward intrinsic.

The compiled assembly is a very tight central loop:
01001060  movaps      xmm0,xmmword ptr [edx] 
01001063  pcmpeqb     xmm0,xmm1 
01001067  pmovmskb    eax,xmm0 
0100106B  test        eax,eax 
0100106D  jne         strlen_SSE2+83h (01001083h) 
0100106F  nop 
01001070  movaps      xmm0,xmmword ptr [edx+10h] 
01001074  add         edx,10h 
01001077  pcmpeqb     xmm0,xmm1 
0100107B  pmovmskb    eax,xmm0 
0100107F  test        eax,eax 
01001081  je          strlen_SSE2+70h (01001070h) 


Overall this routine takes about half the time of the standard strlen.  In x64 code it ends up being faster for all but empty strings.  In x86 code it shows the same speedup for long strings, but is in fact a little slower for short strings; on x86 the break-even point seems to be around 15 characters.

This is easily adapted for use with wide-character (16-bit) strings as a replacement for wcslen.
