Ordinarily it is very bad to read before or after a buffer; doing so can cause crashes. There is one very limited case where we can be guaranteed that a buffer overrun or buffer underrun will not result in a crash: when the out-of-bounds read stays within the same cache line.
Querying the cache line size is not portable between AMD and Intel. In recent years the cache line size has generally been 64 bytes, but on every CPU that supports SSE2 it is at least 16 bytes. This means that if we are able to access a memory location p, we are able to access the entire 16 byte block starting at (p & ~0xf).
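As a concrete illustration of that arithmetic (the address below is purely hypothetical), masking off the low four bits of a pointer gives the start of its 16 byte block:

#include <stdio.h>

int main(void)
{
    unsigned long p = 0x1003;                  /* hypothetical address of an accessible byte */
    unsigned long blockStart = p & ~0xfUL;     /* 0x1000: start of the containing 16 byte block */
    unsigned long offset = p & 0xfUL;          /* 3: position of p within that block */
    printf("block 0x%lx, offset %lu\n", blockStart, offset);
    return 0;
}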
Loading a single byte of memory from a cache line takes the same amount of time as loading all of the bytes from a cache line.
#include <windows.h>     /* ULONG, ULONG_PTR */
#include <intrin.h>      /* _BitScanForward */
#include <emmintrin.h>   /* SSE2 intrinsics */

size_t __cdecl strlen_SSE2(const char *szOrig)
{
    const char *sz = szOrig;
    __m128i mmZero = _mm_setzero_si128();
    ULONG_PTR byteOffsetFromAligned = (15 & (ULONG_PTR)sz);
    if (byteOffsetFromAligned)
    {
        /* Back up to the start of the 16 byte block that contains sz. */
        ULONG_PTR pAlignedsz = ((ULONG_PTR)sz) - byteOffsetFromAligned;
        ULONG iMoveMaskRet;
        /* Compare the whole block against zero, then mask off the bits for
           bytes that precede sz so any stray nulls there are ignored. */
        iMoveMaskRet = _mm_movemask_epi8(_mm_cmpeq_epi8(_mm_load_si128((__m128i *)pAlignedsz), mmZero)) & ((1 << 16) - (1 << byteOffsetFromAligned));
        if (iMoveMaskRet)
        {
            ULONG ulIndexBit;
            _BitScanForward(&ulIndexBit, iMoveMaskRet);
            const size_t retLength = (ulIndexBit - byteOffsetFromAligned);
            return retLength;
        }
        sz = 16 + (const char *)pAlignedsz;
    }
    ULONG iMoveMaskRet;
    /* Scan one aligned 16 byte block per iteration until a block contains a null byte. */
    while (!(iMoveMaskRet = _mm_movemask_epi8(_mm_cmpeq_epi8(_mm_load_si128((__m128i *)sz), mmZero))))
    {
        sz += 16;
    }
    ULONG ulIndexBit;
    _BitScanForward(&ulIndexBit, iMoveMaskRet);
    const size_t retLength = (sz - szOrig) + ulIndexBit;
    return retLength;
}
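A minimal usage sketch (the test string is made up), checking the routine against the CRT strlen:

#include <assert.h>
#include <string.h>

int main(void)
{
    const char *s = "an example string";      /* hypothetical test input */
    assert(strlen_SSE2(s) == strlen(s));      /* both should report 17 */
    return 0;
}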
If the function is called on a memory location that is not aligned, the function backs up to the start of the containing 16 byte block. In that case we must ignore any null characters that happen to fall before the caller's starting address. This is done by masking, in particular
& ((1 << 16) - (1 << byteOffsetFromAligned))
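For example (the offset value here is chosen purely for illustration), if sz sits 3 bytes into its 16 byte block:

/* Hypothetical offset: sz is 3 bytes past the start of its aligned block. */
ULONG_PTR byteOffsetFromAligned = 3;
ULONG mask = (1 << 16) - (1 << byteOffsetFromAligned);  /* 0x10000 - 0x0008 = 0xFFF8 */
/* Bits 0..2 (the bytes before sz) are cleared; bits 3..15 (sz onward) are kept. */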
We are looking for the first null character. The core of this function is the loop
while (!(iMoveMaskRet = _mm_movemask_epi8(_mm_cmpeq_epi8(_mm_load_si128((__m128i *)sz), mmZero))))
{
    sz += 16;
}
This loads a 16 byte block from aligned memory. It compares each of the bytes to null. It exits the loop if any of the bytes is null. If none of the 16 bytes are null, it advances to the next aligned 16 byte block.
To find the first null byte we need the index of the lowest set bit in the result of _mm_movemask_epi8. This bit manipulation functionality is exposed by the _BitScanForward intrinsic.
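For instance (the mask value is invented for illustration), if the comparison found null bytes at positions 5 and 9 of the block, the movemask is 0x0220 and _BitScanForward reports the lower of the two:

ULONG ulIndexBit;
ULONG iMoveMaskRet = 0x0220;                /* hypothetical mask: nulls at bytes 5 and 9 */
_BitScanForward(&ulIndexBit, iMoveMaskRet);
/* ulIndexBit is now 5, the offset of the first null within the 16 byte block. */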
The central loop of the compiled assembly is very tight:
01001060 movaps xmm0,xmmword ptr [edx]
01001063 pcmpeqb xmm0,xmm1
01001067 pmovmskb eax,xmm0
0100106B test eax,eax
0100106D jne strlen_SSE2+83h (01001083h)
0100106F nop
01001070 movaps xmm0,xmmword ptr [edx+10h]
01001074 add edx,10h
01001077 pcmpeqb xmm0,xmm1
0100107B pmovmskb eax,xmm0
0100107F test eax,eax
01001081 je strlen_SSE2+70h (01001070h)
Overall this routine takes about half the time of the standard strlen. In x64 code it ends up being faster for all but empty strings. In x86 code it is faster by the same factor for long strings, but is in fact a little slower for short strings; on x86, the break-even point seems to be around strings 15 characters long.
This is easily adjusted for use with wide character (16-bit) strings as a replacement for wcslen.
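A minimal sketch of that adjustment (not from the original post; it assumes the pointer has at least the natural 2 byte alignment of wchar_t). _mm_cmpeq_epi16 compares whole wide characters, and because _mm_movemask_epi8 still produces one bit per byte, each matching character yields two set bits, so byte indices are divided by two to get character counts:

size_t __cdecl wcslen_SSE2(const wchar_t *szOrig)
{
    const wchar_t *sz = szOrig;
    __m128i mmZero = _mm_setzero_si128();
    ULONG_PTR byteOffsetFromAligned = (15 & (ULONG_PTR)sz);
    if (byteOffsetFromAligned)
    {
        /* Back up to the start of the 16 byte block; the offset is even because sz is 2 byte aligned. */
        ULONG_PTR pAlignedsz = ((ULONG_PTR)sz) - byteOffsetFromAligned;
        ULONG iMoveMaskRet;
        iMoveMaskRet = _mm_movemask_epi8(_mm_cmpeq_epi16(_mm_load_si128((__m128i *)pAlignedsz), mmZero)) & ((1 << 16) - (1 << byteOffsetFromAligned));
        if (iMoveMaskRet)
        {
            ULONG ulIndexBit;
            _BitScanForward(&ulIndexBit, iMoveMaskRet);
            /* Byte distance from sz to the first null character, converted to characters. */
            return (ulIndexBit - byteOffsetFromAligned) / sizeof(wchar_t);
        }
        sz = (const wchar_t *)(pAlignedsz + 16);
    }
    ULONG iMoveMaskRet;
    while (!(iMoveMaskRet = _mm_movemask_epi8(_mm_cmpeq_epi16(_mm_load_si128((__m128i *)sz), mmZero))))
    {
        sz += 8;    /* 8 wide characters per 16 byte block */
    }
    ULONG ulIndexBit;
    _BitScanForward(&ulIndexBit, iMoveMaskRet);
    return (sz - szOrig) + ulIndexBit / sizeof(wchar_t);
}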