MicroPerf: September 2017

Monday, September 25, 2017

Reversing the 16bit numbers in an XMM register using SSE2 and SSSE3

A bunch of algorithms end up requiring reversing the order of a bunch of bytes. In particular I've seen this come up a lot in code to decompress GIF images. A diagonal string distance algorithm on standard char strings would require reversing a string of bytes.

But my world doesn't involve many single width strings. My world tends to have 16bit Unicode strings. So reversing a string of 16bit characters will instead come up. 16bit numbers are treated the same as 16bit Unicode characters.

This can be done just with SSE2 instructions.

__m128i _mm_reverse_epi16_SSE2(const __m128i &mToReverse)
{
    __m128i mReversed_epi32 = _mm_shuffle_epi32(mToReverse, _MM_SHUFFLE(0, 1, 2, 3));
    __m128i mLowDone = _mm_shufflelo_epi16(mReversed_epi32, _MM_SHUFFLE(2, 3, 0, 1));
    __m128i mResult = _mm_shufflehi_epi16(mLowDone, _MM_SHUFFLE(2, 3, 0, 1));
    return mResult;
}

But in fact can be done in a single instruction in SSSE3 (since the pshufb instruction supports using a memory address for the second parameter).

__m128i _mm_reverse_epi16_SSSE3(const __m128i &mToReverse)
{
    static const __m128i reverseKeys = { 14,15,12,13,10,11,8,9,6,7,4,5,2,3,0,1 };
    __m128i mResult = _mm_shuffle_epi8(mToReverse, reverseKeys);
    return mResult;
}

This is satisfying.

Sunday, September 24, 2017

Levenshtein distance calculated on the diagonal (but only in ANSI for now)

A long time ago I promised to talk about a better way of calculating Levenshtein distance (edit distance). A way that is more conducive to vectorization.
The trick is to calculate the same edit distance matrix as the standard two row method, but instead fill in the values in a diagonal manner.
This ends up having each diagonal depending on the previous two diagonals. But importantly each diagonal does not depend on any other element of itself.

size_t Levenshtein_distance_diagonal_ANSI(const wchar_t* const s, const wchar_t* const t)
{
const unsigned int lenS = wcslen(s);
const unsigned int lenT = wcslen(t);
if (0 == lenS)
{
return lenT;
}
if (0 == lenT)
{
return lenS;
}
if (0 == wcscmp(s, t))
{
return 0;
}

// create work vectors of integer distances
unsigned int *v0 = new unsigned int[lenT + 1];
unsigned int *v1 = new unsigned int[lenT + 1];
unsigned int *v2 = new unsigned int[lenT + 1];

v1[0] = 0;
for (unsigned int j =1; j <= (lenS+lenT); j++)
{
unsigned int iMin;
unsigned int iMax;
v2[0] = j * 1;
iMin = (unsigned int)max(1, (int)(j - lenS));
if (j <= lenT)
{
v2[j] = j * 1;
iMax = j-1;
}
else
{
iMax = lenT;
}

for (unsigned int i = iMin; i <= iMax; i++)
{
unsigned int substituteCost = v0[i-1]+(s[j-i-1] != t[i-1]);
unsigned int insertionCost = v1[i-1]+1;
unsigned int deletionCost = v1[i]+1;
v2[i] = min(min(substituteCost, insertionCost), deletionCost);
}
unsigned int * const vTemp = v0;
v0 = v1;
v1 = v2;
v2 = vTemp;
}

unsigned int retVal = v1[lenT];
delete[] v0;
delete[] v1;
delete[] v2;
return retVal;
}

While this implementation is not any better than the two row implementation, we can see where this is going.
This also can work with storage only for two diagonals, overwriting a diagonal just after its value is read.