Monday, September 25, 2017

Reversing the 16bit numbers in an XMM register using SSE2 and SSSE3

A bunch of algorithms end up requiring reversing the order of a bunch of bytes.  In particular I've seen this come up a lot in code to decompress GIF images. A diagonal string distance algorithm on standard char strings would require reversing a string of bytes.

But my world doesn't involve many single width strings.  My world tends to have 16bit Unicode strings.  So reversing a string of 16bit characters will instead come up.  16bit numbers are treated the same as 16bit Unicode characters.

This can be done just with SSE2 instructions.

__m128i _mm_reverse_epi16_SSE2(const __m128i &mToReverse)
{
    __m128i mReversed_epi32 = _mm_shuffle_epi32(mToReverse, _MM_SHUFFLE(0, 1, 2, 3));
    __m128i mLowDone = _mm_shufflelo_epi16(mReversed_epi32, _MM_SHUFFLE(2, 3, 0, 1));
    __m128i mResult = _mm_shufflehi_epi16(mLowDone, _MM_SHUFFLE(2, 3, 0, 1));
    return mResult;
}


But in fact can be done in a single instruction in SSSE3 (since the pshufb instruction supports using a  memory address for the second parameter).

__m128i _mm_reverse_epi16_SSSE3(const __m128i &mToReverse)
{
    static const __m128i reverseKeys = { 14,15,12,13,10,11,8,9,6,7,4,5,2,3,0,1 };
    __m128i mResult = _mm_shuffle_epi8(mToReverse, reverseKeys);
    return mResult;
}


This is satisfying.

Sunday, September 24, 2017

Levenshtein distance calculated on the diagonal (but only in ANSI for now)

A long time ago I promised to talk about a better way of calculating Levenshtein distance (edit distance).  A way that is more conducive to vectorization.
The trick is to calculate the same edit distance matrix as the standard two row method, but instead fill in the values in a diagonal manner.
This ends up having each diagonal depending on the previous two diagonals.  But importantly each diagonal does not depend on any other element of itself.

size_t Levenshtein_distance_diagonal_ANSI(const wchar_t* const s, const wchar_t* const t)
{
    const unsigned int lenS = wcslen(s);
    const unsigned int lenT = wcslen(t);
    if (0 == lenS)
    {
        return lenT;
    }
    if (0 == lenT)
    {
        return lenS;
    }
    if (0 == wcscmp(s, t))
    {
        return 0;
    }

    // create work vectors of integer distances
    unsigned int *v0 = new unsigned int[lenT + 1];
    unsigned int *v1 = new unsigned int[lenT + 1];
    unsigned int *v2 = new unsigned int[lenT + 1];

    v1[0] = 0;
    for (unsigned int j =1; j <= (lenS+lenT); j++)
    {
        unsigned int iMin;
        unsigned int iMax;
        v2[0] = j * 1;
        iMin = (unsigned int)max(1, (int)(j - lenS));
        if (j <= lenT)
        {
            v2[j] = j * 1;
            iMax = j-1;
        }
        else
        {
            iMax = lenT;
        }

        for (unsigned int i = iMin; i <= iMax; i++)
        {
            unsigned int substituteCost = v0[i-1]+(s[j-i-1] != t[i-1]);
            unsigned int insertionCost = v1[i-1]+1;
            unsigned int deletionCost = v1[i]+1;
            v2[i] = min(min(substituteCost, insertionCost), deletionCost);
        }
        unsigned int * const vTemp = v0;
        v0 = v1;
        v1 = v2;
        v2 = vTemp;
    }

    unsigned int retVal = v1[lenT];
    delete[] v0;
    delete[] v1;
    delete[] v2;
    return retVal;
}

While this implementation is not any better than the two row implementation, we can see where this is going.
This also can work with storage only for two diagonals, overwriting a diagonal just after its value is read.