Saturday, December 17, 2016

Still more dot product goodness - unsigned short

Unfortunately, there is no unsigned equivalent of the _mm_madd_epi16 instruction.  The unsigned multiply has to be constructed from two multiplies: one that produces the low 16 bits of each result, and one that produces the high 16 bits.  The instruction for the low 16 bits (_mm_mullo_epi16) is in fact shared with the signed version, since the low 16 bits of a 16bit*16bit=32bit multiply are the same whether the operands are treated as signed or unsigned.
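
As a quick sanity check of that claim, here is a minimal scalar sketch of what each 16-bit lane does. The helper names (mulLoUnsigned, mulLoSigned, mulHiUnsigned, combineHalves) are made up for illustration and are not part of the original code.

```cpp
#include <cstdint>

// Low 16 bits of the product, treating both operands as unsigned.
inline uint16_t mulLoUnsigned(uint16_t a, uint16_t b)
{
    return static_cast<uint16_t>(static_cast<uint32_t>(a) * static_cast<uint32_t>(b));
}

// Low 16 bits of the product, treating the same bit patterns as signed.
inline uint16_t mulLoSigned(uint16_t a, uint16_t b)
{
    int32_t product = static_cast<int32_t>(static_cast<int16_t>(a)) *
                      static_cast<int32_t>(static_cast<int16_t>(b));
    return static_cast<uint16_t>(static_cast<uint32_t>(product));
}

// High 16 bits of the unsigned product (per lane, what _mm_mulhi_epu16 yields).
inline uint16_t mulHiUnsigned(uint16_t a, uint16_t b)
{
    return static_cast<uint16_t>((static_cast<uint32_t>(a) * static_cast<uint32_t>(b)) >> 16);
}

// Reassemble the full 32-bit product from its two halves.
inline uint32_t combineHalves(uint16_t lo, uint16_t hi)
{
    return (static_cast<uint32_t>(hi) << 16) | lo;
}
```

For any pair of inputs, mulLoUnsigned and mulLoSigned should agree, and combineHalves of the low and high halves should equal the full unsigned product.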

__forceinline unsigned int dotProduct_Ansi(const unsigned short *pA, const unsigned short *pB, UINT cElements)
{
    unsigned int iRet = 0;
    UINT cElementsRemaining = cElements;
    while (cElementsRemaining > 0)
    {
        cElementsRemaining--;
        iRet += ((unsigned int)pA[cElementsRemaining]) * ((unsigned int)pB[cElementsRemaining]);
    }
    return iRet;
}


unsigned int dotProduct_SSE2(const unsigned short *pA, const unsigned short *pB, UINT cElements)
{
    // A 128-bit register holds eight unsigned shorts, so the vector loop works in blocks of eight.
    UINT cElements_endOfEight = cElements&~7;
    // The scalar routine handles the 0 to 7 trailing elements.
    unsigned int iRet = dotProduct_Ansi(pA + cElements_endOfEight, pB + cElements_endOfEight, cElements & 7);
    UINT cElementsRemaining = cElements_endOfEight;
    if (cElementsRemaining > 0)
    {
        __m128i mSummedRight = _mm_setzero_si128();
        __m128i mSummedLeft = _mm_setzero_si128();
        do
        {
            cElementsRemaining -= 8;
            __m128i mA = _mm_loadu_si128((__m128i const*)&pA[cElementsRemaining]);
            __m128i mB = _mm_loadu_si128((__m128i const*)&pB[cElementsRemaining]);

            // High and low 16 bits of each of the eight unsigned products.
            __m128i mulHi = _mm_mulhi_epu16(mA, mB);
            __m128i mulLo = _mm_mullo_epi16(mA, mB);

            // Interleave low and high halves to rebuild the 32-bit products:
            // elements 0..3 go to mulLeft, elements 4..7 go to mulRight.
            __m128i mulLeft = _mm_unpacklo_epi16(mulLo, mulHi);
            __m128i mulRight = _mm_unpackhi_epi16(mulLo, mulHi);

            mSummedLeft = _mm_add_epi32(mSummedLeft, mulLeft);
            mSummedRight = _mm_add_epi32(mSummedRight, mulRight);
        } while (cElementsRemaining > 0);
        iRet += horizontalSum_SSE2(_mm_add_epi32(mSummedLeft, mSummedRight));
    }
    return iRet;
}

Keeping separate sums for left and right gives two independent chains of adds instead of one, which allows increased instruction pipelining.  The two sums are only combined once, after the loop.
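
To see the unpack step in isolation, here is a minimal sketch that widens one block of eight unsigned shorts into eight 32-bit products and stores them. Both helpers are hypothetical and not taken from the code above: widenedProducts_SSE2 exists only for checking the interleave, and horizontalSumSketch is just one plausible way to add four 32-bit lanes, not necessarily how horizontalSum_SSE2 is written.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cstdint>

// Hypothetical helper: computes the eight 32-bit products of one block
// of eight unsigned shorts and writes them to out[0..7] in element order.
inline void widenedProducts_SSE2(const uint16_t *pA, const uint16_t *pB, uint32_t out[8])
{
    __m128i mA = _mm_loadu_si128(reinterpret_cast<const __m128i*>(pA));
    __m128i mB = _mm_loadu_si128(reinterpret_cast<const __m128i*>(pB));

    __m128i mulHi = _mm_mulhi_epu16(mA, mB);  // high 16 bits of each product
    __m128i mulLo = _mm_mullo_epi16(mA, mB);  // low 16 bits of each product

    // Interleaving lo,hi pairs yields little-endian 32-bit products:
    // unpacklo covers elements 0..3, unpackhi covers elements 4..7.
    __m128i mulLeft = _mm_unpacklo_epi16(mulLo, mulHi);
    __m128i mulRight = _mm_unpackhi_epi16(mulLo, mulHi);

    _mm_storeu_si128(reinterpret_cast<__m128i*>(out), mulLeft);
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out + 4), mulRight);
}

// Hypothetical stand-in for a horizontal sum of four 32-bit lanes.
inline uint32_t horizontalSumSketch(__m128i v)
{
    // Add the upper 64 bits onto the lower 64 bits.
    __m128i pairs = _mm_add_epi32(v, _mm_shuffle_epi32(v, _MM_SHUFFLE(1, 0, 3, 2)));
    // Add the neighbouring 32-bit lane, leaving the total in lane 0.
    __m128i total = _mm_add_epi32(pairs, _mm_shuffle_epi32(pairs, _MM_SHUFFLE(2, 3, 0, 1)));
    return static_cast<uint32_t>(_mm_cvtsi128_si32(total));
}
```

Comparing each out[i] against (uint32_t)pA[i] * pB[i] is an easy way to confirm the interleave order before trusting the accumulated result.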
