MicroPerf: the dot product

The dot product is one of the fundamental algorithms of numerical calculation. In particular, matrix multiplication can be phrased as a series of dot products.
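To make that concrete, here is a minimal sketch of a matrix-vector product written as one dot product per row. The names dotRow and matVecMul and the row-major layout are my assumptions for illustration, not code from this series.

```c
#include <stddef.h>

/* Plain scalar dot product of two n-element arrays (illustrative helper). */
static float dotRow(const float *a, const float *b, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}

/* out[r] = dot(row r of m, v).
   Assumes m is a rows x cols matrix stored row-major (hypothetical layout). */
void matVecMul(const float *m, const float *v, float *out,
               size_t rows, size_t cols)
{
    for (size_t r = 0; r < rows; r++)
        out[r] = dotRow(m + r * cols, v, cols);
}
```

A full matrix-matrix product is the same idea repeated for every column of the right-hand matrix, so any speed-up in the dot product carries straight through.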

The feedForward stage of a neural network is a series of matrix multiplications, each followed by a non-linear transformation of the results.
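As a rough illustration of where the dot product sits in that stage, the sketch below computes one layer: a dot product per output neuron, then a non-linearity. The function name feedForwardLayer, the row-major weight layout, and the choice of ReLU as the non-linear transformation are all assumptions made for the example.

```c
#include <stddef.h>

/* ReLU, picked here only as an example non-linearity (assumption). */
static float relu(float x)
{
    return x > 0.0f ? x : 0.0f;
}

/* One feed-forward layer: output[o] = relu(dot(weights row o, input)).
   weights is assumed to be nOut x nIn, stored row-major. */
void feedForwardLayer(const float *weights, const float *input,
                      float *output, size_t nOut, size_t nIn)
{
    for (size_t o = 0; o < nOut; o++)
    {
        float sum = 0.0f;
        for (size_t i = 0; i < nIn; i++)
            sum += weights[o * nIn + i] * input[i];   /* the dot product */
        output[o] = relu(sum);
    }
}
```

The inner loop is exactly the dot product, so it is the part worth optimizing.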

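// Scalar reference implementation. UINT is the Windows typedef for unsigned int.
float dotProduct_Ansi(const float *pA, const float *pB, UINT uiABOrig)
{
    float fRet = 0;
    UINT uiAB = uiABOrig;   // number of elements still to be added
    while (uiAB > 0)
    {
        uiAB--;
        fRet += pA[uiAB] * pB[uiAB];
    }
    return fRet;
}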

// Needs an SSE intrinsics header such as <emmintrin.h> for __m128 and the _mm_* calls.
float dotProduct_SSE2(const float *pA, const float *pB, UINT uiABOrig)
{
    // Largest multiple of four that fits; the 0 to 3 leftover elements go to the scalar code.
    UINT uiAB_endOfFour = uiABOrig & ~3;
    float fRet = dotProduct_Ansi(pA + uiAB_endOfFour, pB + uiAB_endOfFour, uiABOrig & 3);
    UINT uiAB = uiAB_endOfFour;
    if (uiAB > 0)
    {
        __m128 mSummed = _mm_setzero_ps();   // four running partial sums
        do
        {
            uiAB -= 4;
            __m128 mA = _mm_loadu_ps(&pA[uiAB]);   // unaligned load of four floats
            __m128 mB = _mm_loadu_ps(&pB[uiAB]);
            __m128 mMulAB = _mm_mul_ps(mA, mB);
            mSummed = _mm_add_ps(mSummed, mMulAB);
        } while (uiAB > 0);
        fRet += horizontalSum_SSE2(mSummed);   // collapse the four lanes to one float
    }
    return fRet;
}
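The function above calls horizontalSum_SSE2, which is not shown in this post. In case it helps, here is one possible way to write such a helper using two shuffles and two adds. This is a hypothetical stand-in that I am assuming has the same signature; the version from the earlier post may well be written differently.

```c
#include <emmintrin.h>   /* SSE2 header; also pulls in the SSE float intrinsics */

/* Hypothetical stand-in for horizontalSum_SSE2: adds the four float
   lanes of v and returns the total. Lanes are written lowest first. */
float horizontalSum_SSE2(__m128 v)
{
    /* Swap the two 64-bit halves: [a b c d] -> [c d a b]. */
    __m128 swapped = _mm_shuffle_ps(v, v, _MM_SHUFFLE(1, 0, 3, 2));
    /* pairs = [a+c, b+d, a+c, b+d]. */
    __m128 pairs = _mm_add_ps(v, swapped);
    /* Swap adjacent lanes so that lane 0 holds b+d. */
    __m128 odd = _mm_shuffle_ps(pairs, pairs, _MM_SHUFFLE(2, 3, 0, 1));
    /* Add only the lowest lanes: lane 0 = (a+c) + (b+d). */
    __m128 total = _mm_add_ss(pairs, odd);
    return _mm_cvtss_f32(total);
}
```

Note that this adds the lanes in a different order than a left-to-right scalar loop would, so with general floating-point data the result can differ from the scalar sum in the last few bits.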

This follows several patterns that come up often in SSE2 code. The leftover elements that don't fit conveniently in an __m128 register are handled by the Ansi code. The Ansi code can also conveniently be used to verify the SSE2 code. I'm fond of loops with a control variable decreasing to zero. In this case the loop control variable is the number of elements remaining to be added. It has been constructed so that it is always a multiple of four, the number of floats that fit into an __m128. At the end we use the horizontalSum from a previous post.
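The verification idea can be sketched as a small comparison harness: run a scalar reference and a candidate over every length from zero up to some limit, so that all four possible leftover sizes get exercised. To keep the sketch portable, the candidate here is a plain C stand-in that mimics the four lanes of an __m128 with four partial sums. The names dotReference, dotFourLanes and dotProductsAgree, the 64-element limit and the test data are all my own assumptions.

```c
/* Signature shared by the implementations being compared. */
typedef float (*DotFn)(const float *, const float *, unsigned);

/* Scalar reference, same shape as the Ansi loop: count down to zero. */
float dotReference(const float *a, const float *b, unsigned n)
{
    float sum = 0.0f;
    while (n > 0)
    {
        n--;
        sum += a[n] * b[n];
    }
    return sum;
}

/* Portable stand-in for the SIMD version: scalar tail for the 0 to 3
   leftover elements, then four partial sums, one per "lane". */
float dotFourLanes(const float *a, const float *b, unsigned n)
{
    unsigned endOfFour = n & ~3u;
    float sum = dotReference(a + endOfFour, b + endOfFour, n & 3u);
    float lane[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    unsigned i = endOfFour;
    while (i > 0)
    {
        i -= 4;
        for (unsigned k = 0; k < 4; k++)
            lane[k] += a[i + k] * b[i + k];
    }
    return sum + ((lane[0] + lane[1]) + (lane[2] + lane[3]));
}

/* Returns 1 if candidate matches reference within tolerance for every
   length 0..maxLen, otherwise 0. maxLen is limited to 64 here. */
int dotProductsAgree(DotFn reference, DotFn candidate,
                     unsigned maxLen, float tolerance)
{
    float a[64], b[64];
    if (maxLen > 64)
        return 0;
    for (unsigned i = 0; i < 64; i++)
    {
        a[i] = (float)(i % 7) - 3.0f;    /* small integers -3..3 */
        b[i] = (float)(i % 5) * 0.5f;    /* 0, 0.5, 1, 1.5, 2 */
    }
    for (unsigned n = 0; n <= maxLen; n++)
    {
        float diff = reference(a, b, n) - candidate(a, b, n);
        if (diff < 0.0f)
            diff = -diff;
        if (diff > tolerance)
            return 0;
    }
    return 1;
}
```

Passing dotProduct_Ansi and dotProduct_SSE2 in place of the two stand-ins would need only a cast for the UINT parameter type. The tolerance is there because the vectorized version adds the products in a different order, which with real data can change the last few bits of the result.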

MicroPerf

Friday, December 9, 2016

the dot product - float
