Friday, December 9, 2016

the dot product - float

The dot product is one of the fundamental algorithms of numerical calculation. In particular, matrix multiplication can be phrased as a series of dot products.
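To make the "series of dot products" concrete, here is a minimal sketch of a matrix-vector multiply written that way. The names dot_sketch and matVec_sketch, the row-major layout, and the use of size_t are my own assumptions for illustration and are not part of the code in this post.

```c
#include <stddef.h>

/* Plain scalar dot product of two length-n arrays (hypothetical helper). */
static float dot_sketch(const float *a, const float *b, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}

/* y = M * x, where M is rows x cols and stored row-major (an assumption).
   Each output element is the dot product of one row of M with x. */
static void matVec_sketch(const float *m, const float *x, float *y,
                          size_t rows, size_t cols)
{
    for (size_t r = 0; r < rows; r++)
        y[r] = dot_sketch(m + r * cols, x, cols);
}
```

A full matrix-matrix product repeats the same idea once per column of the right-hand matrix, so a faster dot product speeds up the whole thing.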

The feed-forward stage of a neural network is a series of matrix multiplications, each followed by a non-linear transformation of the results.
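As a rough illustration of one such layer, the sketch below computes a dot product per output and then applies a non-linearity. The choice of ReLU, the omission of bias terms, the row-major weight layout, and the names relu_sketch and feedForwardLayer_sketch are all assumptions made for the example, not something this post prescribes.

```c
#include <stddef.h>

/* ReLU is an assumed choice of non-linearity; the post does not name one. */
static float relu_sketch(float v)
{
    return v > 0.0f ? v : 0.0f;
}

/* One layer: out[r] = relu(dot(row r of weights, in)).
   weights is rows x cols, row-major. Bias terms are omitted for brevity. */
static void feedForwardLayer_sketch(const float *weights, const float *in,
                                    float *out, size_t rows, size_t cols)
{
    for (size_t r = 0; r < rows; r++)
    {
        const float *row = weights + r * cols;
        float sum = 0.0f;
        for (size_t c = 0; c < cols; c++)
            sum += row[c] * in[c];   /* the dot product at the core of the layer */
        out[r] = relu_sketch(sum);
    }
}
```

The inner loop is where nearly all the time goes, which is why the dot product is the piece worth optimizing.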

float dotProduct_Ansi(const float *pA, const float *pB, UINT uiABOrig)
{
    float fRet = 0;
    UINT uiAB = uiABOrig;
    while (uiAB > 0)
    {
        uiAB--;  // step down first, so the indices run from uiABOrig-1 to 0
        fRet += pA[uiAB] * pB[uiAB];
    }
    return fRet;
}

// The SSE2 intrinsics are declared in <emmintrin.h>.
float dotProduct_SSE2(const float *pA, const float *pB, UINT uiABOrig)
{
    UINT uiAB_endOfFour = uiABOrig&~3;  // largest multiple of four <= uiABOrig
    // the 0 to 3 leftover elements at the tail are handled by the Ansi code
    float fRet = dotProduct_Ansi(pA + uiAB_endOfFour, pB + uiAB_endOfFour, uiABOrig & 3);
    UINT uiAB = uiAB_endOfFour;
    if (uiAB > 0)
    {
        __m128 mSummed = _mm_setzero_ps();
        do
        {
            uiAB -= 4;  // uiAB stays a multiple of four, ending at zero
            __m128 mA = _mm_loadu_ps((const float *)&pA[uiAB]);
            __m128 mB = _mm_loadu_ps((const float *)&pB[uiAB]);
            __m128 mMulAB = _mm_mul_ps(mA, mB);
            mSummed = _mm_add_ps(mSummed, mMulAB);
        } while (uiAB > 0);
        fRet += horizontalSum_SSE2(mSummed);
    }
    return fRet;
}

This follows a number of patterns that come up in SSE2 code. The leftover elements that don't fit conveniently into an __m128 register are handled by the Ansi code. The Ansi code can also conveniently be used to verify the SSE2 code. I'm fond of loops with a control variable decreasing to zero. In this case the loop control variable is the number of elements remaining to be added. It has been constructed so that it is always a multiple of four, matching the four floats that fit into an __m128. At the end we use the horizontalSum from a previous post.
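For readers who don't have the earlier post to hand, here is one possible way to add up the four lanes of an __m128. This is only a sketch under my own assumptions: the name horizontalSum_sketch is hypothetical, and the version in the previous post may well be written differently.

```c
#include <emmintrin.h>

/* Sum the four float lanes of v = (v0, v1, v2, v3).
   Hypothetical stand-in, not necessarily the earlier post's horizontalSum. */
static float horizontalSum_sketch(__m128 v)
{
    __m128 hi   = _mm_movehl_ps(v, v);   /* (v2, v3, v2, v3) */
    __m128 sum2 = _mm_add_ps(v, hi);     /* lane 0 = v0+v2, lane 1 = v1+v3 */
    /* copy lane 1 of sum2 into lane 0 so the two partial sums line up */
    __m128 odd  = _mm_shuffle_ps(sum2, sum2, _MM_SHUFFLE(1, 1, 1, 1));
    __m128 sum1 = _mm_add_ss(sum2, odd); /* lane 0 = (v0+v2) + (v1+v3) */
    return _mm_cvtss_f32(sum1);          /* extract lane 0 */
}
```

Note that the lanes are added in a different order from the scalar loop, so when checking the SSE2 result against the Ansi result it is safer to compare within a small tolerance than to expect bit-identical floats.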

