The dot product is one of the fundamental algorithms of numerical calculation. In particular, matrix multiplication can be phrased as a series of dot products.
The feed-forward stage of a neural network is a series of matrix multiplications, each followed by a non-linear transformation of the results.
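To make that connection concrete, here is a minimal sketch of a single feed-forward layer written as one dot product per output value. None of this is from the original code: the names dot_sketch and feedForwardLayer_sketch are made up for illustration, the weights are assumed to be stored row-major, and ReLU is only an assumed stand-in for the non-linear transformation.

```c
#include <assert.h>
#include <stddef.h>

/* Plain scalar dot product over n elements (hypothetical helper). */
static float dot_sketch(const float *pA, const float *pB, unsigned int n)
{
    float fRet = 0;
    while (n > 0)
    {
        n--;
        fRet += pA[n] * pB[n];
    }
    return fRet;
}

/* One layer: pOut[r] = relu(dot(row r of pW, pIn)).
   pW is assumed to be a row-major matrix with `rows` rows and `cols` columns. */
static void feedForwardLayer_sketch(const float *pW, const float *pIn, float *pOut,
                                    unsigned int rows, unsigned int cols)
{
    assert(pW != NULL && pIn != NULL && pOut != NULL);
    for (unsigned int r = 0; r < rows; r++)
    {
        /* Each output is the dot product of one weight row with the input. */
        float z = dot_sketch(pW + (size_t)r * cols, pIn, cols);
        /* ReLU is an assumed choice of non-linearity for this sketch. */
        pOut[r] = (z > 0.0f) ? z : 0.0f;
    }
}
```

Almost all of the time in a layer like this goes into the inner dot product, which is why it is worth speeding up.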
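#include <emmintrin.h> // SSE/SSE2 intrinsics: __m128, _mm_loadu_ps, _mm_mul_ps, _mm_add_ps

// UINT is the Windows typedef for unsigned int.
float dotProduct_Ansi(const float *pA, const float *pB, UINT uiABOrig)
{
    float fRet = 0;
    UINT uiAB = uiABOrig;
    while (uiAB > 0)
    {
        uiAB--;
        fRet += pA[uiAB] * pB[uiAB];
    }
    return fRet;
}

float dotProduct_SSE2(const float *pA, const float *pB, UINT uiABOrig)
{
    // Round the length down to a multiple of four.
    UINT uiAB_endOfFour = uiABOrig&~3;
    // The scalar code handles the 0-3 leftover elements at the end of the arrays.
    float fRet = dotProduct_Ansi(pA + uiAB_endOfFour, pB + uiAB_endOfFour, uiABOrig & 3);
    UINT uiAB = uiAB_endOfFour; // elements remaining, always a multiple of four
    if (uiAB > 0)
    {
        __m128 mSummed = _mm_setzero_ps(); // four running partial sums
        do
        {
            uiAB-=4;
            // Unaligned loads, so pA and pB need no particular alignment.
            __m128 mA = _mm_loadu_ps((const float *)&pA[uiAB]);
            __m128 mB = _mm_loadu_ps((const float *)&pB[uiAB]);
            __m128 mMulAB = _mm_mul_ps(mA, mB);
            mSummed = _mm_add_ps(mSummed, mMulAB);
        } while (uiAB > 0);
        // Collapse the four partial sums into a single float.
        fRet += horizontalSum_SSE2(mSummed);
    }
    return fRet;
}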
This follows a bunch of patterns that come up in SSE2. The leftover bit that doesn't fit conveniently in an __m128 register is handled by the Ansi code. The Ansi code can also conveniently be used to verify the SSE2 code. I'm fond of loops with a control variable decreasing to zero. In this case the loop-controlling variable is the number of elements remaining to be added. It has been constructed so that it is always a multiple of four - the four elements that fit into an __m128. At the end we use the horizontalSum from a previous post.
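Here is a rough, self-contained sketch of checking the vector version against the scalar one. The horizontal sum from the earlier post isn't reproduced here, so horizontalSum_sketch is only an assumed stand-in and may differ from that version. The names dotProduct_scalar, dotProduct_vector, dotProductsAgree and selfCheck_sketch are also made up for this example, and plain unsigned int replaces UINT so it builds without the Windows headers. The two versions add the products in a different order, so the comparison uses a tolerance rather than exact equality.

```c
#include <assert.h>
#include <emmintrin.h>

/* Assumed stand-in for horizontalSum_SSE2: adds the four lanes of v. */
static float horizontalSum_sketch(__m128 v)
{
    __m128 hi = _mm_movehl_ps(v, v);        /* (v2, v3, v2, v3) */
    __m128 sum2 = _mm_add_ps(v, hi);        /* lane0 = v0+v2, lane1 = v1+v3 */
    __m128 lane1 = _mm_shuffle_ps(sum2, sum2, _MM_SHUFFLE(1, 1, 1, 1));
    return _mm_cvtss_f32(_mm_add_ss(sum2, lane1)); /* lane0 + lane1 */
}

/* Scalar reference, same shape as dotProduct_Ansi. */
static float dotProduct_scalar(const float *pA, const float *pB, unsigned int n)
{
    float fRet = 0;
    while (n > 0)
    {
        n--;
        fRet += pA[n] * pB[n];
    }
    return fRet;
}

/* Vector version, same shape as dotProduct_SSE2. */
static float dotProduct_vector(const float *pA, const float *pB, unsigned int n)
{
    unsigned int endOfFour = n & ~3u;
    float fRet = dotProduct_scalar(pA + endOfFour, pB + endOfFour, n & 3u);
    unsigned int i = endOfFour;
    if (i > 0)
    {
        __m128 mSummed = _mm_setzero_ps();
        do
        {
            i -= 4;
            __m128 mA = _mm_loadu_ps(pA + i);
            __m128 mB = _mm_loadu_ps(pB + i);
            mSummed = _mm_add_ps(mSummed, _mm_mul_ps(mA, mB));
        } while (i > 0);
        fRet += horizontalSum_sketch(mSummed);
    }
    return fRet;
}

/* Returns 1 when both versions agree to within relTol (relative to the
   scalar result, with a floor of 1.0 so tiny results don't blow up the test). */
static int dotProductsAgree(const float *pA, const float *pB, unsigned int n, float relTol)
{
    float a = dotProduct_scalar(pA, pB, n);
    float b = dotProduct_vector(pA, pB, n);
    float diff = (a > b) ? (a - b) : (b - a);
    float mag = (a < 0) ? -a : a;
    float scale = (mag > 1.0f) ? mag : 1.0f;
    return diff <= relTol * scale;
}

/* Odd length on purpose, so both the SIMD loop and the scalar tail run. */
static void selfCheck_sketch(void)
{
    const float a[7] = {1, 2, 3, 4, 5, 6, 7};
    const float b[7] = {7, 6, 5, 4, 3, 2, 1};
    assert(dotProductsAgree(a, b, 7, 1e-6f));
}
```

Lengths that are not a multiple of four are the useful test cases, since they exercise both the four-at-a-time loop and the leftover handling.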