Monday, January 16, 2017

transposing a non-square float matrix with SSE

The transpose of a matrix can be useful.  During a matrix multiply, transposing the right-hand matrix lets each element of the result be expressed as a dot product whose two inputs are both sequential in memory.
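
For instance, here's a sketch (hypothetical names, not a routine from this post) of a multiply that takes the right-hand matrix already transposed.  Both operands of the inner loop walk forward through memory, which is what makes the transpose worth its cost:

void MatMulWithTransposedB(const float *pA, const float *pBT, float *pOut,
                           const size_t M, const size_t K, const size_t N)
{
    // pA is M rows of K floats; pBT is the transpose of B, N rows of K floats.
    for (size_t i = 0; i < M; i++)
    {
        for (size_t j = 0; j < N; j++)
        {
            const float *pRowA = pA + i*K;
            const float *pRowB = pBT + j*K;   // row j of B-transpose == column j of B
            float sum = 0;
            for (size_t k = 0; k < K; k++)
            {
                sum += pRowA[k] * pRowB[k];   // sequential dot product
            }
            pOut[i*N + j] = sum;
        }
    }
}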

Often, though, the matrix will be non-square, and so cannot be transposed in place.  So let's look at transposing with separate source and destination buffers.

We will transpose in 4x4 blocks, ignoring for the moment the rows and columns that don't divide evenly by 4:
    // Walk the 4x4-aligned region from the bottom-right corner back toward the origin.
    for (size_t y = nHeight&~3; y > 0;)
    {
        y -= 4;
        for (size_t x = nWidth&~3; x > 0;)
        {
            x -= 4;
            // Load a 4x4 block of the source...
            __m128 row0 = _mm_loadu_ps(&pMatrixIn[(y + 0)*nWidth + x]);
            __m128 row1 = _mm_loadu_ps(&pMatrixIn[(y + 1)*nWidth + x]);
            __m128 row2 = _mm_loadu_ps(&pMatrixIn[(y + 2)*nWidth + x]);
            __m128 row3 = _mm_loadu_ps(&pMatrixIn[(y + 3)*nWidth + x]);
            // ...transpose it in registers...
            _MM_TRANSPOSE4_PS(row0, row1, row2, row3);
            // ...and store it at the mirrored position in the destination.
            _mm_storeu_ps(&pMatrixOut[(x + 0)*nHeight + y], row0);
            _mm_storeu_ps(&pMatrixOut[(x + 1)*nHeight + y], row1);
            _mm_storeu_ps(&pMatrixOut[(x + 2)*nHeight + y], row2);
            _mm_storeu_ps(&pMatrixOut[(x + 3)*nHeight + y], row3);
        }
    }


This seems pretty tight... but that's a lot of multiplies in the address calculations.  We can factor most of them out so that the inner loop is just pointer arithmetic and adds.  On x64 the central loop can then be accomplished without multiplies and without register spills.

Most of the matrix will be transposed by this set of loops. We will have to handle the rest in two other rectangular strips (making sure not to double-count the corner rectangle).

#include <xmmintrin.h>

void TransposeNonSquareMatrix_SSE(const float *const pMatrixIn, float *const pMatrixOut, const size_t nWidth, const size_t nHeight)
{
    // Scalar pass over the bottom strip: the 0-3 leftover rows.
    for (size_t y = nHeight; y & 3;)
    {
        y--;
        for (size_t x = nWidth; x-- > 0;)
        {
            pMatrixOut[x*nHeight + y] = pMatrixIn[y*nWidth + x];
        }
    }
    // Scalar pass over the right strip: the 0-3 leftover columns,
    // stopping short of the rows already handled above.
    if (nWidth & 3)
    {
        for (size_t y = nHeight&~3; y-- > 0;)
        {
            for (size_t x = nWidth; x & 3;)
            {
                x--;
                pMatrixOut[x*nHeight + y] = pMatrixIn[y*nWidth + x];
            }
        }
    }
    // SSE pass over the region that is a multiple of 4 in both dimensions.
    const size_t nHeightMax = nHeight&~3;
    const size_t nWidthMax = nWidth&~3;
    for (size_t y = nHeightMax; y > 0;)
    {
        y -= 4;
        // Four input row pointers cost one add each instead of a multiply.
        const float *const pIn0 = pMatrixIn + y*nWidth;
        const float *const pIn1 = pIn0 + nWidth;
        const float *const pIn2 = pIn1 + nWidth;
        const float *const pIn3 = pIn2 + nWidth;
        // The output pointer walks backward one destination row per store.
        float *pOut = &pMatrixOut[nWidthMax*nHeight + y];
        for (size_t x = nWidthMax; x > 0;)
        {
            x -= 4;
            __m128 row0;
            __m128 row1;
            __m128 row2;
            __m128 row3;
            row0 = _mm_loadu_ps(&pIn0[x]);
            row1 = _mm_loadu_ps(&pIn1[x]);
            row2 = _mm_loadu_ps(&pIn2[x]);
            row3 = _mm_loadu_ps(&pIn3[x]);
            _MM_TRANSPOSE4_PS(row0, row1, row2, row3);
            _mm_storeu_ps((pOut -= nHeight), row3);
            _mm_storeu_ps((pOut -= nHeight), row2);
            _mm_storeu_ps((pOut -= nHeight), row1);
            _mm_storeu_ps((pOut -= nHeight), row0);
        }
    }
}
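
As a sanity check, a throwaway driver (mine, not from the original post) can run the routine on a 7x5 matrix, so that neither dimension is a multiple of 4 and both leftover strips get exercised, then compare every element against the scalar definition of the transpose:

#include <assert.h>
#include <stdlib.h>

void TestTransposeNonSquare()
{
    const size_t nWidth = 7, nHeight = 5;
    float *pIn = (float *)malloc(nWidth * nHeight * sizeof(float));
    float *pOut = (float *)malloc(nWidth * nHeight * sizeof(float));
    for (size_t i = 0; i < nWidth * nHeight; i++)
    {
        pIn[i] = (float)i;
    }
    TransposeNonSquareMatrix_SSE(pIn, pOut, nWidth, nHeight);
    for (size_t y = 0; y < nHeight; y++)
    {
        for (size_t x = 0; x < nWidth; x++)
        {
            assert(pOut[x*nHeight + y] == pIn[y*nWidth + x]);
        }
    }
    free(pOut);
    free(pIn);
}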


The compiled assembly for the central loop is pretty tight:
         {
            x -= 4;
            __m128 row0;
            __m128 row1;
            __m128 row2;
            __m128 row3;
            row0 = _mm_loadu_ps(&pIn0[x]);
            row1 = _mm_loadu_ps(&pIn1[x]);
000000013F401210  movups      xmm0,xmmword ptr [r10-10h] 
            row2 = _mm_loadu_ps(&pIn2[x]);
            row3 = _mm_loadu_ps(&pIn3[x]);
            _MM_TRANSPOSE4_PS(row0, row1, row2, row3);
            _mm_storeu_ps((pOut -= nHeight), row3);
000000013F401215  sub         rdx,rbp 
000000013F401218  lea         r10,[r10-10h] 
000000013F40121C  movups      xmm4,xmmword ptr [r10+r15] 
000000013F401221  movups      xmm3,xmmword ptr [r10+r11] 
000000013F401226  movups      xmm1,xmmword ptr [r10+rcx] 
000000013F40122B  movaps      xmm5,xmm4 
000000013F40122E  movaps      xmm2,xmm3 
000000013F401231  shufps      xmm4,xmm0,0EEh 
000000013F401235  shufps      xmm5,xmm0,44h 
            row2 = _mm_loadu_ps(&pIn2[x]);
            row3 = _mm_loadu_ps(&pIn3[x]);
            _MM_TRANSPOSE4_PS(row0, row1, row2, row3);
            _mm_storeu_ps((pOut -= nHeight), row3);
000000013F401239  movaps      xmm0,xmm4 
000000013F40123C  shufps      xmm3,xmm1,0EEh 
000000013F401240  shufps      xmm0,xmm3,0DDh 
000000013F401244  movups      xmmword ptr [rdx],xmm0 
            _mm_storeu_ps((pOut -= nHeight), row2);
000000013F401247  sub         rdx,rbp 
000000013F40124A  shufps      xmm2,xmm1,44h 
000000013F40124E  movaps      xmm0,xmm5 
000000013F401251  shufps      xmm4,xmm3,88h 
000000013F401255  shufps      xmm0,xmm2,0DDh 
000000013F401259  movups      xmmword ptr [rdx],xmm4 
            _mm_storeu_ps((pOut -= nHeight), row1);
000000013F40125C  sub         rdx,rbp 
000000013F40125F  shufps      xmm5,xmm2,88h 
000000013F401263  movups      xmmword ptr [rdx],xmm0 
            _mm_storeu_ps((pOut -= nHeight), row0);
000000013F401266  sub         rdx,rbp 
000000013F401269  movups      xmmword ptr [rdx],xmm5 
000000013F40126C  sub         rax,1 
000000013F401270  jne         TransposeNonSquareMatrix_SSE+1A0h (013F401210h) 

The compiler has split the loop variable into two: a countdown in rax and a byte offset in r10.  The final shuffles from the transpose macro are interleaved between the unaligned stores.

Monday, January 2, 2017

Gauss-Jordan Elimination - Part 2

Even after the last post, there's still a bunch of code necessary to get Gauss-Jordan elimination working.  Most of it is pretty mundane nuts and bolts.  But let's be thorough.

We need to multiply a vector in place by a runtime float:
void mul_ps_SSE2(float * const pA, size_t cAOrig, const float fac)
{
    size_t uiA = cAOrig;
    // Scalar loop for the 0-3 leftover elements at the top of the array.
    while (uiA & 3)
    {
        pA[--uiA] *= fac;
    }
    if (uiA)
    {
        // Broadcast the scalar, then process four floats at a time.
        __m128 mFac = _mm_set1_ps(fac);
        do {
            uiA -= 4;
            _mm_storeu_ps(pA + uiA, _mm_mul_ps(_mm_loadu_ps(pA + uiA), mFac));
        } while (uiA > 0);
    }
}

We also need to add one vector, scaled by a runtime float, to another vector:
void addVectorTimesConstant_ps_SSE2(float * const pDest, const float * const pProd, size_t cAOrig, const float dum)
{
    size_t uiA = cAOrig;
    // Scalar loop for the 0-3 leftover elements.
    while (uiA & 3)
    {
        uiA--;
        pDest[uiA] += pProd[uiA] * dum;
    }
    if (uiA)
    {
        const __m128 mDum = _mm_set1_ps(dum);
        do {
            uiA -= 4;
            // dest += prod * dum, four lanes at a time.
            __m128 mDest = _mm_loadu_ps(pDest + uiA);
            __m128 mProd = _mm_loadu_ps(pProd + uiA);
            __m128 mResult = _mm_add_ps(mDest, _mm_mul_ps(mProd, mDum));
            _mm_storeu_ps(pDest + uiA, mResult);
        } while (uiA > 0);
    }
}
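
A quick sanity check (a throwaway harness of my own, assuming the two helpers above) on a length of 7, so both the leftover path and the SSE path run:

#include <assert.h>

void TestVectorHelpers()
{
    float a[7], b[7];
    for (int i = 0; i < 7; i++)
    {
        a[i] = (float)(i + 1);
        b[i] = (float)(i + 1) * 2.0f;
    }
    mul_ps_SSE2(a, 7, 3.0f);                       // a[i] = (i+1)*3
    addVectorTimesConstant_ps_SSE2(a, b, 7, 0.5f); // a[i] += (i+1)*1
    for (int i = 0; i < 7; i++)
    {
        assert(a[i] == (float)(i + 1) * 4.0f);     // exact for small integers
    }
}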


While it's not a huge win, we can also use SSE to assign a runtime int to every element of a vector.  Casting would allow the same routine to be used for single-precision floats.
void setVector_SSE2(int * const pDest, const size_t cAOrig, int iVal)
{
    size_t uiA = cAOrig;
    // Scalar loop for the 0-3 leftover elements.
    while (uiA & 3)
    {
        uiA--;
        pDest[uiA] = iVal;
    }
    if (uiA)
    {
        const __m128i mVal = _mm_set1_epi32(iVal);
        do {
            uiA -= 4;
            // The cast must bind to pDest + uiA; casting pDest first and
            // then adding uiA would step in 16-byte units and overrun.
            _mm_storeu_si128((__m128i *)(pDest + uiA), mVal);
        } while (uiA > 0);
    }
}
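
As a sketch of that casting idea (this wrapper is mine, not from the original post), the float flavor can simply reinterpret the bits and forward to the integer routine, since assignment doesn't care what the bits mean:

void setVector_SSE2(float * const pDest, const size_t cAOrig, float fVal)
{
    setVector_SSE2(reinterpret_cast<int *>(pDest), cAOrig,
                   *reinterpret_cast<int *>(&fVal));
}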

Swapping the contents of two vectors isn't too difficult:

template <class T>
void swap(T &a, T&b)
{
    T temp = a;
    a = b;
    b = temp;
}

void swapArray_SSE2(int * const pA, int * const pB, const size_t cAOrig)
{
    size_t uiA = cAOrig;
    // Peel off one element, then a pair, so the remainder is a multiple of 4.
    if (uiA & 1)
    {
        uiA--;
        swap(pA[uiA], pB[uiA]);
    }
    if (uiA & 2)
    {
        uiA -= 2;
        swap(*reinterpret_cast<__int64*>(pA + uiA), *reinterpret_cast<__int64*>(pB + uiA));
    }
    while (uiA > 0)
    {
        uiA -= 4;
        __m128i mA = _mm_loadu_si128((__m128i*)(pA + uiA));
        __m128i mB = _mm_loadu_si128((__m128i*)(pB + uiA));
        _mm_storeu_si128((__m128i*)(pA + uiA), mB);
        _mm_storeu_si128((__m128i*)(pB + uiA), mA);
    }
}

void swapArray_SSE2(float * const pA, float * const pB, const size_t cAOrig)
{
    swapArray_SSE2(reinterpret_cast<int*>(pA), reinterpret_cast<int*>(pB), cAOrig);
}



With these additional helper methods it isn't too difficult to translate the implementation of gaussj from Numerical Recipes.

void gaussj_SSE2(float **a, const UINT n, float **b, const UINT m)
{
    int *ipiv = (int*)malloc(n * sizeof(int));
    UINT *indxc = (UINT*)malloc(n * sizeof(UINT));
    UINT *indxr = (UINT*)malloc(n * sizeof(UINT));

    // 0x7fffffff marks a row as unprocessed, and doubles as the mask
    // that strips the sign bit for the magnitude comparison.
    setVector_SSE2(ipiv, n, 0x7fffffff);
    for (size_t i = n; i-- > 0; )
    {
        float big = 0;
        size_t irow = 0, icol = 0;
        for (size_t j = n; j-- >0; )
        {
            if (ipiv[j])
            {
                const float *pfBestInRow = findLargestMaskedFloat_SSE2(ipiv, a[j], (UINT)n);
                float fBestValInRow = fabsf(*pfBestInRow);
                if (fBestValInRow > big)
                {
                    big = fBestValInRow;
                    irow = j;
                    icol = (size_t)(pfBestInRow - a[j]);
                }
            }
        }
        // A zero mask marks this column as processed and excludes it from later searches.
        ipiv[icol] = 0;
        if (irow != icol)
        {
            swapArray_SSE2(a[irow], a[icol], n);
            swapArray_SSE2(b[irow], b[icol], m);
        }
        indxr[i] = (UINT)irow;
        indxc[i] = (UINT)icol;
        if (0 == a[icol][icol])
        {
            assert(0 && "gaussj : Singular Matrix");
            free(indxc);
            free(indxr);
            free(ipiv);
            return;
        }
        // Scale the pivot row so the pivot element becomes exactly 1.
        float pivinv = 1.0f / a[icol][icol];
        a[icol][icol] = 1;
        mul_ps_SSE2(a[icol], n, pivinv);
        mul_ps_SSE2(b[icol], m, pivinv);
        // Subtract the right multiple of the pivot row from every other row.
        for (size_t ll = n; ll-- > 0; )
        {
            if (ll != icol)
            {
                float dum = -a[ll][icol];
                a[ll][icol] = 0;
                addVectorTimesConstant_ps_SSE2(a[ll], a[icol], n, dum);
                addVectorTimesConstant_ps_SSE2(b[ll], b[icol], m, dum);
            }
        }
    }
    // Undo the column swaps in reverse order to unscramble the result.
    for (size_t l = n; l-- > 0; )
    {
        if (indxr[l] != indxc[l])
        {
            for (size_t k = 0; k < n; k++)
            {
                swap(a[k][indxr[l]], a[k][indxc[l]]);
            }
        }
    }

    free(indxc);
    free(indxr);
    free(ipiv);
}
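
To make the calling convention concrete, here's a minimal driver (hypothetical values, mine, not from the original): a is n row pointers to n floats each, b is n row pointers to m right-hand-side values, and on return a holds its own inverse while b holds the solutions.

#include <assert.h>
#include <math.h>

void TestGaussj()
{
    // Solve 2x + y = 5 and x + 3y = 10; the solution is x = 1, y = 3.
    float row0[] = { 2, 1 }, row1[] = { 1, 3 };
    float rhs0[] = { 5 }, rhs1[] = { 10 };
    float *a[] = { row0, row1 };
    float *b[] = { rhs0, rhs1 };
    gaussj_SSE2(a, 2, b, 1);
    assert(fabsf(rhs0[0] - 1.0f) < 1e-5f);
    assert(fabsf(rhs1[0] - 3.0f) < 1e-5f);
}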

Sunday, January 1, 2017

Gauss-Jordan elimination - Part 1

Gauss-Jordan elimination is taught in high school algebra class.  There's an implementation of it in Numerical Recipes, which is used as part of the general linear regression.

While the book notes that this isn't the greatest method for solving linear systems, it is instructive since we are familiar with the code.

The first chunk identifies the largest-magnitude element among the rows and columns that haven't already been processed:
        float big = 0;
        int irow = 0, icol = 0;
        for (int j = 0; j < n; j++)
        {
            if (ipiv[j] == 0)
            {
                for (int k = 0; k < n; k++)
                {
                    if (ipiv[k] == 0)
                    {
                        if (fabsf(a[j][k]) >= big)
                        {
                            big = fabsf(a[j][k]);
                            irow = j;
                            icol = k;
                        }
                    }
                }
            }
        }
        ipiv[icol]++;

If we combine the fabsf with the conditional, this code looks a lot like the code for finding the maximum integer in an unsorted list.

One pattern that was mentioned before is masking away the sign bit to take the absolute value.  We can combine this with masking away disallowed values.
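
Concretely (a throwaway illustration, not part of the final routine): the bits of -3.5f are 0xC0600000, and clearing the sign bit leaves 0x40600000, the bits of +3.5f.

    float f = -3.5f;
    int bits = *reinterpret_cast<int *>(&f);
    int absBits = bits & 0x7fffffff;    // fabsf, done as a bit mask
    float fAbs = *reinterpret_cast<float *>(&absBits);
    // fAbs == 3.5f.  Masking with 0 instead of 0x7fffffff zeroes the
    // element, so a disallowed entry can never win the comparison.

Applying that per element gives the scalar version of the search: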

const float * findLargestMaskedFloat_Ansi(const int *pMask, const float * const pA, const UINT cAOrig)
{
    if (0 == cAOrig)
    {
        return nullptr;
    }
    UINT cARemaining = cAOrig;
    cARemaining--;
    // References that alias the float bits as ints, so we can mask them.
    float fBestVal;
    int &iBestVal = *reinterpret_cast<int *>(&fBestVal);
    float fOnVal;
    int &iOnVal = *reinterpret_cast<int *>(&fOnVal);

    iBestVal = pMask[cARemaining] & *reinterpret_cast<const int *>(pA + cARemaining);
    const float *pBest = pA + cARemaining;

    while (cARemaining > 0)
    {
        cARemaining--;
        // Masking with 0x7fffffff gives fabsf; masking with 0 disqualifies.
        iOnVal = pMask[cARemaining] & *reinterpret_cast<const int *>(pA + cARemaining);
        if (fOnVal > fBestVal)
        {
            pBest = pA + cARemaining;
            iBestVal = iOnVal;
        }
    }
    return pBest;
}

The SSE code will look a lot like the maximum value code from a previous post:

const float *findLargestMaskedFloat_SSE2(const int *pMask, const float *pA, UINT cAOrig)
{
    if (cAOrig >= 4)
    {
        UINT uiA = cAOrig;
        // Seed with the first four masked values; their indices start at 0.
        __m128 mBestValues = _mm_and_ps(_mm_loadu_ps((const float *)pMask), _mm_loadu_ps(pA));
        __m128i mBestIndexBase = _mm_setzero_si128();
        while (uiA > 4)
        {
            uiA -= 4;
            // The mask takes the absolute value and zeroes disallowed lanes at once.
            __m128 mValues = _mm_and_ps(_mm_loadu_ps((const float *)&pMask[uiA]), _mm_loadu_ps(&pA[uiA]));
            __m128 mCompare = _mm_cmpgt_ps(mValues, mBestValues);
            if (_mm_movemask_ps(mCompare))
            {
                // At least one lane improved: blend in the new values and their base index.
                __m128i mIndexBase = _mm_set1_epi32(uiA);
                mBestValues = _mm_blend_ps_SSE2(mValues, mBestValues, mCompare);
                mBestIndexBase = _mm_blend_si128_SSE2(mIndexBase, mBestIndexBase, _mm_castps_si128(mCompare));
            }
        }
        // Convert the per-lane base index into an absolute element index.
        __m128i mBestIndex = _mm_add_epi32(mBestIndexBase, _mm_setr_epi32(0, 1, 2, 3));
        __m128 mValues;
        __m128 mCompare;

        // Horizontal reduction: fold the high pair of lanes onto the low pair...
        mValues = _mm_castsi128_ps(_mm_shuffle_epi32(_mm_castps_si128(mBestValues), _MM_SHUFFLE(2, 3, 2, 3)));
        mCompare = _mm_cmpgt_ps(mValues, mBestValues);
        mBestValues = _mm_blend_ps_SSE2(mValues, mBestValues, mCompare);
        mBestIndex = _mm_blend_si128_SSE2(_mm_shuffle_epi32(mBestIndex, _MM_SHUFFLE(2, 3, 2, 3)), mBestIndex, _mm_castps_si128(mCompare));

        // ...then fold lane 1 onto lane 0.  Only the index matters at this point.
        mValues = _mm_castsi128_ps(_mm_shuffle_epi32(_mm_castps_si128(mBestValues), _MM_SHUFFLE(1, 1, 1, 1)));
        mCompare = _mm_cmpgt_ps(mValues, mBestValues);
        // mBestValues = _mm_blend_ps_SSE2(mValues, mBestValues, mCompare);
        mBestIndex = _mm_blend_si128_SSE2(_mm_shuffle_epi32(mBestIndex, _MM_SHUFFLE(1, 1, 1, 1)), mBestIndex, _mm_castps_si128(mCompare));

        return pA + _mm_cvtsi128_si32(mBestIndex);
    }
    return findLargestMaskedFloat_Ansi(pMask, pA, cAOrig);
}

This will require a little more initialization, since ipiv now doubles as the absolute-value mask.
    for (int i = n; i-- > 0;)
    {
        ipiv[i] = 0x7fffffff;   // "not yet processed", and also the sign-stripping mask
    }
...
    float big = 0;
    int irow = 0, icol = 0;
    for (int j = 0; j < n; j++)
    {
        if (ipiv[j] )
        {
            const float *pfBestInRow = findLargestMaskedFloat_SSE2(ipiv, a[j], n);
            float fBestValInRow = fabsf(*pfBestInRow);
            if (fBestValInRow > big)
            {
                big = fBestValInRow;
                irow = j;
                icol = (int)(pfBestInRow - a[j]);
            }
        }
    }

    ipiv[icol] = 0;

The initialization code looks like it could be tightened further with SSE.  But let's tackle the next chunks in a future post.