MicroPerf: transposing a square matrix of floats in-place.

When evaluating a matrix multiplication, it can sometimes be easier to transpose the right hand matrix as a first step. Then each of the entries in the resulting matrix will just be a dot product, which SSE is very good at.

Transposing a matrix is really easy.  Transposing a square matrix in-place is really easy. Transposing a non-square matrix in-place is surprisingly difficult. Knuth and other have written extensively on the subject.

The SSE intrinsics headers include a macro for transposing a 4x4 square matrix. It is not difficult to extend this up to arbitrary square size:

void TransposeSquareMatrix_SSE(float *const pMatrix, const UINT nWidth)
{
    UINT y = nWidth;
    while (y & 3)
    {
        y--;
        UINT x = y;
        while (x >0)
        {
            x--;
            swap(pMatrix[x*nWidth + y], pMatrix[y*nWidth + x]);
        }
    }
    while (y >0)
    {
        y -= 4;
        UINT x = y;
        {
            __m128 row0;
            __m128 row1;
            __m128 row2;
            __m128 row3;
            row0 = _mm_loadu_ps(&pMatrix[(y + 0)*nWidth + x]);
            row1 = _mm_loadu_ps(&pMatrix[(y + 1)*nWidth + x]);
            row2 = _mm_loadu_ps(&pMatrix[(y + 2)*nWidth + x]);
            row3 = _mm_loadu_ps(&pMatrix[(y + 3)*nWidth + x]);
            _MM_TRANSPOSE4_PS(row0, row1, row2, row3);
            _mm_storeu_ps(&pMatrix[(x + 0)*nWidth + y], row0);
            _mm_storeu_ps(&pMatrix[(x + 1)*nWidth + y], row1);
            _mm_storeu_ps(&pMatrix[(x + 2)*nWidth + y], row2);
            _mm_storeu_ps(&pMatrix[(x + 3)*nWidth + y], row3);
        }
        while (x >0)
        {
            x -= 4;
            __m128 row0;
            __m128 row1;
            __m128 row2;
            __m128 row3;
            __m128 row4;
            __m128 row5;
            __m128 row6;
            __m128 row7;
            row0 = _mm_loadu_ps(&pMatrix[(y + 0)*nWidth + x]);
            row1 = _mm_loadu_ps(&pMatrix[(y + 1)*nWidth + x]);
            row2 = _mm_loadu_ps(&pMatrix[(y + 2)*nWidth + x]);
            row3 = _mm_loadu_ps(&pMatrix[(y + 3)*nWidth + x]);
            row4 = _mm_loadu_ps(&pMatrix[(x + 0)*nWidth + y]);
            row5 = _mm_loadu_ps(&pMatrix[(x + 1)*nWidth + y]);
            row6 = _mm_loadu_ps(&pMatrix[(x + 2)*nWidth + y]);
            row7 = _mm_loadu_ps(&pMatrix[(x + 3)*nWidth + y]);
            _MM_TRANSPOSE4_PS(row0, row1, row2, row3);
            _MM_TRANSPOSE4_PS(row4, row5, row6, row7);
            _mm_storeu_ps(&pMatrix[(y + 0)*nWidth + x], row4);
            _mm_storeu_ps(&pMatrix[(y + 1)*nWidth + x], row5);
            _mm_storeu_ps(&pMatrix[(y + 2)*nWidth + x], row6);
            _mm_storeu_ps(&pMatrix[(y + 3)*nWidth + x], row7);
            _mm_storeu_ps(&pMatrix[(x + 0)*nWidth + y], row0);
            _mm_storeu_ps(&pMatrix[(x + 1)*nWidth + y], row1);
            _mm_storeu_ps(&pMatrix[(x + 2)*nWidth + y], row2);
            _mm_storeu_ps(&pMatrix[(x + 3)*nWidth + y], row3);
        }
    }
}
This makes the assumption that the matrix is a single contiguous block of memory. The areas that can't be expressed as 4x4 transposes are handled first.
Then 4x4 blocks are performed. The 4x4 blocks along the diagonal are transposed in place. The 4x4 blocks not on the diagonal are transposed, then swapped across the diagonal.

We will see much more complicated transposes soon.

MicroPerf

Sunday, December 25, 2016

transposing a square matrix of floats in-place.

No comments:

Post a Comment