MicroPerf: Transposing a bit matrix using SSE2

In order to make a error correcting code more resilient to burst errors, it's not uncommon to interleave bits from multiple encoded words.

Intuitively we can think of making a 2d matrix with each row being a separate codeword. Interleaving the bits is then just taking the transpose of the bit matrix.

The book "Hacker's Delight" has an ansi algorithm for transposing a 8x8 bit matrix (64 bits total):

unsigned __int64 transpose8(unsigned __int64 x)
{
x = (x & 0xAA55AA55AA55AA55LL) | ((x & 0x00AA00AA00AA00AALL) << 7) | ((x >> 7) & 0x00AA00AA00AA00AALL);
x = (x & 0xCCCC3333CCCC3333LL) | ((x & 0x0000CCCC0000CCCCLL) << 14) | ((x >> 14) & 0x0000CCCC0000CCCCLL);
x = (x & 0xF0F0F0F00F0F0F0FLL) | ((x & 0x00000000F0F0F0F0LL) << 28) | ((x >> 28) & 0x00000000F0F0F0F0LL);
return x;
}

This can be directly translated into SSE code, which will transpose two separate 8x8 matrices at the same time:

__m128i BitMatrixTranspose_8x8_SSE(const __m128i &xIn)
{
const __m128i c1 = _mm_set1_epi64x(0xAA55AA55AA55AA55LL);
const __m128i c2 = _mm_set1_epi64x(0x00AA00AA00AA00AALL);
const __m128i c3 = _mm_set1_epi64x(0xCCCC3333CCCC3333LL);
const __m128i c4 = _mm_set1_epi64x(0x0000CCCC0000CCCCLL);
const __m128i c5 = _mm_set1_epi64x(0xF0F0F0F00F0F0F0FLL);
const __m128i c6 = _mm_set1_epi64x(0x00000000F0F0F0F0LL);

const __m128i x1 = _mm_or_si128(_mm_and_si128(xIn, c1), _mm_or_si128(_mm_slli_epi64(_mm_and_si128(xIn, c2), 7), _mm_and_si128(_mm_srli_epi64(xIn, 7), c2)));
const __m128i x2 = _mm_or_si128(_mm_and_si128(x1, c3), _mm_or_si128(_mm_slli_epi64(_mm_and_si128(x1, c4), 14), _mm_and_si128(_mm_srli_epi64(x1, 14), c4)));
const __m128i x3 = _mm_or_si128(_mm_and_si128(x2, c5), _mm_or_si128(_mm_slli_epi64(_mm_and_si128(x2, c6), 28), _mm_and_si128(_mm_srli_epi64(x2, 28), c6)));

return x3;
}

A transpose operation on a square matrix is self inverse. So the above function is self inverse.

This can then be extended to take the transpose of an 8x16 matrix and a 16x8 matrix (inverse functions of each other).

__m128i unpack8_SSE(const __m128i&xIn)
{
__m128i mRet = _mm_unpacklo_epi8(xIn, _mm_shuffle_epi32(xIn, _MM_SHUFFLE(3, 2, 3, 2)));
return mRet;
}

__m128i repack8_SSE(const __m128i&xIn)
{
return unpack8_SSE(unpack8_SSE(unpack8_SSE(xIn)));
}

__m128i BitMatrixTranspose_8x16_SSE(const __m128i &xIn)
{
__m128i mTransposedHalves = BitMatrixTranspose_8x8_SSE(xIn);
__m128i mRet = unpack8_SSE(mTransposedHalves);
return mRet;
}

__m128i BitMatrixTranspose_16x8_SSE(const __m128i &xIn)
{
__m128i mUnshuffle = repack8_SSE(xIn);
__m128i mRet = BitMatrixTranspose_8x8_SSE(mUnshuffle);
return mRet;
}

It's also worth noting that unpack8_SSE and repack8_SSE are inverses of each other.
This could be simplified using the _mm_shuffle_epi8 instruction. But that's a different problem.

MicroPerf

Friday, October 27, 2017

Transposing a bit matrix using SSE2

No comments:

Post a Comment