Wednesday, December 20, 2017

Converting integers to larger integer types

We often need to convert an integer into a larger type.  To keep the same value, signed numbers are sign extended. Unsigned numbers do not have a sign bit, so they are left padded with zeros.
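
As a scalar illustration of the difference, here is a minimal sketch that widens the same 8 bit pattern both ways. The helper names sign_extend_bits and zero_extend_bits are made up for this example, and the cast to a signed type assumes the usual two's complement behavior.

```cpp
#include <cstdint>

// Hypothetical helper: treat the 8 bits as signed, then widen to 16 bits.
// The new upper byte is filled with copies of the sign bit.
inline std::uint16_t sign_extend_bits(std::uint8_t bits)
{
    const std::int8_t asSigned = static_cast<std::int8_t>(bits);
    const std::int16_t widened = asSigned; // sign extension
    return static_cast<std::uint16_t>(widened);
}

// Hypothetical helper: treat the 8 bits as unsigned, then widen to 16 bits.
// The new upper byte is filled with zeros.
inline std::uint16_t zero_extend_bits(std::uint8_t bits)
{
    return bits; // zero extension
}
```

With the input 0xF0, the signed path gives 0xFFF0 (still -16) and the unsigned path gives 0x00F0 (still 240).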

The unsigned versions are pretty obvious:

void _mm_cvtepu32_epi64(const __m128i &mIn, __m128i &mOutLo, __m128i &mOutHi)
{
 mOutLo = _mm_unpacklo_epi32(mIn, _mm_setzero_si128());
 mOutHi = _mm_unpackhi_epi32(mIn, _mm_setzero_si128());
}

void _mm_cvtepu16_epi32(const __m128i &mIn, __m128i &mOutLo, __m128i &mOutHi)
{
 mOutLo = _mm_unpacklo_epi16(mIn, _mm_setzero_si128());
 mOutHi = _mm_unpackhi_epi16(mIn, _mm_setzero_si128());
}

void _mm_cvtepu8_epi16(const __m128i &mIn, __m128i &mOutLo, __m128i &mOutHi)
{
 mOutLo = _mm_unpacklo_epi8(mIn, _mm_setzero_si128());
 mOutHi = _mm_unpackhi_epi8(mIn, _mm_setzero_si128());
}
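
To see the zero interleave at work on real data, here is a small sketch that widens four 32 bit unsigned values through memory. The wrapper name widen_u32_to_u64 and the array based interface are assumptions for this example; only the unpack with zero comes from the functions above.

```cpp
#include <emmintrin.h> // SSE2
#include <cstdint>

// Hypothetical wrapper: zero extends four uint32_t values into four uint64_t values.
inline void widen_u32_to_u64(const std::uint32_t in[4], std::uint64_t out[4])
{
    const __m128i mIn   = _mm_loadu_si128(reinterpret_cast<const __m128i *>(in));
    const __m128i mZero = _mm_setzero_si128();
    // Interleaving with zero places a zero 32 bit word above each input value.
    const __m128i mLo = _mm_unpacklo_epi32(mIn, mZero); // input lanes 0 and 1
    const __m128i mHi = _mm_unpackhi_epi32(mIn, mZero); // input lanes 2 and 3
    _mm_storeu_si128(reinterpret_cast<__m128i *>(out), mLo);
    _mm_storeu_si128(reinterpret_cast<__m128i *>(out + 2), mHi);
}
```

The values keep their order: out[0] and out[1] come from the low half, out[2] and out[3] from the high half.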

The signed versions aren't supported directly until later iterations of the instruction set (SSE4.1 added _mm_cvtepi32_epi64 and other similar instructions), but they aren't too difficult to produce from the base intrinsics.

void _mm_cvtepi16_epi32(const __m128i &mIn, __m128i &mOutLo, __m128i &mOutHi)
{
 __m128i mDupedLo = _mm_unpacklo_epi16(mIn, mIn);
 __m128i mDupedHi = _mm_unpackhi_epi16(mIn, mIn);
 mOutLo = _mm_srai_epi32(mDupedLo, 16);
 mOutHi = _mm_srai_epi32(mDupedHi, 16);
}

void _mm_cvtepi8_epi16(const __m128i &mIn, __m128i &mOutLo, __m128i &mOutHi)
{
 __m128i mDupedLo = _mm_unpacklo_epi8(mIn, mIn);
 __m128i mDupedHi = _mm_unpackhi_epi8(mIn, mIn);
 mOutLo = _mm_srai_epi16(mDupedLo, 8);
 mOutHi = _mm_srai_epi16(mDupedHi, 8);
}

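The trick in both functions is that unpacking the input with itself puts a copy of each value in the upper half of a wider lane, and the arithmetic shift then moves that copy down while filling the vacated bits with the sign. A rough sketch of the 16 to 32 bit case through memory, where the wrapper name widen_i16_to_i32 and its array interface are assumptions for this example:

```cpp
#include <emmintrin.h> // SSE2
#include <cstdint>

// Hypothetical wrapper: sign extends eight int16_t values into eight int32_t values.
inline void widen_i16_to_i32(const std::int16_t in[8], std::int32_t out[8])
{
    const __m128i mIn = _mm_loadu_si128(reinterpret_cast<const __m128i *>(in));
    // Each 32 bit lane now holds the same 16 bit value in both halves.
    const __m128i mDupedLo = _mm_unpacklo_epi16(mIn, mIn);
    const __m128i mDupedHi = _mm_unpackhi_epi16(mIn, mIn);
    // The arithmetic shift moves the upper copy down and fills with the sign bit.
    _mm_storeu_si128(reinterpret_cast<__m128i *>(out), _mm_srai_epi32(mDupedLo, 16));
    _mm_storeu_si128(reinterpret_cast<__m128i *>(out + 4), _mm_srai_epi32(mDupedHi, 16));
}
```

The lower copy of each value is simply shifted out, so its contents never matter.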

But since there isn't a 64 bit arithmetic shift in the base instruction set, we need to do something different for the promotion from 32 bit signed to 64 bit signed.  An arithmetic shift right by 31 extends the sign bit of each of the 4 lanes to all 32 bits of that lane.

Then we interleave the 32 bit results from the signs and values into the final 64 bit outputs.

void _mm_cvtepi32_epi64(const __m128i &mIn, __m128i &mOutLo, __m128i &mOutHi)
{
 __m128i mSigns = _mm_srai_epi32(mIn, 31);
 mOutLo = _mm_unpacklo_epi32(mIn, mSigns);
 mOutHi = _mm_unpackhi_epi32(mIn, mSigns);
}

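A quick way to convince yourself that the sign mask interleave is correct is to run it over values near the limits of the 32 bit range. This sketch assumes a hypothetical array based wrapper named widen_i32_to_i64, with the same body as the function above:

```cpp
#include <emmintrin.h> // SSE2
#include <cstdint>

// Hypothetical wrapper: sign extends four int32_t values into four int64_t values.
inline void widen_i32_to_i64(const std::int32_t in[4], std::int64_t out[4])
{
    const __m128i mIn = _mm_loadu_si128(reinterpret_cast<const __m128i *>(in));
    // Every lane becomes all ones for a negative value and all zeros otherwise.
    const __m128i mSigns = _mm_srai_epi32(mIn, 31);
    // Each value becomes the low word and its sign mask the high word.
    _mm_storeu_si128(reinterpret_cast<__m128i *>(out), _mm_unpacklo_epi32(mIn, mSigns));
    _mm_storeu_si128(reinterpret_cast<__m128i *>(out + 2), _mm_unpackhi_epi32(mIn, mSigns));
}
```

A negative input ends up with 0xFFFFFFFF in its upper word and a non-negative input with zero, which is exactly what sign extension to 64 bits requires.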

This requires 3 registers rather than the 2 registers of the other width versions.