Saturday, November 4, 2017

number of leading zeroes in SSE2

The number of leading zeroes is often useful when doing division-like operations.  It's also particularly useful when each bit has been used to represent some sort of condition (similar to what was seen in the strlen post). 

The usefulness of this was formalized in the instruction set with _mm_lzcnt_epi32 and _lzcnt_u32.  But the __m128i processor instruction requires that the processor supports both AVX512VL + AVX512CD, and I don't own one of those computers yet.  The bitscan instructions have been available for over a decade, but don't handle __m128i

Here's a baseline implementation and an intrinsic instruction implementation.

unsigned nlz_baseline(unsigned x)
{
    unsigned uRet = 32;
    while (x)
    {
        uRet--;
        x = x>>1;
    }
    return uRet;
}


unsigned long nlz_bsr(unsigned long mask)
{
    unsigned long index;
    if (_BitScanReverse(&index, mask))
    {
        return 31 - index;
    }
    return 32;
}


The book "Hacker's delight" has a branch free implementation that translates directly to SSE2.  The lack of a conversion operator between epu32 and single floats in the SSE2 instruction set makes many of the bit re-interpretation strategies not work.

unsigned nlz(unsigned x)
{
    int y, m, n;
    y = -(x >> 16);
    m = (y >> 16) & 16;
    n = 16 - m;
    x = x >> m;

    y = x - 0x100;
    m = (y >> 16) & 8;
    n = n + m;
    x = x << m;

    y = x - 0x1000;
    m = (y >> 16) & 4;
    n = n + m;
    x = x << m;

    y = x - 0x4000;
    m = (y >> 16) & 2;
    n = n + m;
    x = x << m;

    y = x >> 14;
    m = y&~(y >> 1);
    return n + 2 - m;
}


__m128i nlz_SSE2(const __m128i &xIn)
{
    __m128i y, m, n;

    y = _mm_sub_epi32(_mm_setzero_si128(),_mm_srli_epi32(xIn, 16));
    m = _mm_and_si128(_mm_set1_epi32(16), _mm_srai_epi32(y, 16));
    n = _mm_sub_epi32(_mm_set1_epi32(16),m);
    __m128i x = _mm_srlv_epi32(xIn, m);

    y = _mm_sub_epi32(x, _mm_set1_epi32(0x100));
    m = _mm_and_si128(_mm_srai_epi32(y, 16), _mm_set1_epi32(8));
    n = _mm_add_epi32(n, m);
    x = _mm_sllv_epi32(x, m);

    y = _mm_sub_epi32(x, _mm_set1_epi32(0x1000));
    m = _mm_and_si128(_mm_srai_epi32(y, 16), _mm_set1_epi32(4));
    n = _mm_add_epi32(n, m);
    x = _mm_sllv_epi32(x, m);

    y = _mm_sub_epi32(x, _mm_set1_epi32(0x4000));
    m = _mm_and_si128(_mm_srai_epi32(y, 16), _mm_set1_epi32(2));
    n = _mm_add_epi32(n, m);
    x = _mm_sllv_epi32(x, m);

    y = _mm_srli_epi32(x, 14);
    m = _mm_andnot_si128(_mm_srai_epi32(y, 1),y);
    __m128i mRet = _mm_sub_epi32(_mm_add_epi32(n, _mm_set1_epi32(2)),m);
   
    return mRet;
}


Since this is written only for 32 bit integer numbers, every possible 32bit number is confirmed against the baseline implementation.