Tuesday, December 20, 2016

A sorting network for eight 32-bit numbers in SSE4.1

One of the fundamental operations in computer science is sorting.  And we are all familiar with a bunch of the algorithms like quicksort.  It's really hard to do a quicksort with SSE.  But some other sorting algorithms do work.

I see a lot of people talking about sorting networks using SSE, some pseudo-code, and some diagrams, but I have yet to see any code samples online for this.  And the devil is always in the details.

Let's start with sorting 4 elements, taking advantage of _mm_max_epi32, though the code would work equally well with floats, needing only SSE2, using the _mm_max_ps group of instruction.
Like complex multiplication, performing two short sorts at the same time isn't much more work than sorting one.

The sorting network for four elements is three levels (each level is one set of min/max).  The sorting network for eight elements is six levels deep.

void ShortSort_SSE41(const __m128i &a, const __m128i &b, __m128i &aOut, __m128i &bOut)
{
    __m128i ab_lo = _mm_unpacklo_epi64(a, b);
    __m128i ab_hi = _mm_unpackhi_epi64(a, b);
    __m128i ab1_min = _mm_min_epi32(ab_lo, ab_hi);
    __m128i ab1_max = _mm_max_epi32(ab_lo, ab_hi);
    __m128i a1 = ab1_min;
    __m128i b1 = _mm_shuffle_epi32(ab1_max, _MM_SHUFFLE(2, 3, 0, 1));
    __m128i ab2_min = _mm_min_epi32(a1, b1);
    __m128i ab2_max = _mm_max_epi32(a1, b1);
    __m128i a2 = _mm_unpacklo_epi32(ab2_min, ab2_max);
    __m128i b2 = _mm_unpackhi_epi32(ab2_min, ab2_max);
    __m128i ab3_lo = _mm_unpacklo_epi64(a2, b2);
    __m128i ab3_hi = _mm_unpackhi_epi64(a2, b2);
    __m128i ab3_min = _mm_min_epi32(ab3_lo, ab3_hi);
    __m128i ab3_max = _mm_max_epi32(ab3_lo, ab3_hi);
    aOut = _mm_unpacklo_epi32(ab3_min, ab3_max);
    bOut = _mm_unpackhi_epi32(ab3_min, ab3_max);
}


Like complex number multiplication, it is about the same amount of work to do two at the same time.  So the above ShortSort sorts two separate four element arrays into increasing monotonic order.
Note that this code will work passing the same variable as both input and output for each of the lists.

__forceinline void SortLevel_SSE41(const __m128i &a, const __m128i &b, __m128i &ab_left, __m128i &ab_right)
{
    __m128i ab1_min = _mm_min_epi32(a, b);
    __m128i ab1_max = _mm_max_epi32(a, b);
    ab_left = _mm_unpacklo_epi32(ab1_min, ab1_max);
    ab_right = _mm_unpackhi_epi32(ab1_min, ab1_max);
}

__forceinline __m128i Reverse_epi32_SSE2(const __m128i &a)
{
    return _mm_shuffle_epi32(a, _MM_SHUFFLE(0, 1, 2, 3));
}

void Merge_SSE41(const __m128i &a, const __m128i &b, __m128i &ab_left, __m128i &ab_right)
{
    __m128i ab1_left, ab1_right;
    __m128i ab2_left, ab2_right;
    SortLevel_SSE41(a, Reverse_epi32_SSE2(b), ab1_left, ab1_right);
    SortLevel_SSE41(ab1_left, ab1_right, ab2_left, ab2_right);
    SortLevel_SSE41(ab2_left, ab2_right, ab_left, ab_right);
}


This takes two monotonic lists as input, and returns the sorted merged list in the two output variables.
As before, the same variable could be used for both input and output.
This is largely based on the network in the standard computer science textbook "Introduction to Algorithms" by Cormen, Leiserson, and Rivest.  This does diverge from the text in the final level.  The same level being used three times is just too good not to use.  The code is tested using the 0/1 method described in the textbook.

A call to these would be something like:

        int a[8] =....
        __m128i aa, bb;
        aa = _mm_loadu_si128((const __m128i *)&a[0]);
        bb = _mm_loadu_si128((const __m128i *)&a[4]);
        ShortSort_SSE41(aa, bb, aa, bb);
        Merge_SSE41(aa, bb, aa, bb);
        _mm_storeu_si128((__m128i *)&a[0], aa);
        _mm_storeu_si128((__m128i *)&a[4], bb);


 It's not a general sorting solution, but it seems like it could be useful.

No comments:

Post a Comment