The following two loops produce the same output, in opposite order.
void doLoops(unsigned int *piNumIterations, unsigned int *pOut)
{
for (unsigned int uiLoopControl = 0; uiLoopControl < *piNumIterations; uiLoopControl++)
{
*pOut = uiLoopControl;
printf("%d\n", uiLoopControl);
}
for (unsigned int uiLoopControl = *piNumIterations; uiLoopControl-- >0; )
{
*pOut = uiLoopControl;
printf("%d\n", uiLoopControl);
}
}
Aliasing is the reality that two different pointers may point to the same memory. In this case piNumIterations and pOut could point to the same underlying unsigned int. In the first loop, this means the compiler is obligated to compare the loop variable against memory on every iteration.
for (unsigned int uiLoopControl = 0; uiLoopControl < *piNumIterations; uiLoopControl++)
000000013FF6107F xor ebx,ebx
000000013FF61081 mov rdi,rdx
000000013FF61084 mov rsi,rcx
000000013FF61087 cmp dword ptr [rcx],ebx
000000013FF61089 jbe doLoops+36h (013FF610A6h)
000000013FF6108B nop dword ptr [rax+rax]
{
*pOut = uiLoopControl;
printf("%d\n", uiLoopControl);
000000013FF61090 mov edx,ebx
000000013FF61092 mov dword ptr [rdi],ebx
000000013FF61094 lea rcx,[string "%d\n" (013FF62200h)]
000000013FF6109B call printf (013FF61010h)
000000013FF610A0 inc ebx
000000013FF610A2 cmp ebx,dword ptr [rsi]
000000013FF610A4 jb doLoops+20h (013FF61090h)
}
for (unsigned int uiLoopControl = *piNumIterations; uiLoopControl-- >0; )
000000013FF610A6 mov ebx,dword ptr [rsi]
000000013FF610A8 test ebx,ebx
000000013FF610AA je doLoops+56h (013FF610C6h)
000000013FF610AC nop dword ptr [rax]
000000013FF610B0 dec ebx
{
*pOut = uiLoopControl;
printf("%d\n", uiLoopControl);
000000013FF610B2 lea rcx,[string "%d\n" (013FF62200h)]
000000013FF610B9 mov edx,ebx
000000013FF610BB mov dword ptr [rdi],ebx
000000013FF610BD call printf (013FF61010h)
000000013FF610C2 test ebx,ebx
000000013FF610C4 jne doLoops+40h (013FF610B0h)
}
}
This means that the second loop can be preferable if only the number of iterations is important, not the order.
The second loop also uses one fewer register inside the loop, so in more realistic scenarios it is less likely to cause a register spill.
While the example above uses raw pointers, the more common real-world case is memory reached through a struct or class member being aliased. Some of this can be mitigated with the __restrict keyword. When possible, though, it's better to structure the code (for example, by copying the value into a local) so the compiler can see that no aliasing is possible.
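As a minimal sketch (using the MSVC __restrict spelling, and not part of the original example), marking both parameters __restrict promises the compiler that the two pointers never alias, so in principle the iteration count can be loaded into a register once before the loop:
void doLoopsNoAlias(unsigned int * __restrict piNumIterations, unsigned int * __restrict pOut)
{
    // With the no-alias promise, *piNumIterations only needs to be read once.
    for (unsigned int uiLoopControl = 0; uiLoopControl < *piNumIterations; uiLoopControl++)
    {
        *pOut = uiLoopControl;
        printf("%u\n", uiLoopControl);
    }
}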
Thursday, December 29, 2016
Tuesday, December 27, 2016
Image convolution using a 4x4 filter
Some of the instructions are helpful in surprising ways.
SSSE3 introduced instructions for alignment. These are useful for stream processing in one dimension. Conceptually, linear filtering computes each output element as a weighted sum: each nearby input element is multiplied by the corresponding filter coefficient. The filter being zero outside a finite range is what makes this tractable in the spatial (rather than frequency) domain.
This concept can be extended to multiple dimensions.
Some special-case filters are linearly separable, but this is not true in general.
The code below calculates a two-dimensional image convolution for a 4x4 filter. The filter is treated as though its center is located at position (1,1) within the filter.
void convolveGeneral_4x4Filter(const unsigned int iWidth, const unsigned int iHeight, const unsigned char * const pInputImage, unsigned char * const pOutputImage, const signed char * pFilter, const unsigned iWhichLine, const unsigned minY, const unsigned maxY, const int iFilterNormalize)
{
for (unsigned int iY = minY; iY<=maxY; iY++)
{
int iPixelSum = 0;
for (int mx = 0; mx <4; mx++)
{
for (int my =0;my<4;my++)
{
signed char FilterWeight = pFilter[my * 4 + mx];
int iImageX = max(0, min((int)(iWhichLine + mx-1), (int)iWidth-1));
int iImageY = max(0, min((int)(iY + my-1), (int)iHeight-1));
iPixelSum += FilterWeight * pInputImage[iWidth*iImageY + iImageX];
}
}
pOutputImage[iY*iWidth + iWhichLine] = iPixelSum / iFilterNormalize;
}
}
Pixels outside the edges of the source image are a special case, and there are a couple of ways to deal with them. The code above handles this by clamping: pixels beyond the edge of the image are treated as if the edge value continues indefinitely. Other possible choices are to treat outside pixels as black, or as reflections of the source image.
Performing this same convolution using SSE needs to be done in columns. We shift in four pixels at a time and keep a running group of 16. A dot product of that group of 16 pixels against the filter is then performed by a single multiply-add instruction, followed by a horizontal sum. The code does not handle the special cases at the edges of the image.
void convolveLine_4x4Filter_SSE(const unsigned int iWidth, const unsigned int iHeight, const unsigned char * const pInputImage, unsigned char * const pOutputImage, const __m128i &mFilter, const unsigned iWhichLine, const int iFilterNormalize)
{
__m128i mImagePixels = _mm_setzero_si128();
const unsigned char *pInputPixelToLoad = &pInputImage[iWhichLine-1];
for (unsigned int iY = 3; iY-->0; )
{
unsigned long ulPixels = *reinterpret_cast<const unsigned long*>(pInputPixelToLoad);
__m128i mPixelsToAdd = _mm_cvtsi32_si128(ulPixels);
mImagePixels = _mm_alignr_epi8(mPixelsToAdd, mImagePixels, 4);
pInputPixelToLoad += iWidth;
}
unsigned char *pOutputPixel = &pOutputImage[iWhichLine+iWidth];
for (UINT_PTR iY =(iHeight-3);iY-->0 ;)
{
unsigned long ulPixels = *reinterpret_cast<const unsigned long*>(pInputPixelToLoad);
mImagePixels = _mm_alignr_epi8(_mm_cvtsi32_si128(ulPixels), mImagePixels,4);
const int iPixel = horizontalSum_epi16_SSSE3(_mm_maddubs_epi16(mImagePixels, mFilter))/iFilterNormalize;
*pOutputPixel = iPixel;
pInputPixelToLoad += iWidth;
pOutputPixel += iWidth;
}
}
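The horizontalSum_epi16_SSSE3 helper called above is not defined in this post. One possible implementation (my own sketch, widening to 32 bits before reducing to avoid 16-bit overflow):
__forceinline int horizontalSum_epi16_SSSE3(const __m128i &m)
{
    // widen the eight 16-bit lanes to four 32-bit pair sums, then reduce
    __m128i mPairs = _mm_madd_epi16(m, _mm_set1_epi16(1));
    __m128i mSum = _mm_hadd_epi32(mPairs, mPairs);
    mSum = _mm_hadd_epi32(mSum, mSum);
    return _mm_cvtsi128_si32(mSum);
}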
Monday, December 26, 2016
Further horizontal sums - SSE3 and SSSE3 implementations
Any horizontal processing is difficult in SSE, but it was made easier by two instruction-set updates with confusingly similar names: SSE3 and SSSE3.
This allows for more compact versions of the horizontal sums needed at the end of dot products and such.
Each of the horizontal add instructions, including _mm_hadd_epi32 and _mm_hadd_ps, horizontally adds two adjacent elements of two parameters. Since we are looking for the sum of all four elements, this will require two calls to the instruction. The contribution of the second parameter to the result is ignored. But we don't want to involve an extra register, so we pass the same value in twice in each of the calls.
__forceinline float horizontalSum_SSE3(const __m128 &mABCD)
{
__m128 mApB_CpD= _mm_hadd_ps(mABCD, mABCD);
__m128 mApBpCpD = _mm_hadd_ps(mApB_CpD, mApB_CpD);
return _mm_cvtss_f32(mApBpCpD);
}
__forceinline int horizontalSum_SSSE3(const __m128i &mABCD)
{
__m128i mApB_CpD = _mm_hadd_epi32(mABCD, mABCD);
__m128i mApBpCpD = _mm_hadd_epi32(mApB_CpD, mApB_CpD);
return _mm_cvtsi128_si32(mApBpCpD);
}
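A hypothetical quick check of the integer version:
__m128i mFour = _mm_setr_epi32(1, 2, 3, 4);
int iSum = horizontalSum_SSSE3(mFour); // 1 + 2 + 3 + 4 = 10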
There are getting to be a lot of different combinations of instruction sets. We will have to address this difficulty in an upcoming post.
Sunday, December 25, 2016
different ways to blend
SSE2 has the instruction _mm_andnot_si128 that is included for the purpose of allowing bitwise blend operations:
__forceinline __m128i _mm_bitwise_blend_v1_si128(__m128i a, __m128i b, __m128i mask)
{
return _mm_or_si128(_mm_and_si128(mask, b), _mm_andnot_si128(mask, a));
}
This selects, bit by bit, a where the mask bit is 0 and b where the mask bit is 1.
This allows us to do things like minimum and maximum even without the _mm_max_epi32 instruction introduced with SSE41:
__m128i m_a_lt_b = _mm_cmplt_epi32(a, b);
__m128i m_min = _mm_bitwise_blend_v1_si128(b, a, m_a_lt_b);
__m128i m_max = _mm_bitwise_blend_v1_si128(a, b, m_a_lt_b);
There are other ways of performing the same blend operation:
__forceinline __m128i _mm_bitwise_blend_v2_si128(__m128i a, __m128i b, __m128i mask)
{
return _mm_xor_si128(a, _mm_and_si128(_mm_xor_si128(a, b), mask));
}
This can sometimes be better than the first version. The generated assembly of the first version finishes with mask still in a register, while the generated assembly of the second version has the benefit of leaving a in a register afterwards.
This blend operation was formalized into the instruction set with the SSE41 instruction _mm_blendv_epi8 and its relatives. When the processor supports SSE41, those are much better.
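For comparison, a sketch of that SSE41 form (the wrapper name is my own). Note that _mm_blendv_epi8 selects per byte based on the high bit of each mask byte, which matches the bitwise versions whenever the mask comes from a compare (all ones or all zeros per lane):
__forceinline __m128i _mm_bitwise_blend_v3_si128_SSE41(__m128i a, __m128i b, __m128i mask)
{
    return _mm_blendv_epi8(a, b, mask);
}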
transposing a square matrix of floats in-place.
When evaluating a matrix multiplication, it can sometimes be easier to transpose the right hand matrix as a first step. Then each of the entries in the resulting matrix will just be a dot product, which SSE is very good at.
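As a small illustration (not from the original post): with Bt holding the transpose of B, each element of C = A*B is the dot product of a row of A with a row of Bt. This sketch uses the horizontalSum_SSE3 helper from the post above.
void MatMul4x4_WithTransposedB(const float *pA, const float *pBt, float *pC)
{
    for (int i = 0; i < 4; i++)
    {
        __m128 mRowA = _mm_loadu_ps(&pA[i * 4]);
        for (int j = 0; j < 4; j++)
        {
            // one dot product per output element
            __m128 mRowBt = _mm_loadu_ps(&pBt[j * 4]);
            pC[i * 4 + j] = horizontalSum_SSE3(_mm_mul_ps(mRowA, mRowBt));
        }
    }
}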
Transposing a matrix is really easy. Transposing a square matrix in-place is really easy. Transposing a non-square matrix in-place is surprisingly difficult. Knuth and others have written extensively on the subject.
The SSE intrinsics headers include a macro for transposing a 4x4 square matrix. It is not difficult to extend this up to arbitrary square size:
void TransposeSquareMatrix_SSE(float *const pMatrix, const UINT nWidth)
{
UINT y = nWidth;
while (y & 3)
{
y--;
UINT x = y;
while (x >0)
{
x--;
swap(pMatrix[x*nWidth + y], pMatrix[y*nWidth + x]);
}
}
while (y >0)
{
y -= 4;
UINT x = y;
{
__m128 row0;
__m128 row1;
__m128 row2;
__m128 row3;
row0 = _mm_loadu_ps(&pMatrix[(y + 0)*nWidth + x]);
row1 = _mm_loadu_ps(&pMatrix[(y + 1)*nWidth + x]);
row2 = _mm_loadu_ps(&pMatrix[(y + 2)*nWidth + x]);
row3 = _mm_loadu_ps(&pMatrix[(y + 3)*nWidth + x]);
_MM_TRANSPOSE4_PS(row0, row1, row2, row3);
_mm_storeu_ps(&pMatrix[(x + 0)*nWidth + y], row0);
_mm_storeu_ps(&pMatrix[(x + 1)*nWidth + y], row1);
_mm_storeu_ps(&pMatrix[(x + 2)*nWidth + y], row2);
_mm_storeu_ps(&pMatrix[(x + 3)*nWidth + y], row3);
}
while (x >0)
{
x -= 4;
__m128 row0;
__m128 row1;
__m128 row2;
__m128 row3;
__m128 row4;
__m128 row5;
__m128 row6;
__m128 row7;
row0 = _mm_loadu_ps(&pMatrix[(y + 0)*nWidth + x]);
row1 = _mm_loadu_ps(&pMatrix[(y + 1)*nWidth + x]);
row2 = _mm_loadu_ps(&pMatrix[(y + 2)*nWidth + x]);
row3 = _mm_loadu_ps(&pMatrix[(y + 3)*nWidth + x]);
row4 = _mm_loadu_ps(&pMatrix[(x + 0)*nWidth + y]);
row5 = _mm_loadu_ps(&pMatrix[(x + 1)*nWidth + y]);
row6 = _mm_loadu_ps(&pMatrix[(x + 2)*nWidth + y]);
row7 = _mm_loadu_ps(&pMatrix[(x + 3)*nWidth + y]);
_MM_TRANSPOSE4_PS(row0, row1, row2, row3);
_MM_TRANSPOSE4_PS(row4, row5, row6, row7);
_mm_storeu_ps(&pMatrix[(y + 0)*nWidth + x], row4);
_mm_storeu_ps(&pMatrix[(y + 1)*nWidth + x], row5);
_mm_storeu_ps(&pMatrix[(y + 2)*nWidth + x], row6);
_mm_storeu_ps(&pMatrix[(y + 3)*nWidth + x], row7);
_mm_storeu_ps(&pMatrix[(x + 0)*nWidth + y], row0);
_mm_storeu_ps(&pMatrix[(x + 1)*nWidth + y], row1);
_mm_storeu_ps(&pMatrix[(x + 2)*nWidth + y], row2);
_mm_storeu_ps(&pMatrix[(x + 3)*nWidth + y], row3);
}
}
}
This makes the assumption that the matrix is a single contiguous block of memory. The areas that can't be expressed as 4x4 transposes are handled first.
Then the 4x4 blocks are processed. The 4x4 blocks along the diagonal are transposed in place. The 4x4 blocks not on the diagonal are transposed, then swapped across the diagonal.
We will see much more complicated transposes soon.
Friday, December 23, 2016
Polynomials - Horner Polynomial Evaluation for Sin(x) and Cos(x)
The SSE instruction set is missing a lot of the transcendental functions, like Sin and Cos.
But we can approximate them using polynomials. In particular, polynomials can be evaluated efficiently in Horner form, which evaluates a polynomial of degree n using n multiplies and n additions.
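As a scalar reference (my own, not from the original post), Horner form is just the nested rewrite c0 + x*(c1 + x*(c2 + x*c3)), one multiply and one add per step:
float horner(const float x, const float * const coefs, const unsigned int cCoefs)
{
    if (cCoefs == 0)
    {
        return 0.0f;
    }
    unsigned int i = cCoefs - 1;
    float p = coefs[i];
    while (i > 0)
    {
        i--;
        p = p * x + coefs[i];
    }
    return p;
}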
The coefficients for the polynomial can be calculated by a series of integrations. Mathematica makes this easy:
__m128 _mm_horner_ps(const __m128 &mx, const float * const coefs, const UINT cCoefs)
{
if (cCoefs <= 0)
{
return _mm_setzero_ps();
}
UINT uiRemainingCoefs = cCoefs-1;
__m128 mp = _mm_set1_ps(coefs[uiRemainingCoefs]);
while (uiRemainingCoefs > 0)
{
uiRemainingCoefs--;
mp = _mm_add_ps(_mm_mul_ps(mx, mp), _mm_set1_ps(coefs[uiRemainingCoefs]));
}
return mp;
}
__declspec(align(16)) static const UINT uiAbs[4] = { 0x7fffffff,0x7fffffff,0x7fffffff,0x7fffffff };
__m128 _mm_abs_ps(const __m128 mx)
{
return _mm_and_ps(mx, _mm_load_ps(reinterpret_cast<const float *>(uiAbs)));
}
__m128 _mm_hornerEven_ps(const __m128 &mx, const float * const coefs, const UINT cCoefs)
{
if (cCoefs <= 0)
{
return _mm_setzero_ps();
}
UINT uiRemainingCoefs = cCoefs - 1;
__m128 ax = _mm_abs_ps(mx);
__m128 mp = _mm_set1_ps(coefs[uiRemainingCoefs]);
while (uiRemainingCoefs > 0)
{
uiRemainingCoefs--;
mp = _mm_add_ps(_mm_mul_ps(ax, mp), _mm_set1_ps(coefs[uiRemainingCoefs]));
}
return mp;
}
__declspec(align(16)) static const UINT uiSign[4] = { 0x80000000,0x80000000,0x80000000,0x80000000 };
__m128 _mm_hornerOdd_ps(const __m128 &mx, const float * const coefs, const UINT cCoefs)
{
if (cCoefs <= 0)
{
return _mm_setzero_ps();
}
UINT uiRemainingCoefs = cCoefs - 1;
__m128 ax = _mm_abs_ps(mx);
__m128 mSignX = _mm_andnot_ps(_mm_load_ps(reinterpret_cast<const float *>(uiAbs)), mx);
__m128 mp = _mm_set1_ps(coefs[uiRemainingCoefs]);
while (uiRemainingCoefs > 0)
{
uiRemainingCoefs--;
mp = _mm_add_ps(_mm_mul_ps(ax, mp), _mm_set1_ps(coefs[uiRemainingCoefs]));
}
__m128 mResult = _mm_xor_ps(mp, mSignX);
return mResult;
}
const float mCosCoefs[] = { 0.99856677684663489786f, 0, -0.49527656685594188504f, 0, 0.039207901766468234142f, 0, -0.00096832988461029163825f };
const float mCosEvenCoefs[] = { 0.99557640122455577189f, 0.069390161928493746222f, \
- 0.67150059990790674484f, 0.14249685301935530176f };
const float mSinOddCoefs[] = { 0.00059003711537529755905f, 0.98659497273144593606f, \
0.048989468371295105117f, -0.23111360502840283503f, \
0.036782872656058228926f };
This can then be called simply:
__m128 mCos = _mm_horner_ps(mx, mCosCoefs, ARRAYSIZE(mCosCoefs));
__m128 mCosEven = _mm_hornerEven_ps(mx, mCosEvenCoefs, ARRAYSIZE(mCosEvenCoefs));
__m128 mSinOdd= _mm_hornerOdd_ps(mx, mSinOddCoefs, ARRAYSIZE(mSinOddCoefs));
We can get slightly better approximations if we exploit the even/odd properties of the functions we are approximating.
Both the even and odd evaluation functions mask away the sign bits of the calling values. The odd evaluation function then applies the sign bits from the original calling values to the output result.
These methods with only a few coefficients give reasonable approximations over the range (-Pi,Pi).
Thursday, December 22, 2016
Fixed point arithmetic - 16.16 signed int multiplication - and the sigmoid function
In neural network code I've seen a lot of arithmetic using fixed point numbers. Ordinarily I've seen the low 16 bits used for the fractional portion and the high 16 bits used for the whole portion.
Addition and subtraction use the ordinary int addition and subtraction.
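For concreteness (these helpers are my own, not from the original post), a 16.16 value represents the real number value/65536.0, so 1.0 is 0x00010000 and 1.5 is 0x00018000:
__forceinline int toFixed16_16(const float f)
{
    return (int)(f * 65536.0f);
}
__forceinline float fromFixed16_16(const int x)
{
    return x / 65536.0f;
}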
Multiplication of two 16.16 fixed point numbers is slightly more complicated:
#ifndef Int32x32To64
// While this is normally a single-instruction x86-64 intrinsic, declare a fallback for ANSI compatibility...
__forceinline __int64 Int32x32To64(long a, long b)
{
__int64 iRet = (((__int64)((long)(a))) * ((__int64)((long)(b))));
return iRet;
}
#endif // !Int32x32To64
int iMul(const int a, const int b)
{
int iRet = (int)(Int32x32To64(a, b) >> 16);
return iRet;
}
Division of these numbers can similarly be expressed in terms of standard 64bit arithmetic.
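A sketch of that division (my own, assuming b is non-zero and the quotient fits in 32 bits): widen the dividend, pre-shift by 16, then do an ordinary 64-bit divide.
__forceinline int iDiv(const int a, const int b)
{
    // (a/b) in 16.16: shift the numerator up by 16 before dividing
    return (int)((((__int64)a) << 16) / b);
}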
It seems like it could be useful to do the multiply in four parallel lanes using SSE. The code is very similar to _mm_mul_epi32_SSE2.
__m128i _mm_fixedMul_epi32_SSE41(const __m128i &x, const __m128i &y)
{
__m128i mxy02i = _mm_mul_epi32(x, y);
__m128i mxs = _mm_srli_si128(x, 4);
__m128i mys = _mm_srli_si128(y, 4);
__m128i mxy13i = _mm_mul_epi32(mxs,mys);
__m128i mxy02i_s = _mm_srli_si128(mxy02i, 2);
__m128i mxy13i_s = _mm_slli_si128(mxy13i, 2);
__m128i mxy0123 = _mm_blend_epi16(mxy02i_s, mxy13i_s, 0xCC);
return mxy0123;
}
The sigmoid function is often useful with artificial neural networks:
float sigmoid(const float x)
{
return (1 / (1 + expf(-x)));
}
This can be approximated as a polynomial, which can then be expressed in fixed point. The difference from ideal is only ~1.5%:
int isigmoid(const int x)
{
// The approximation for the right half in HornerForm is : y = 0. + x (20185.5 + x (-4794.54 + x(513.987 - 20.7585 x) ))
// Once we wiggle around a little bit for the integer truncation it becomes: y = 0. + x (19620.7 + x (-4405.61 + (432.455 - 15.5408 x) x))
// The approximation is only valid for 0<=x<=6. We reflect to get the left half.
// First we make sure that we are in the approximated region.
int ax = abs(x);
if (((UINT)ax) >= 6 * 65536)
{
return (x > 0) ? 65536 : 0;
}
// Take the abs and use from now on.
int x4 = iMul(-16, ax);
int x4p = x4 + 432;
int x3 = iMul(ax, x4p);
int x3p = x3 - 4406;
int x2 = iMul(ax, x3p);
int x2p = x2 + 19621;
int x1 = iMul(ax, x2p);
// Now that we have the left half approximation, reflect to the right half.
int iRet = 32768 + ((x >= 0) ? (+x1) : (-x1));
return iRet;
}
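Hypothetical spot checks (all values in 16.16 fixed point):
int iHalf = isigmoid(0);            // 32768, i.e. 0.5
int iOne  = isigmoid(7 * 65536);    // 65536, i.e. 1.0 (saturated)
int iZero = isigmoid(-7 * 65536);   // 0, i.e. 0.0 (saturated)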
This translates easily into SSE code:
// requires SSSE3 and SSE41.
__m128i _mm_fixedSigmoid(const __m128i &x)
{
__m128i mAbsXu = _mm_abs_epi32(x);
__m128i mApproxInvalid = _mm_cmpGTE_epu32_SSE41(mAbsXu, _mm_set1_epi32(6*65536));
const __m128i mPolynomialCoefficients = _mm_setr_epi32(-16, 432, -4406, 19621);
__m128i m = _mm_fixedMul_epi32_SSE41(_mm_shuffle_epi32(mPolynomialCoefficients,_MM_SHUFFLE(0,0,0,0)), mAbsXu);
__m128i mp = _mm_add_epi32(m, _mm_shuffle_epi32(mPolynomialCoefficients, _MM_SHUFFLE(1,1,1,1)));
m = _mm_fixedMul_epi32_SSE41(mp, mAbsXu);
mp = _mm_add_epi32(m, _mm_shuffle_epi32(mPolynomialCoefficients, _MM_SHUFFLE(2,2,2,2)));
m = _mm_fixedMul_epi32_SSE41(mp, mAbsXu);
mp = _mm_add_epi32(m, _mm_shuffle_epi32(mPolynomialCoefficients, _MM_SHUFFLE(3,3,3,3)));
m = _mm_fixedMul_epi32_SSE41(mp, mAbsXu);
__m128i mClamped = _mm_blendv_epi8(m, _mm_set1_epi32(32768), mApproxInvalid);
__m128i mResult = _mm_add_epi32(_mm_set1_epi32(32768),_mm_sign_epi32(mClamped,x));
return mResult;
}
This makes me yearn for floats.
For completeness, here's the one-lane unsigned ANSI fixed point multiply, and the four-lane unsigned SSE fixed point multiply. Not much different:
__forceinline unsigned int uMul(const unsigned int a, const unsigned int b)
{
unsigned int uiRet = (unsigned int)(UInt32x32To64(a, b) >> 16);
return uiRet;
}
__forceinline __m128i _mm_fixedMul_epu32_SSE41(const __m128i &x, const __m128i &y)
{
__m128i mxy02i = _mm_mul_epu32(x, y);
__m128i mxs = _mm_srli_si128(x, 4);
__m128i mys = _mm_srli_si128(y, 4);
__m128i mxy13i = _mm_mul_epu32(mxs, mys);
__m128i mxy02i_s = _mm_srli_si128(mxy02i, 2);
__m128i mxy13i_s = _mm_slli_si128(mxy13i, 2);
__m128i mxy0123 = _mm_blend_epi16(mxy02i_s, mxy13i_s, 0xCC);
return mxy0123;
}
Wednesday, December 21, 2016
let's qsort - sort comparison functions using SSE
C has had a built-in sort for decades. This mostly takes the form of qsort, which traditionally takes a comparison function operating on void pointers.
The following is a very simple wrapper that makes this much safer. Note that since it contains only the one call, the compiler will happily inline it, so the wrapper adds zero cost.
typedef int(__cdecl *fnPtFuncCompareType)(const void *arg1, const void *arg2);
template <typename T>
inline void qsort_typesafe(
__inout_ecount(num) T *base,
const size_t num,
int(__cdecl *fnCompare)(const T *, const T *)
)
{
::qsort(base, num, sizeof(T), (fnPtFuncCompareType)fnCompare);
}
So all of the discussed comparison functions will take actual types.
For the simple types this will be as follows, using the formulation from "Hacker's Delight - Second Edition" section "2-9 Three-Valued Compare Function". Note that the second __forceinline will be ignored when the function is used as a parameter for qsort.
// generic comparison function
// returns 1 iff a>b
// returns 0 iff a==b
// returns -1 iff a<b
template<typename T>
__forceinline static int __cdecl compareTwo(T a, T b)
{
int iRet = (a>b) - (a<b);
return iRet;
}
// generic comparison function for use in qsort
template<typename T>
__forceinline static int __cdecl compareTwoPtr(const T *a, const T *b)
{
return compareTwo(*a, *b);
}
And this works wonderfully for the standard signed and unsigned types. For the built-in numerical types this gives correct results without branches, and doesn't have the underflow/overflow problems of the subtraction trick.
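Hypothetical usage with a plain int array:
int aiValues[] = { 5, -3, 12, 0 };
qsort_typesafe(aiValues, ARRAYSIZE(aiValues), compareTwoPtr<int>);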
But I seem to often end up doing comparisons on structures composed of these basic types.
We want to sort first by one field, then another, and so on...
typedef struct
{
int rate;
int duration;
} TwoIntStructure;
C_ASSERT(8 == sizeof(TwoIntStructure));
typedef struct
{
int rate;
int duration;
int torque;
int growth;
} FourIntStructure;
C_ASSERT(16 == sizeof(FourIntStructure));
int __cdecl compareTwo_TwoIntStructure(const TwoIntStructure *pa, const TwoIntStructure *pb)
{
__m128i ma = _mm_loadl_epi64((__m128i *)pa);
__m128i mb = _mm_loadl_epi64((__m128i *)pb);
int intRet = _mm_movemask_epi8(_mm_cmpgt_epi32(ma, mb)) - _mm_movemask_epi8(_mm_cmpgt_epi32(mb, ma));
return intRet;
}
int __cdecl compareTwo_FourIntStructure(const FourIntStructure *pa, const FourIntStructure *pb)
{
__m128i ma = _mm_loadu_si128((__m128i *)pa);
__m128i mb = _mm_loadu_si128((__m128i *)pb);
int intRet = _mm_movemask_epi8(_mm_cmpgt_epi32(ma, mb)) - _mm_movemask_epi8(_mm_cmpgt_epi32(mb, ma));
return intRet;
}
This gives good results when called with something like:
qsort_typesafe(ax, ARRAYSIZE(ax), compareTwo_TwoIntStructure);
qsort_typesafe(ay, ARRAYSIZE(ay), compareTwo_FourIntStructure);
When we are using C++, this changes slightly:
// Ascending sorting function, last struct member is primary sort key
struct SFourIntStructureSort
{
bool operator()(const FourIntStructure* pa, const FourIntStructure* pb)
{
__m128i ma = _mm_loadu_si128(reinterpret_cast<const __m128i *>(pa));
__m128i mb = _mm_loadu_si128(reinterpret_cast<const __m128i *>(pb));
bool bRet = _mm_movemask_epi8(_mm_cmpgt_epi32(ma, mb)) < _mm_movemask_epi8(_mm_cmpgt_epi32(mb, ma)); // strict less-than: std::sort requires a strict weak ordering
return bRet;
}
};
with a call like:
std::vector<FourIntStructure*> vecFourIntStructure;
std::sort(vecFourIntStructure.begin(), vecFourIntStructure.end(), SFourIntStructureSort());
Do note that as written the primary sort key is the last of the compared elements in the structs.
Tuesday, December 20, 2016
A sorting network for eight 32-bit numbers in SSE4.1
One of the fundamental operations in computer science is sorting. And we are all familiar with a bunch of the algorithms like quicksort. It's really hard to do a quicksort with SSE. But some other sorting algorithms do work.
I see a lot of people talking about sorting networks using SSE, some pseudo-code, and some diagrams, but I have yet to see any code samples online for this. And the devil is always in the details.
Let's start with sorting 4 elements, taking advantage of _mm_max_epi32, though the code would work equally well with floats, needing only SSE2, using the _mm_max_ps group of instructions.
Like complex multiplication, performing two short sorts at the same time isn't much more work than sorting one.
The sorting network for four elements is three levels (each level is one set of min/max). The sorting network for eight elements is six levels deep.
void ShortSort_SSE41(const __m128i &a, const __m128i &b, __m128i &aOut, __m128i &bOut)
{
__m128i ab_lo = _mm_unpacklo_epi64(a, b);
__m128i ab_hi = _mm_unpackhi_epi64(a, b);
__m128i ab1_min = _mm_min_epi32(ab_lo, ab_hi);
__m128i ab1_max = _mm_max_epi32(ab_lo, ab_hi);
__m128i a1 = ab1_min;
__m128i b1 = _mm_shuffle_epi32(ab1_max, _MM_SHUFFLE(2, 3, 0, 1));
__m128i ab2_min = _mm_min_epi32(a1, b1);
__m128i ab2_max = _mm_max_epi32(a1, b1);
__m128i a2 = _mm_unpacklo_epi32(ab2_min, ab2_max);
__m128i b2 = _mm_unpackhi_epi32(ab2_min, ab2_max);
__m128i ab3_lo = _mm_unpacklo_epi64(a2, b2);
__m128i ab3_hi = _mm_unpackhi_epi64(a2, b2);
__m128i ab3_min = _mm_min_epi32(ab3_lo, ab3_hi);
__m128i ab3_max = _mm_max_epi32(ab3_lo, ab3_hi);
aOut = _mm_unpacklo_epi32(ab3_min, ab3_max);
bOut = _mm_unpackhi_epi32(ab3_min, ab3_max);
}
Like complex number multiplication, it is about the same amount of work to do two at the same time. So the above ShortSort sorts two separate four element arrays into increasing monotonic order.
Note that this code will work passing the same variable as both input and output for each of the lists.
__forceinline void SortLevel_SSE41(const __m128i &a, const __m128i &b, __m128i &ab_left, __m128i &ab_right)
{
__m128i ab1_min = _mm_min_epi32(a, b);
__m128i ab1_max = _mm_max_epi32(a, b);
ab_left = _mm_unpacklo_epi32(ab1_min, ab1_max);
ab_right = _mm_unpackhi_epi32(ab1_min, ab1_max);
}
__forceinline __m128i Reverse_epi32_SSE2(const __m128i &a)
{
return _mm_shuffle_epi32(a, _MM_SHUFFLE(0, 1, 2, 3));
}
void Merge_SSE41(const __m128i &a, const __m128i &b, __m128i &ab_left, __m128i &ab_right)
{
__m128i ab1_left, ab1_right;
__m128i ab2_left, ab2_right;
SortLevel_SSE41(a, Reverse_epi32_SSE2(b), ab1_left, ab1_right);
SortLevel_SSE41(ab1_left, ab1_right, ab2_left, ab2_right);
SortLevel_SSE41(ab2_left, ab2_right, ab_left, ab_right);
}
This takes two monotonic lists as input, and returns the sorted merged list in the two output variables.
As before, the same variable could be used for both input and output.
This is largely based on the network in the standard computer science textbook "Introduction to Algorithms" by Cormen, Leiserson, and Rivest. This does diverge from the text in the final level. The same level being used three times is just too good not to use. The code is tested using the 0/1 method described in the textbook.
A call to these would be something like:
int a[8] = { ... };
__m128i aa, bb;
aa = _mm_loadu_si128((const __m128i *)&a[0]);
bb = _mm_loadu_si128((const __m128i *)&a[4]);
ShortSort_SSE41(aa, bb, aa, bb);
Merge_SSE41(aa, bb, aa, bb);
_mm_storeu_si128((__m128i *)&a[0], aa);
_mm_storeu_si128((__m128i *)&a[4], bb);
It's not a general sorting solution, but it seems like it could be useful.
Monday, December 19, 2016
Yet another dot product - 32bit*32bit = 64bit signed
Multiplying together two 32bit numbers gives a 64bit result. SSE2 supports an unsigned 32bit multiply yielding a 64bit unsigned result. But this isn't what we normally need for dot products.
The instruction _mm_mul_epi32 (a signed 32bit multiply yielding a 64bit signed result) was not supported until SSE41. While today SSE41 is pretty standard, this may not be the case on virtual machines. So there is value in understanding how to do this with only SSE2.
We can generate the signed 64bit multiply result from the unsigned 64bit multiply result. The procedure is detailed in Hacker's Delight 2nd edition in the chapter "Multiplication" in section "8-3 High-Order Product Signed from/to Unsigned".
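A scalar sketch of that correction (my own illustration): treat x and y as unsigned, form the unsigned 64-bit product, then subtract y from the high 32 bits when x is negative and x from the high 32 bits when y is negative.
__int64 mulSigned_fromUnsigned(const int x, const int y)
{
    unsigned __int64 uProduct = (unsigned __int64)(unsigned int)x * (unsigned int)y;
    unsigned int t1 = (x < 0) ? (unsigned int)y : 0;
    unsigned int t2 = (y < 0) ? (unsigned int)x : 0;
    return (__int64)(uProduct - (((unsigned __int64)(t1 + t2)) << 32));
}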
Overall the four signed 32bit*32bit=64bit multiplies can be performed as follows:
__forceinline void _mm_mul_epi32_SSE2(const __m128i &x, const __m128i &y, __m128i &mXY02, __m128i &mXY13)
{
__m128i mxy02u = _mm_mul_epu32(x, y);
__m128i mxy13u = _mm_mul_epu32(_mm_srli_epi64(x,32), _mm_srli_epi64(y,32));
__m128i mt1 = _mm_and_si128(y, _mm_srai_epi32(x, 31));
__m128i mt2 = _mm_and_si128(x, _mm_srai_epi32(y, 31));
__m128i mt1Pt2 = _mm_add_epi32(mt1, mt2);
__m128i mCorrections = _mm_shuffle_epi32(mt1Pt2, _MM_SHUFFLE(3, 1, 2, 0)); // pair {t0,t2} with the lane 0/2 products and {t1,t3} with the lane 1/3 products
mXY02 = _mm_sub_epi32(mxy02u, _mm_unpacklo_epi32(_mm_setzero_si128(), mCorrections));
mXY13 = _mm_sub_epi32(mxy13u, _mm_unpackhi_epi32(_mm_setzero_si128(), mCorrections));
}
Please do note that the results end up in a non-standard order that doesn't matter for the dot product:
The first __m128i returns
__int64 xy00Int = ((__int64)x.m128i_i32[0])*((__int64)y.m128i_i32[0]);
__int64 xy22Int = ((__int64)x.m128i_i32[2])*((__int64)y.m128i_i32[2]);
while the second __m128i returns
__int64 xy11Int = ((__int64)x.m128i_i32[1])*((__int64)y.m128i_i32[1]);
__int64 xy33Int = ((__int64)x.m128i_i32[3])*((__int64)y.m128i_i32[3]);
This is easy to correct, but since we are just going to be summing, we don't bother.
__forceinline __int64 dotProduct_int64result_Ansi(const int *pA, const int *pB, UINT cElements)
{
__int64 iRet = 0;
UINT cElementsRemaining = cElements;
while (cElementsRemaining > 0)
{
cElementsRemaining--;
iRet += ((__int64)pA[cElementsRemaining]) * ((__int64)pB[cElementsRemaining]);
}
return iRet;
}
__int64 dotProduct_int64result_SSE2(const signed int *pA, const signed int *pB, UINT cElements)
{
UINT cElements_endOfFour = cElements&~3;
__int64 iRet = dotProduct_int64result_Ansi(pA + cElements_endOfFour, pB + cElements_endOfFour, cElements & 3);
UINT_PTR cElementsRemaining = cElements_endOfFour;
if (cElementsRemaining > 0)
{
__m128i mSummed = _mm_setzero_si128();
do
{
cElementsRemaining -= 4;
__m128i mA = _mm_loadu_si128((__m128i const*)&pA[cElementsRemaining]);
__m128i mB = _mm_loadu_si128((__m128i const*)&pB[cElementsRemaining]);
__m128i mulLeft, mulRight;
_mm_mul_epi32_SSE2(mA, mB, mulLeft, mulRight);
mSummed = _mm_add_epi64(mSummed, mulLeft);
mSummed = _mm_add_epi64(mSummed, mulRight);
} while (cElementsRemaining > 0);
iRet += horizontalSum_epi64_SSE2(mSummed);
}
return iRet;
}
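The horizontalSum_epi64_SSE2 helper used above is not defined in this post. One possible SSE2 implementation (my own sketch): add the high 64-bit lane to the low one and store the result.
__forceinline __int64 horizontalSum_epi64_SSE2(const __m128i &m)
{
    __m128i mHi = _mm_srli_si128(m, 8);        // move the upper 64-bit lane down
    __m128i mSum = _mm_add_epi64(m, mHi);      // lane 0 now holds the total
    __int64 iRet;
    _mm_storel_epi64((__m128i *)&iRet, mSum);  // store the low 64 bits
    return iRet;
}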
It's OK, but on SSE2 it's only marginally better than the ANSI version. It is instructive, though.
Saturday, December 17, 2016
Still more dot product goodness - unsigned short
Unfortunately there isn't an unsigned equivalent of the _mm_madd_epi16 instruction. The unsigned multiply has to be constructed out of two multiplies: one that produces the low 16 bits of each product, and one that produces the high 16 bits. The instruction for the low 16 bits is in fact shared with the signed version, since the low 16 bits of the result are the same for signed and unsigned 16bit*16bit=32bit multiplies.
__forceinline unsigned int dotProduct_Ansi(const unsigned short *pA, const unsigned short *pB, UINT cElements)
{
unsigned int iRet = 0;
UINT cElementsRemaining = cElements;
while (cElementsRemaining > 0)
{
cElementsRemaining--;
iRet += ((unsigned int)pA[cElementsRemaining]) * ((unsigned int)pB[cElementsRemaining]);
}
return iRet;
}
unsigned int dotProduct_SSE2(const unsigned short *pA, const unsigned short *pB, UINT cElements)
{
UINT cElements_endOfEight = cElements&~7;
unsigned int iRet = dotProduct_Ansi(pA + cElements_endOfEight, pB + cElements_endOfEight, cElements & 7);
UINT cElementsRemaining = cElements_endOfEight;
if (cElementsRemaining > 0)
{
__m128i mSummedRight = _mm_setzero_si128();
__m128i mSummedLeft = _mm_setzero_si128();
do
{
cElementsRemaining -= 8; // each 128-bit load covers eight 16-bit elements
__m128i mA = _mm_loadu_si128((__m128i const*)&pA[cElementsRemaining]);
__m128i mB = _mm_loadu_si128((__m128i const*)&pB[cElementsRemaining]);
__m128i mulHi = _mm_mulhi_epu16(mA, mB);
__m128i mulLo = _mm_mullo_epi16(mA, mB);
__m128i mulLeft = _mm_unpacklo_epi16(mulLo, mulHi);
__m128i mulRight = _mm_unpackhi_epi16(mulLo, mulHi);
mSummedLeft = _mm_add_epi32(mSummedLeft, mulLeft);
mSummedRight = _mm_add_epi32(mSummedRight, mulRight);
} while (cElementsRemaining > 0);
iRet += horizontalSum_SSE2(_mm_add_epi32(mSummedLeft, mSummedRight));
}
return iRet;
}
Keeping separate sums for the left and right halves shortens the dependency chain on the accumulator, allowing increased instruction pipelining.
Friday, December 16, 2016
another dot product - signed short
There's another dot product... yes, yet another dot product. It's almost as though the instruction set was designed around them.
The integer multiply instructions in SSE2 are a little sparse. A full four-lane 32bit*32bit=32bit multiply can only be constructed from multiple _mm_mul_epu32 instructions. It wasn't until SSE4.1 added the instruction _mm_mullo_epi32 that this seemingly fundamental operation was directly supported.
A single 32bit*32bit=64bit multiply can be composed of four 16bit*16bit=32bit multiplies, and _mm_mul_epu32 performs two of these 64-bit multiplies. The equivalent of those eight 16-bit multiplies is exposed in a couple of different ways.
The _mm_madd_epi16 instruction seems directly designed for dot product. It is documented as "Multiply packed signed 16-bit integers in a and b, producing intermediate signed 32-bit integers. Horizontally add adjacent pairs of intermediate 32-bit integers, and pack the results in dst."
This requires internally eight 16bit*16bit = 32bit multiplies.
The code for a dot product using this instruction is looking familiar:
int dotProduct_Ansi(const signed short *pA, const signed short *pB, UINT cElements)
{
int iRet = 0;
UINT cElementsRemaining = cElements;
while (cElementsRemaining > 0)
{
cElementsRemaining--;
iRet += ((int)pA[cElementsRemaining]) * ((int)pB[cElementsRemaining]);
}
return iRet;
}
__forceinline int horizontalSum_SSE2(const __m128i &mABCD)
{
__m128i mCD = _mm_srli_si128(mABCD, 8);
__m128i mApCBpD = _mm_add_epi32(mABCD, mCD);
__m128i mBpD = _mm_srli_si128(mApCBpD, 4);
__m128i mApBpCpD = _mm_add_epi32(mApCBpD, mBpD);
return _mm_cvtsi128_si32(mApBpCpD);
}
int dotProduct_SSE2(const signed short *pA, const signed short *pB, UINT cElements)
{
UINT cElements_endOfEight = cElements&~7;
int iRet = dotProduct_Ansi(pA + cElements_endOfEight, pB + cElements_endOfEight, cElements & 7);
UINT cElementsRemaining = cElements_endOfEight;
if (cElementsRemaining > 0)
{
__m128i mSummedAB = _mm_setzero_si128();
do
{
cElementsRemaining -= 8;
__m128i mA = _mm_loadu_si128((__m128i const*)&pA[cElementsRemaining]);
__m128i mB = _mm_loadu_si128((__m128i const*)&pB[cElementsRemaining]);
mSummedAB = _mm_add_epi32(mSummedAB, _mm_madd_epi16(mA, mB));
} while (cElementsRemaining > 0);
iRet += horizontalSum_SSE2(mSummedAB);
}
return iRet;
}
16-bit dot products come up frequently with artificial neural networks.
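As a hedged usage sketch (the function name, bias handling, and fixed-point layout are my assumptions, not from the original post), a fixed-point neuron's pre-activation is just such a dot product plus a bias:
__forceinline int neuronPreActivation(const signed short *pWeights, const signed short *pActivations, UINT cInputs, int iBias)
{
// weighted sum of the activations, then the bias term
return dotProduct_SSE2(pWeights, pActivations, cInputs) + iBias;
}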
Thursday, December 15, 2016
Multiplying two complex numbers by two other complex numbers
Multiplying complex numbers together has the problem that the necessary negation is complicated in the SSE2 instruction set.
One thing that happens often with SSE2 programming is that sometimes it's better to make the problem more complicated. In this case it is often better to multiply two complex numbers by two other complex numbers at the same time.
__forceinline __m128 _mm_complexMultiplyTwo(const __m128 &ar_ai_cr_ci, const __m128 &br_bi_dr_di)
{
// this is a complex multiply that complex multiplies two sets of complex numbers at the same time.
// this would be lots easier with the SSE3 instructions.
// The number 0x80000000 reinterpreted as a float is negative zero.
// xor'ing against this constant negates elements 0 and 2.
__declspec(align(16)) static const UINT32 neg02[4] = { 0x80000000, 0, 0x80000000 ,0 };
__m128 mneg02 = _mm_load_ps(reinterpret_cast<const float*>(&neg02[0]));
// if we call the input vectors (ar, ai, cr, ci) and (br, bi, dr, di)
// we are trying to get (ar*br - ai*bi, ai*br + ar*bi, cr*dr - ci*di, ci*dr + cr*di)
// so we need four multiplies
__m128 ar_ar_cr_cr = _mm_shuffle_ps(ar_ai_cr_ci, ar_ai_cr_ci, _MM_SHUFFLE(2, 2, 0, 0));
__m128 ai_ai_ci_ci = _mm_shuffle_ps(ar_ai_cr_ci, ar_ai_cr_ci, _MM_SHUFFLE(3, 3, 1, 1));
__m128 bi_br_di_dr = _mm_shuffle_ps(br_bi_dr_di, br_bi_dr_di, _MM_SHUFFLE(2, 3, 0, 1));
__m128 arbr_arbi_crdr_crdi = _mm_mul_ps(ar_ar_cr_cr, br_bi_dr_di);
__m128 aibi_aibr_cidi_cidr = _mm_mul_ps(ai_ai_ci_ci, bi_br_di_dr);
__m128 Naibi_aibr_Ncidi_cidr = _mm_xor_ps(aibi_aibr_cidi_cidr, mneg02);
__m128 arbrMaibi_arbiPaibr_crdrMcidi_crciPcidr = _mm_add_ps(arbr_arbi_crdr_crdi, Naibi_aibr_Ncidi_cidr);
return arbrMaibi_arbiPaibr_crdrMcidi_crciPcidr;
}
This compiles to:
movaps xmm0,xmm3
movaps xmm1,xmm2
shufps xmm0,xmm3,0B1h
shufps xmm1,xmm2,0F5h
mulps xmm1,xmm0
shufps xmm2,xmm2,0A0h
mulps xmm2,xmm3
xorps xmm1,xmmword ptr [`_mm_complexMultiplyTwo'::`2'::neg1 (010D2100h)]
addps xmm1,xmm2
This looks pretty good. In particular it uses very few registers. The one problem is the load from a constant. But most of the time we are performing this in a loop, and the compiler is good enough to factor that load out of the loop for us into one of the other registers.
It takes two moves, three shuffles, two multiplies, an add, and an xor against a memory read. That is really only one additional multiply compared to the previously discussed multiplication of a single pair of complex numbers.
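As a hedged usage sketch (the function name, the interleaved storage layout, and the even-count assumption are mine, not from the original post), the common pattern is to stream two complex products per iteration over interleaved arrays:
void complexMultiplyArrays(const float *pA, const float *pB, float *pOut, UINT cComplex)
{
// assumes interleaved (re, im) storage and an even cComplex
for (UINT i = 0; i + 2 <= cComplex; i += 2)
{
__m128 mA = _mm_loadu_ps(&pA[2 * i]); // (ar, ai, cr, ci)
__m128 mB = _mm_loadu_ps(&pB[2 * i]); // (br, bi, dr, di)
_mm_storeu_ps(&pOut[2 * i], _mm_complexMultiplyTwo(mA, mB));
}
}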
complex multiplies using SSE2
The _mm_mul_ps instruction allows us to multiply two sets of four numbers in parallel with a single instruction. I remember something about something similar from a high-school math class.
It takes four real multiplies to perform a single complex multiply. Let's figure out how to make this work more.
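For reference, writing A = ar + i*ai and B = br + i*bi, the product being computed is
A*B = (ar*br - ai*bi) + i*(ar*bi + ai*br)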
The problem we have is that horrible minus sign on the right hand side of the equals. Until SSE3 there wasn't an add instruction that did both add and subtract.
And we need to get all of the results lined up at the end. For the purpose of discussion I'm going to provide a solution:
__m128 _mm_complexMultiply(__m128 ari, __m128 bri)
{
static const UINT uiNegativeZero = 0x80000000;
__m128 Ar_Ai_Ai_Ar = _mm_shuffle_ps(ari, ari, _MM_SHUFFLE(0, 1, 1, 0));
__m128 z_z_N_z= _mm_setr_ps(0,0,*(reinterpret_cast<const float *>(&uiNegativeZero)),0);
__m128 Ar_Ai_NAi_Ar = _mm_xor_ps(Ar_Ai_Ai_Ar, z_z_N_z);
__m128 Br_Br_Bi_Bi = _mm_shuffle_ps(bri, bri, _MM_SHUFFLE(1, 1, 0, 0));
__m128 mMul = _mm_mul_ps(Br_Br_Bi_Bi, Ar_Ai_NAi_Ar);
__m128 ArBrmAiBi_AiBrpArBi_x_x= _mm_add_ps(mMul, _mm_movehl_ps(Br_Br_Bi_Bi, mMul));
return ArBrmAiBi_AiBrpArBi_x_x;
}
Looks pretty good once compiled:
00DC1176 movq xmm0,mmword ptr [edx]
00DC117A movq xmm1,mmword ptr [ecx]
00DC117E mov eax,dword ptr [vri]
00DC1181 shufps xmm0,xmm0,14h
00DC1185 xorps xmm0,xmmword ptr [__xmm@00000000800000000000000000000000 (0DC2110h)]
00DC118C shufps xmm1,xmm1,50h
00DC1190 mulps xmm0,xmm1
00DC1193 movhlps xmm1,xmm0
00DC1196 addps xmm1,xmm0
00DC1199 movq mmword ptr [eax],xmm1
Yup - that still looks okay.
The above has two shuffles but is designed to minimize the number of operations after the multiply. It seems like there should be a way of reducing the latency leading up to that point.
While there seem to be ways of reducing latency for a single multiply, those ways seem to preclude better things. I ask you to trust me on this.
Okay news: The above does leave meaningful garbage in the upper half of the XMM register. But that's easy to zero out or ignore during the copy to memory.
Good news: The above gets it down to two shuffles, one xor, one multiply, one movehl, and one add. But much of the time we will be multiplying a single vector against a large number of other runtime vectors, as is the case with the FFT implementation described in the book "Numerical Recipes in C" (some of this is described in an Intel blog post that I remember but can no longer find despite deep Google searches). This allows an easy optimization: compute the shuffled vector for A once and use it many times with different B vectors.
Okay news: This is pretty efficient. The place where we can improve is in that horrible negative. While yes, a complex multiplication could be performed in three multiplies, the cost of additions, subtractions, and shuffles with the SSE2 instruction set would overwhelm this strategy with a register width of 128bits. We are pretty much never going to do better than one 4 element multiply per complex multiplication.
Good news: Within the SSE2 instruction set we can however spread the overhead cost in a couple ways. One way is to factor the A shuffle and xor out, by multiplying a single complex number against a multitude of other complex numbers.
Good news if we can make it work: Another way to address this problem is to spread the negation over several multiplications. There are already enough shuffles floating around to make it possible to negate several complex multiplications in a single xor.
We do lose a cycle for each transition between float and int. The observant reader will note three of these in the above code. But I think it's worth it.
We will address performing multiple complex multiplications concurrently in a future post.
Yup. This is a cliffhanger. I'm betting that it is worth a couple machine cycles to read the followup. :)
Tuesday, December 13, 2016
descriptive statistics using SSE2 : higher moments
There are a bunch of descriptive statistics to describe a given set of numbers. The most commonly used are the average and standard deviation. There are also more esoteric statistics like skewness and kurtosis.
The book "Numerical Recipes in C" is a common source for routines to calculate these statistics. This can be adapted to take advantage of SSE2.
void moment_SSE2(const float * const pA, const UINT n, float *outAve, float *outAdev, float *outSdev, float *outVar, float *outSkew, float *outCurt)
{
float ave = average_SSE2(pA, n);
*outAve = ave;
*outAdev = 0;
*outSdev = 0;
*outVar = 0;
*outSkew = 0;
*outCurt = 0;
if (n <= 1)
{
return;
}
__m128 mmAve = _mm_set1_ps(ave);
__m128 mmAbs = _mm_castsi128_ps(_mm_set1_epi32(0x7fffffff));
__m128 mmAdev = _mm_setzero_ps();
__m128 mmSdev= _mm_setzero_ps();
__m128 mmVar = _mm_setzero_ps();
__m128 mmSkew = _mm_setzero_ps();
__m128 mmCurt = _mm_setzero_ps();
__m128 mmEp= _mm_setzero_ps();
UINT nRemaining = n;
__m128 mmS;
if (nRemaining & 3)
{
__m128i mmDataMask = _mm_setr_epi32((nRemaining & 1)?-1:0, 0, (nRemaining & 2) ? -1 : 0, (nRemaining & 2) ? -1 : 0);
__m128 mmData = _mm_setzero_ps();
if (nRemaining & 1)
{
nRemaining--;
mmData = _mm_load_ss(&pA[nRemaining]);
}
if (nRemaining & 2)
{
nRemaining -= 2;
mmData = _mm_loadh_pi(mmData, (const __m64*)&pA[nRemaining]);
}
mmS = _mm_and_ps(_mm_castsi128_ps( mmDataMask), _mm_sub_ps(mmData, mmAve));
}
else
{
nRemaining -= 4;
__m128 mmData = _mm_loadu_ps(&pA[nRemaining]);
mmS = _mm_sub_ps(mmData, mmAve);
}
while (true)
{
__m128 mmAbsS = _mm_and_ps(mmS, mmAbs);
mmAdev = _mm_add_ps(mmAdev, mmAbsS);
mmEp = _mm_add_ps(mmEp, mmS);
__m128 mmSS = _mm_mul_ps(mmS, mmS);
mmVar = _mm_add_ps(mmVar, mmSS);
mmSkew = _mm_add_ps(mmSkew, _mm_mul_ps(mmSS, mmS));
mmCurt = _mm_add_ps(mmCurt, _mm_mul_ps(mmSS, mmSS));
if (nRemaining <= 0)
{
break;
}
nRemaining -= 4;
__m128 mmData = _mm_loadu_ps(&pA[nRemaining]);
mmS = _mm_sub_ps(mmData, mmAve);
}
*outAdev = horizontalSum_SSE2(mmAdev) / n;
float ep = horizontalSum_SSE2(mmEp);
float var = (horizontalSum_SSE2(mmVar) - ep*ep / n) / (n - 1);
*outVar = var;
float sdev = sqrtf(var);
*outSdev = sdev;
if (var)
{
*outSkew = horizontalSum_SSE2(mmSkew) / (n*var*sdev);
*outCurt = (horizontalSum_SSE2(mmCurt) / (n*var*var)) - 3.0f;
}
}
This builds on helper functions from previous posts.
The central loop body looks a bit strange. The loop condition has been moved inside the "while loop". This allows use of the same loop body for the first iteration that may not contain a full four elements. A mask is created which zeros the missing elements of this first iteration. These zero elements don't change the resulting sums.
This seems useful.
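A hedged usage sketch (the sample data and variable names are illustrative, not from the original post):
float data[7] = { 1.0f, 4.0f, 2.0f, 8.0f, 5.0f, 7.0f, 3.0f };
float ave, adev, sdev, var, skew, curt;
moment_SSE2(data, 7, &ave, &adev, &sdev, &var, &skew, &curt);
printf("ave=%f adev=%f sdev=%f var=%f skew=%f kurtosis=%f\n", ave, adev, sdev, var, skew, curt);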
The book "Numerical Recipes in C" is a common source for routines to calculate these statistics. This can be adapted to take advantage of SSE2.
void moment_SSE2(const float * const pA, const UINT n, float *outAve, float *outAdev, float *outSdev, float *outVar, float *outSkew, float *outCurt)
{
float ave = average_SSE2(pA, n);
*outAve = ave;
*outAdev = 0;
*outSdev = 0;
*outVar = 0;
*outSkew = 0;
*outCurt = 0;
if (n <= 1)
{
return;
}
__m128 mmAve = _mm_set1_ps(ave);
__m128 mmAbs = _mm_castsi128_ps(_mm_set1_epi32(0x7fffffff));
__m128 mmAdev = _mm_setzero_ps();
__m128 mmSdev= _mm_setzero_ps();
__m128 mmVar = _mm_setzero_ps();
__m128 mmSkew = _mm_setzero_ps();
__m128 mmCurt = _mm_setzero_ps();
__m128 mmEp= _mm_setzero_ps();
UINT nRemaining = n;
__m128 mmS;
if (nRemaining & 3)
{
__m128i mmDataMask = _mm_setr_epi32((nRemaining & 1)?-1:0, 0, (nRemaining & 2) ? -1 : 0, (nRemaining & 2) ? -1 : 0);
__m128 mmData = _mm_setzero_ps();
if (nRemaining & 1)
{
nRemaining--;
mmData = _mm_load_ss(&pA[nRemaining]);
}
if (nRemaining & 2)
{
nRemaining -= 2;
mmData = _mm_loadh_pi(mmData, (const __m64*)&pA[nRemaining]);
}
mmS = _mm_and_ps(_mm_castsi128_ps( mmDataMask), _mm_sub_ps(mmData, mmAve));
}
else
{
nRemaining -= 4;
__m128 mmData = _mm_loadu_ps(&pA[nRemaining]);
mmS = _mm_sub_ps(mmData, mmAve);
}
while (true)
{
__m128 mmAbsS = _mm_and_ps(mmS, mmAbs);
mmAdev = _mm_add_ps(mmAdev, mmAbsS);
mmEp = _mm_add_ps(mmEp, mmS);
__m128 mmSS = _mm_mul_ps(mmS, mmS);
mmVar = _mm_add_ps(mmVar, mmSS);
mmSkew = _mm_add_ps(mmSkew, _mm_mul_ps(mmSS, mmS));
mmCurt = _mm_add_ps(mmCurt, _mm_mul_ps(mmSS, mmSS));
if (nRemaining <= 0)
{
break;
}
nRemaining -= 4;
__m128 mmData = _mm_loadu_ps(&pA[nRemaining]);
mmS = _mm_sub_ps(mmData, mmAve);
}
*outAdev = horizontalSum_SSE2(mmAdev) / n;
float ep = horizontalSum_SSE2(mmEp);
float var = (horizontalSum_SSE2(mmVar) - ep*ep / n) / (n - 1);
*outVar = var;
float sdev = sqrtf(var);
*outSdev = sdev;
if (var)
{
*outSkew = horizontalSum_SSE2(mmSkew) / (n*var*sdev);
*outCurt = (horizontalSum_SSE2(mmCurt) / (n*var*var)) - 3.0f;
}
}
This builds on helper functions from previous posts.
The central loop body looks a bit strange. The loop condition has been moved inside the "while loop". This allows use of the same loop body for the first iteration that may not contain a full four elements. A mask is created which zeros the missing elements of this first iteration. These zero elements don't change the resulting sums.
This seems useful.
using SSE to calculate the average of a set of numbers
The average of a group of numbers is one of the fundamental statistical descriptions of data. It's also easy to calculate in Ansi.
The horizontalSum we have seen before.
__forceinline float horizontalSum_SSE2(const __m128 &mABCD)
{
__m128 mCDCD = _mm_movehl_ps(mABCD, mABCD);
__m128 mApCBpD = _mm_add_ps(mABCD, mCDCD);
__m128 mBpD = _mm_shuffle_ps(mApCBpD, mApCBpD, 0x55);
__m128 mApBpCpD = _mm_add_ps(mApCBpD, mBpD);
return _mm_cvtss_f32(mApBpCpD);
}
float average_SSE2(const float * const pA, const UINT uiAOrig)
{
__m128 mSummed = _mm_setzero_ps();
UINT uiA = uiAOrig;
if (uiA & 1)
{
uiA--;
mSummed = _mm_load_ss(&pA[uiA]);
}
if (uiA & 2)
{
uiA -= 2;
mSummed = _mm_loadh_pi(mSummed, (const __m64*)&pA[uiA]);
}
while (uiA > 0)
{
uiA -= 4;
mSummed = _mm_add_ps(mSummed, _mm_loadu_ps(&pA[uiA]));
}
return horizontalSum_SSE2(mSummed)/ uiAOrig;
}
There's not much new here. But I will mention the line
mSummed = _mm_loadh_pi(mSummed, (const __m64*)&pA[uiA]);
Here we load into the other half of the XMM register from the half filled by the first load. It's also sometimes necessary to cast among types for loads and other operations.
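A hedged usage sketch (the sample values are illustrative):
float samples[5] = { 2.0f, 4.0f, 6.0f, 8.0f, 10.0f };
float fAverage = average_SSE2(samples, 5); // 6.0f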
Monday, December 12, 2016
string processing using SSE2 - making sure we don't crash
Much of the time we know the size of the array to be examined. But there is a common case when we don't know the size ahead of time, and that case is strings. String length in C++ is determined at runtime by the location of a terminating null character.
Ordinarily it is very very bad to read before or after a buffer. This can cause crashes. There is a very limited case where we can be guaranteed that a buffer overrun or buffer underrun will not result in a crash : when the buffer overrun or buffer underrun is within the same cache line.
Acquiring the cache line size is not portable between AMD and Intel. The cache line size has generally been 64 bytes in recent years. But on every CPU that supports SSE2, the cache line size is at least 16 bytes. This means that if we are able to access a memory location p, we are able to access the entire 16 byte block starting at (p & ~0xf).
Loading a single byte of memory from a cache line takes the same amount of time as loading all of the bytes from a cache line.
size_t __cdecl strlen_SSE2(const char *szOrig)
{
const char * sz = szOrig;
__m128i mmZero = _mm_setzero_si128();
ULONG_PTR byteOffsetFromAligned = (15 & (ULONG_PTR)sz);
if (byteOffsetFromAligned)
{
ULONG_PTR pAlignedsz = ((ULONG_PTR)sz) - byteOffsetFromAligned;
ULONG iMoveMaskRet;
iMoveMaskRet = _mm_movemask_epi8(_mm_cmpeq_epi8(_mm_load_si128((__m128i *)pAlignedsz), mmZero)) & ((1 << 16) - (1 << byteOffsetFromAligned));
if (iMoveMaskRet)
{
ULONG ulIndexBit;
_BitScanForward(&ulIndexBit, iMoveMaskRet);
const size_t retLength = (ulIndexBit - byteOffsetFromAligned);
return retLength;
}
sz = 16 + (const char *)pAlignedsz;
}
ULONG iMoveMaskRet;
while (!(iMoveMaskRet = _mm_movemask_epi8(_mm_cmpeq_epi8(_mm_load_si128((__m128i *)sz), mmZero))))
{
sz += 16;
}
ULONG ulIndexBit;
_BitScanForward(&ulIndexBit, iMoveMaskRet);
const size_t retLength = (sz - szOrig) + ulIndexBit;
return retLength;
}
If the function is called on a memory location that is not aligned the function backs up to the start of a 16 byte block. In this case we make sure to ignore null characters that happen before the memory location in the calling parameters. This is done by masking, in particular
& ((1 << 16) - (1 << byteOffsetFromAligned))
We are looking for the first null character. The core of this function is the loop
while (!(iMoveMaskRet = _mm_movemask_epi8(_mm_cmpeq_epi8(_mm_load_si128((__m128i *)sz), mmZero))))
{
sz += 16;
}
This loads a 16 byte block from aligned memory. It compares each of the bytes to null. It exits the loop if any of the bytes is null. If none of the 16 bytes are null, it advances to the next aligned 16 byte block.
To find the first byte we are looking for the index of the lowest 1 bit in the result of the _mm_movemask_epi8. This bit manipulation functionality is exposed by the intrinsic _BitScanForward.
The compiled assembly is a very tight central loop:
01001060 movaps xmm0,xmmword ptr [edx]
01001063 pcmpeqb xmm0,xmm1
01001067 pmovmskb eax,xmm0
0100106B test eax,eax
0100106D jne strlen_SSE2+83h (01001083h)
0100106F nop
01001070 movaps xmm0,xmmword ptr [edx+10h]
01001074 add edx,10h
01001077 pcmpeqb xmm0,xmm1
0100107B pmovmskb eax,xmm0
0100107F test eax,eax
01001081 je strlen_SSE2+70h (01001070h)
Overall this routine takes about half the time of the standard strlen. In x64 code it ends up being faster for all but empty strings. For x86 code this routine ends up being faster by the same factor for long strings, but is in fact a little slower for short strings. On x86, the break-even point seems to be around strings 15 characters long.
This is easily adjusted for use with wide character (16bit) strings as a replacement for wcslen.
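As a hedged sketch of that adjustment (the function name is mine, and the unaligned prologue is omitted by assuming a 16-byte-aligned input; a full version would mask two bits per character the same way the strlen prologue masks one bit per byte):
size_t wcslen_SSE2_aligned(const wchar_t *szOrig)
{
const wchar_t *sz = szOrig;
__m128i mmZero = _mm_setzero_si128();
ULONG iMoveMaskRet;
// each aligned 16 byte block holds eight 16-bit characters
while (!(iMoveMaskRet = _mm_movemask_epi8(_mm_cmpeq_epi16(_mm_load_si128((const __m128i *)sz), mmZero))))
{
sz += 8;
}
ULONG ulIndexBit;
_BitScanForward(&ulIndexBit, iMoveMaskRet);
// _mm_movemask_epi8 yields two mask bits per 16-bit character, so halve the bit index to get the character index
return (sz - szOrig) + (ulIndexBit >> 1);
}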
Sunday, December 11, 2016
testing the cpuid bits to determine sse42 support for crc32c
All x86-64 CPUs support the SSE and SSE2 instruction sets. While most recent CPUs also have features like SSE4.2 and the crc32c instruction, writing truly compatible programs still requires testing for CPU support.
The naïve implementation checks for support, stores the result in a variable, and tests the variable before the CPU-dependent instructions. The problem with this naïve approach is that optimizing compilers are allowed to reorder instructions from within the consequent (the "then" block) to before the variable test. So this may cause crashes.
The way that I have seen this dealt with is to encapsulate the different versions of the code into separate functions, all with identical signature. A function pointer is initialized at program start to the most appropriate version of the function.
Here is an implementation that tests for the SSE42 bit before using the crc32c instructions.
unsigned int reverseBits32(unsigned int x)
{
x = (x & 0x55555555) << 1 | (x & 0xAAAAAAAA) >> 1;
x = (x & 0x33333333) << 2 | (x & 0xCCCCCCCC) >> 2;
x = (x & 0x0F0F0F0F) << 4 | (x & 0xF0F0F0F0) >> 4;
x = (x & 0x00FF00FF) << 8 | (x & 0xFF00FF00) >> 8;
x = (x & 0x0000FFFF) << 16 | (x & 0xFFFF0000) >> 16;
return x;
}
unsigned int crc32c_Ansi(const unsigned char *message, const unsigned int messageLength)
{
unsigned int crc = 0xffffffff;
for (unsigned int i = 0; i < messageLength; i++)
{
unsigned int byteMessage = reverseBits32(message[i]);
for (int j = 8; j >0; j--)
{
if (((int)(crc^byteMessage)) < 0)
{
crc = (crc << 1) ^ 0x1edc6f41;
}
else
{
crc = crc << 1;
}
byteMessage = byteMessage << 1;
}
}
return reverseBits32(~crc);
}
unsigned int crc32c_SSE42(const unsigned char *message, const unsigned int messageLength)
{
unsigned int crc = 0xffffffff;
for (unsigned int uiOnChar = 0; uiOnChar < messageLength; uiOnChar++)
{
crc = _mm_crc32_u8(crc, message[uiOnChar]);
}
return ~crc;
}
bool bHasSSE42()
{
int cpuInfo[4];
__cpuidex(cpuInfo, 1, 0);
return (cpuInfo[2] >> 20) & 1;
}
typedef unsigned int (*crc32c_type)(const unsigned char *message, const unsigned int messageLength);
crc32c_type crc32c = (bHasSSE42()? crc32c_SSE42:crc32c_Ansi);
int main()
{
const char szHelloWorld[] = "Hello World!";
printf("%s - 0x%08x", szHelloWorld, crc32c((const unsigned char *)szHelloWorld, (unsigned int)strlen(szHelloWorld)));
return 0;
}
Neither implementation is a particularly efficient implementation of crc32c (the Ansi version would be optimized by using large tables, the SSE42 version would be optimized by sending larger blocks of memory per instruction).
But in this example the function pointer crc32c is initialized to the appropriate version and called.
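As a hedged sketch of the larger-blocks optimization mentioned above (the function name and the memcpy-based unaligned load are my choices, not from the original post; it needs <nmmintrin.h> for the intrinsic and <string.h> for memcpy), the same CRC can be fed four bytes at a time with _mm_crc32_u32 and finished bytewise, producing the same result as the byte-at-a-time loop:
unsigned int crc32c_SSE42_u32(const unsigned char *message, const unsigned int messageLength)
{
unsigned int crc = 0xffffffff;
unsigned int uiOnChar = 0;
for (; uiOnChar + 4 <= messageLength; uiOnChar += 4)
{
unsigned int uiBlock;
memcpy(&uiBlock, &message[uiOnChar], sizeof(uiBlock)); // safe unaligned 4 byte load
crc = _mm_crc32_u32(crc, uiBlock);
}
for (; uiOnChar < messageLength; uiOnChar++)
{
crc = _mm_crc32_u8(crc, message[uiOnChar]);
}
return ~crc;
}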
Finding the index of the maximum value in an array
A bunch of linear algorithms can be sped up with SSE. In general SSE allows operations to happen in groups in the time it would normally take a single operation. But more complicated loop structures generally mean that the SSE version will be only about twice as fast as the Ansi version.
A common problem that comes up is to find the maximum or minimum value in an array.
Here's the Ansi version:
const int * findLargestInt_Ansi(const int * const pA, const UINT cAOrig)
{
if (cAOrig<=0)
{
return nullptr;
}
UINT cARemaining = cAOrig;
cARemaining--;
const int *pBest = pA + cARemaining;
int iBestVal = *pBest;
while (cARemaining > 0)
{
cARemaining--;
int iOnVal = pA[cARemaining];
if (iOnVal > iBestVal)
{
pBest = pA + cARemaining;
iBestVal = iOnVal;
}
}
return pBest;
}
This can be made faster.
We often want to simulate conditionals in SSE. This will generally be implemented with masking, also known as blending. This generally takes the form of a bitwise
Result = A * C + B * !C
(where *, +, and ! stand for bitwise AND, OR, and NOT). In SSE4.1 this is directly supported with the blend instructions. For SSE2 we have to implement it from more basic instructions.
We implement this bitwise blend with
__forceinline __m128i _mm_blend_si128_SSE2(const __m128i &mA, const __m128i &mB, const __m128i &mControl)
{
return _mm_or_si128(_mm_and_si128(mControl, mA), _mm_andnot_si128( mControl, mB));
}
This as a side effect allows us to change specific elements inside of an XMM register.
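For instance (the values and variable names are illustrative, not from the original post), the helper can replace just one lane of a register:
__m128i mOriginal = _mm_setr_epi32(10, 20, 30, 40);
__m128i mNewValue = _mm_set1_epi32(99);
__m128i mLane2Mask = _mm_setr_epi32(0, 0, -1, 0); // all-ones lanes select the first argument
__m128i mResult = _mm_blend_si128_SSE2(mNewValue, mOriginal, mLane2Mask);
// mResult now holds (10, 20, 99, 40)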
This technique is used below:
const int * findLargestInt_SSE2(const int *pA, UINT cAOrig)
{
if (cAOrig >= 4)
{
UINT uiA = cAOrig;
__m128i mBestValues = _mm_loadu_si128((const __m128i *)pA);
__m128i mBestIndexBase = _mm_setzero_si128();
while (uiA > 4)
{
uiA-=4;
__m128i mValues = _mm_loadu_si128((const __m128i *)&pA[uiA]);
__m128i mCompare = _mm_cmpgt_epi32(mValues, mBestValues);
if (_mm_movemask_epi8(mCompare))
{
__m128i mIndexBase = _mm_set1_epi32(uiA);
mBestValues = _mm_blend_si128_SSE2(mValues, mBestValues, mCompare);
mBestIndexBase = _mm_blend_si128_SSE2(mIndexBase, mBestIndexBase, mCompare);
}
}
__m128i mBestIndex = _mm_add_epi32(mBestIndexBase, _mm_setr_epi32(0, 1, 2, 3));
__m128i mValues, mCompare;
mValues = _mm_shuffle_epi32(mBestValues, _MM_SHUFFLE(2,3,2,3));
mCompare = _mm_cmpgt_epi32(mValues, mBestValues);
mBestValues = _mm_blend_si128_SSE2(mValues, mBestValues, mCompare);
mBestIndex = _mm_blend_si128_SSE2(_mm_shuffle_epi32(mBestIndex, _MM_SHUFFLE(2, 3, 2, 3)), mBestIndex, mCompare);
mValues = _mm_shuffle_epi32(mBestValues, _MM_SHUFFLE(1, 1, 1, 1));
mCompare = _mm_cmpgt_epi32(mValues, mBestValues);
// mBestValues = _mm_blend_si128_SSE2(mValues, mBestValues, mCompare);
mBestIndex = _mm_blend_si128_SSE2(_mm_shuffle_epi32(mBestIndex, _MM_SHUFFLE(1,1,1,1)), mBestIndex, mCompare);
return pA + _mm_cvtsi128_si32(mBestIndex);
}
return findLargestInt_Ansi(pA,cAOrig);
}
As we are iterating through the array, we maintain four maximums. Then as a final step the maximum of these four is determined. With the value we store the index of the first of the four elements loaded into the register. The offset from that base is its location in the XMM register.
Once again we have to address the part of the input array that isn't a multiple of the size of the XMM register. In this case we deal with these elements first. This does mean that we will be reading a couple elements twice. But since the maximum operation is idempotent, everything is okay.
This example shows a couple things. The first is how difficult it is to do horizontal operations. In the above example the horizontal operation is finding the largest value in a single register. To perform this we end up comparing the register against itself shifted right two elements, then that result is compared against itself shifted right one element. The bookkeeping is performed so that the index of the maximum value follows with the value.
The calculation of the actual maximum value is commented out in the above code since it is not actually returned from the function.
This example code can trivially be adjusted for other comparison operators on 32bit types including _mm_cmplt_epi32 to find a minimum int, or _mm_cmpgt_ps to find the maximum float, or _mm_cmplt_ps to find the minimum float.
The SSE2 version of the code takes a little under half the time of the Ansi version on longer arrays, and is about the same speed even for small arrays.
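As a hedged sketch of the float adjustment mentioned above (the helper name is mine), the blend helper has a direct floating-point counterpart; the results of _mm_cmpgt_ps or _mm_cmplt_ps feed its control mask exactly as _mm_cmpgt_epi32 does in the integer version:
__forceinline __m128 _mm_blend_ps_SSE2(const __m128 &mA, const __m128 &mB, const __m128 &mControl)
{
return _mm_or_ps(_mm_and_ps(mControl, mA), _mm_andnot_ps(mControl, mB));
}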
Saturday, December 10, 2016
another dot product - double
In practice, use of double precision floats frequently does not make sense. Double precision is normally more memory and processing than is actually needed. It is absolutely necessary for chaotic systems and at very tiny scales. But more often double precision is used to paper over numerical issues, such as the near-zero difference between large numbers in the discriminant of the quadratic equation, which are better solved in other ways. Still, use of doubles is sometimes necessary.
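As one hedged illustration of "better solved in other ways" (this snippet is mine, not from the original post; it needs <math.h> for sqrtf and copysignf, assumes a non-negative discriminant, and ignores the a == 0 and q == 0 edge cases): when b*b is much larger than 4*a*c, the textbook quadratic formula subtracts two nearly equal numbers for one of the roots, and the standard fix is algebraic rather than more precision:
void solveQuadratic_stable(const float a, const float b, const float c, float *pRoot0, float *pRoot1)
{
float discriminant = b * b - 4.0f * a * c; // assumed non-negative here
float q = -0.5f * (b + copysignf(sqrtf(discriminant), b));
*pRoot0 = q / a; // this form never subtracts nearly equal quantities
*pRoot1 = c / q;
}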
The structure of the double precision float dot product is very similar to the standard float dot product.
__forceinline double horizontalSum_SSE2(const __m128d &mAB)
{
__m128d mBB = _mm_unpackhi_pd(mAB, mAB);
__m128d mApB = _mm_add_pd(mAB, mBB);
return _mm_cvtsd_f64(mApB);
}
double dotProduct_SSE2_unaligned(const double *pA, const double *pB, UINT uiABOrig)
{
UINT uiAB = uiABOrig;
double dRet = 0;
if (uiAB & 1)
{
uiAB--;
dRet = pA[uiAB] * pB[uiAB];
}
if (uiAB > 0)
{
__m128d mSummedAB = _mm_setzero_pd();
do
{
uiAB -= 2;
__m128d mA = _mm_loadu_pd((const double *)&pA[uiAB]);
__m128d mB = _mm_loadu_pd((const double *)&pB[uiAB]);
mSummedAB = _mm_add_pd(mSummedAB,_mm_mul_pd(mA, mB));
} while (uiAB > 0);
dRet += horizontalSum_SSE2(mSummedAB);
}
return dRet;
}