Sunday, July 9, 2017

A basic implementation of a 4-element Discreet Cosine Transform in SSE

The discreet cosine transform (DCT) comes up a lot in graphics and sound processing.  These will often be done with a relative of the FFT.

The inverse of the DCT is a slightly different DCT. (DCT type 2 is the inverse of DCT type 3 and vice versa)

A four element DCT is almost trivially small, but I think that it will offer opportunities for a variety of interesting optimizations.

The DCT is linear, so a 4 element transform can be written as a multiplication by a 4x4 matrix.

A straightforward implementation follows.

__m128 DCT_II(const __m128 mIn)
{
    __declspec(align(16)) static const float mDctMatrix[] = { 0.5000000000000000f, 0.4619397662556434f, 0.3535533905932738f, \
        0.1913417161825449f, 0.5000000000000000f, 0.1913417161825449f, \
        - 0.3535533905932738f, -0.4619397662556434f, 0.5000000000000000f, \
        - 0.1913417161825449f, -0.3535533905932738f, 0.4619397662556434f, \
        0.5000000000000000f, -0.4619397662556434f, 0.3535533905932738f, \
        - 0.1913417161825449f };
    __m128 mProd0 = _mm_mul_ps(_mm_shuffle_ps(mIn, mIn, _MM_SHUFFLE(0, 0, 0, 0)), _mm_load_ps(&mDctMatrix[0]));
    __m128 mProd1 = _mm_mul_ps(_mm_shuffle_ps(mIn, mIn, _MM_SHUFFLE(1, 1, 1, 1)), _mm_load_ps(&mDctMatrix[4]));
    __m128 mProd2 = _mm_mul_ps(_mm_shuffle_ps(mIn, mIn, _MM_SHUFFLE(2, 2, 2, 2)), _mm_load_ps(&mDctMatrix[8]));
    __m128 mProd3 = _mm_mul_ps(_mm_shuffle_ps(mIn, mIn, _MM_SHUFFLE(3, 3, 3, 3)), _mm_load_ps(&mDctMatrix[12]));

    return _mm_add_ps(_mm_add_ps(mProd0, mProd1), _mm_add_ps(mProd2, mProd3));
}

and its inverse:

__m128 DCT_III(const __m128 mIn)
{
    __declspec(align(16)) static const float mDctMatrix[] = { 0.5000000000000000f, 0.5000000000000000f, 0.5000000000000000f, \
        0.5000000000000000f, 0.9238795325112868f, 0.3826834323650898f, \
        - 0.3826834323650898f, -0.9238795325112868f, 0.7071067811865475f, \
        - 0.7071067811865475f, -0.7071067811865475f, 0.7071067811865475f, \
        0.3826834323650898f, -0.9238795325112868f, 0.9238795325112868f, \
        - 0.3826834323650898f };
    __m128 mProd0 = _mm_mul_ps(_mm_shuffle_ps(mIn, mIn, _MM_SHUFFLE(0, 0, 0, 0)), _mm_load_ps(&mDctMatrix[0]));
    __m128 mProd1 = _mm_mul_ps(_mm_shuffle_ps(mIn, mIn, _MM_SHUFFLE(1, 1, 1, 1)), _mm_load_ps(&mDctMatrix[4]));
    __m128 mProd2 = _mm_mul_ps(_mm_shuffle_ps(mIn, mIn, _MM_SHUFFLE(2, 2, 2, 2)), _mm_load_ps(&mDctMatrix[8]));
    __m128 mProd3 = _mm_mul_ps(_mm_shuffle_ps(mIn, mIn, _MM_SHUFFLE(3, 3, 3, 3)), _mm_load_ps(&mDctMatrix[12]));

    return _mm_add_ps(_mm_add_ps(mProd0, mProd1), _mm_add_ps(mProd2, mProd3));
}

There are opportunities for optimization here that I will be revisiting.

Using the inverse square root to calculate 2d normal vectors in SSE

The inverse square root (1/√x) comes up in computer graphics.  This particularly comes up in the calculation of unit direction vectors.  Game graphics have famously used a low-precision fast version.
We can calculate the unit direction vector for two separate vectors at the same time.

template<bool bUseFasterButLessAccurate>
__m128 directionVectorsSSE(const __m128 &mIn)
{
    __m128 mInSq = _mm_mul_ps(mIn, mIn);
    __m128 mHorizontalSumSq = _mm_add_ps(mInSq, _mm_shuffle_ps(mInSq, mInSq, _MM_SHUFFLE(2, 3, 0, 1)));
    __m128 mScaled;
    if (bUseFasterButLessAccurate)
    {
        __m128 mRsqrt = _mm_rsqrt_ps(mHorizontalSumSq);
        mScaled = _mm_mul_ps(mIn, mRsqrt);
    }
    else
    {
        __m128 mSqrt = _mm_sqrt_ps(mHorizontalSumSq);
        mScaled = _mm_div_ps(mIn, mSqrt);
    }
    __m128 mIsZero = _mm_cmpgt_ps(mHorizontalSumSq, _mm_setzero_ps());
    __m128 mScaledIsZero = _mm_and_ps(mScaled, mIsZero);
    return mScaledIsZero;
}

This will use either the direct calculation or less accurate version based on the template parameter.

It seems like there should be a way to make use of the actual horizontal sum instruction, but that would just move the required shuffle, so does not benefit the code.

We do have a division by something that could be zero.  We don't want to return NaN for a zero input vector.  The above code handles this by returning a zero vector by and'ing the division against the bit result of a comparison.