Sunday, July 9, 2017

A basic implementation of a 4-element Discreet Cosine Transform in SSE

The discreet cosine transform (DCT) comes up a lot in graphics and sound processing.  These will often be done with a relative of the FFT.

The inverse of the DCT is a slightly different DCT. (DCT type 2 is the inverse of DCT type 3 and vice versa)

A four element DCT is almost trivially small, but I think that it will offer opportunities for a variety of interesting optimizations.

The DCT is linear, so a 4 element transform can be written as a multiplication by a 4x4 matrix.

A straightforward implementation follows.

__m128 DCT_II(const __m128 mIn)
{
    __declspec(align(16)) static const float mDctMatrix[] = { 0.5000000000000000f, 0.4619397662556434f, 0.3535533905932738f, \
        0.1913417161825449f, 0.5000000000000000f, 0.1913417161825449f, \
        - 0.3535533905932738f, -0.4619397662556434f, 0.5000000000000000f, \
        - 0.1913417161825449f, -0.3535533905932738f, 0.4619397662556434f, \
        0.5000000000000000f, -0.4619397662556434f, 0.3535533905932738f, \
        - 0.1913417161825449f };
    __m128 mProd0 = _mm_mul_ps(_mm_shuffle_ps(mIn, mIn, _MM_SHUFFLE(0, 0, 0, 0)), _mm_load_ps(&mDctMatrix[0]));
    __m128 mProd1 = _mm_mul_ps(_mm_shuffle_ps(mIn, mIn, _MM_SHUFFLE(1, 1, 1, 1)), _mm_load_ps(&mDctMatrix[4]));
    __m128 mProd2 = _mm_mul_ps(_mm_shuffle_ps(mIn, mIn, _MM_SHUFFLE(2, 2, 2, 2)), _mm_load_ps(&mDctMatrix[8]));
    __m128 mProd3 = _mm_mul_ps(_mm_shuffle_ps(mIn, mIn, _MM_SHUFFLE(3, 3, 3, 3)), _mm_load_ps(&mDctMatrix[12]));

    return _mm_add_ps(_mm_add_ps(mProd0, mProd1), _mm_add_ps(mProd2, mProd3));
}

and its inverse:

__m128 DCT_III(const __m128 mIn)
{
    __declspec(align(16)) static const float mDctMatrix[] = { 0.5000000000000000f, 0.5000000000000000f, 0.5000000000000000f, \
        0.5000000000000000f, 0.9238795325112868f, 0.3826834323650898f, \
        - 0.3826834323650898f, -0.9238795325112868f, 0.7071067811865475f, \
        - 0.7071067811865475f, -0.7071067811865475f, 0.7071067811865475f, \
        0.3826834323650898f, -0.9238795325112868f, 0.9238795325112868f, \
        - 0.3826834323650898f };
    __m128 mProd0 = _mm_mul_ps(_mm_shuffle_ps(mIn, mIn, _MM_SHUFFLE(0, 0, 0, 0)), _mm_load_ps(&mDctMatrix[0]));
    __m128 mProd1 = _mm_mul_ps(_mm_shuffle_ps(mIn, mIn, _MM_SHUFFLE(1, 1, 1, 1)), _mm_load_ps(&mDctMatrix[4]));
    __m128 mProd2 = _mm_mul_ps(_mm_shuffle_ps(mIn, mIn, _MM_SHUFFLE(2, 2, 2, 2)), _mm_load_ps(&mDctMatrix[8]));
    __m128 mProd3 = _mm_mul_ps(_mm_shuffle_ps(mIn, mIn, _MM_SHUFFLE(3, 3, 3, 3)), _mm_load_ps(&mDctMatrix[12]));

    return _mm_add_ps(_mm_add_ps(mProd0, mProd1), _mm_add_ps(mProd2, mProd3));
}

There are opportunities for optimization here that I will be revisiting.

No comments:

Post a Comment