There are some fun optimizations that can be done for the DCT (type II in this case) when the input is striped.
The striped version takes as input, and produces as output, the transpose of what four separate standard DCT calls would use: each register holds the same sample index from four different signals.
While striping may seem like an esoteric way of doing things, a transpose is already going to be a big part of the 2-dimensional DCT, so this doesn't impose much additional work.
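To make the striped layout concrete, here is a minimal sketch of how four ordinary signals could be turned into striped registers. `LoadStriped` is a hypothetical helper, not part of the original routine, and it assumes the four 4-sample signals are stored back to back in one array of 16 floats. It relies on the `_MM_TRANSPOSE4_PS` macro from `<xmmintrin.h>`.

```cpp
#include <cassert>
#include <xmmintrin.h>  // SSE intrinsics and the _MM_TRANSPOSE4_PS macro

// Hypothetical helper (an assumption for illustration): signal s occupies
// in[4*s .. 4*s+3]. After the call, mTA holds sample 0 of all four signals,
// mTB holds sample 1, mTC sample 2, and mTD sample 3.
inline void LoadStriped(const float* in,
                        __m128& mTA, __m128& mTB, __m128& mTC, __m128& mTD)
{
    assert(in != nullptr);

    // One register per signal (unaligned loads, so any float array works).
    mTA = _mm_loadu_ps(in + 0);
    mTB = _mm_loadu_ps(in + 4);
    mTC = _mm_loadu_ps(in + 8);
    mTD = _mm_loadu_ps(in + 12);

    // In-place 4x4 transpose: rows become sample indices, columns become signals.
    _MM_TRANSPOSE4_PS(mTA, mTB, mTC, mTD);
}
```

The same transpose applied to the four output registers would put the results back into one-signal-per-register order, if that is the layout the caller needs.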
#include <xmmintrin.h>

void DCT_II_T( __m128 & mInOutTA, __m128 & mInOutTB, __m128 & mInOutTC, __m128 & mInOutTD)
{
    // 0.5 * cos(k * pi / 8) for k = 0..3
    __declspec(align(16)) static const float aDctCoefs[] = {
        0.5000000000000000f, 0.4619397662556434f, 0.3535533905932738f, 0.1913417161825449f };

    // Butterfly stage: sums and differences of the mirrored pairs (A,D) and (B,C)
    __m128 mTApTD = _mm_add_ps(mInOutTA, mInOutTD);
    __m128 mTAsTD = _mm_sub_ps(mInOutTA, mInOutTD);
    __m128 mTBpTC = _mm_add_ps(mInOutTB, mInOutTC);
    __m128 mTBsTC = _mm_sub_ps(mInOutTB, mInOutTC);

    // Even outputs (A, C) use the sums; odd outputs (B, D) use the differences
    __m128 mResultTA = _mm_mul_ps(_mm_set1_ps(aDctCoefs[0]), _mm_add_ps(mTApTD, mTBpTC));
    __m128 mResultTB = _mm_add_ps(_mm_mul_ps(_mm_set1_ps(aDctCoefs[1]), mTAsTD), _mm_mul_ps(_mm_set1_ps(aDctCoefs[3]), mTBsTC));
    __m128 mResultTC = _mm_mul_ps(_mm_set1_ps(aDctCoefs[2]), _mm_sub_ps(mTApTD, mTBpTC));
    __m128 mResultTD = _mm_sub_ps(_mm_mul_ps(_mm_set1_ps(aDctCoefs[3]), mTAsTD), _mm_mul_ps(_mm_set1_ps(aDctCoefs[1]), mTBsTC));

    // Write the four striped results back over the inputs
    mInOutTA = mResultTA;
    mInOutTB = mResultTB;
    mInOutTC = mResultTC;
    mInOutTD = mResultTD;
}
The baseline implementation took 4 multiplies and 3 adds for a single DCT. The above implementation takes 8 adds (counting subtractions) and 6 multiplies to do 4 times as much work, and its constants take up less space.
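One way to sanity-check the routine above is against a plain scalar DCT. The sketch below is not from the original code: `ScalarDct4` and `MaxAbsDiff4` are hypothetical names, and the scaling (every output multiplied by 0.5) is inferred from the values in `aDctCoefs`, which match 0.5 * cos(k * pi / 8).

```cpp
#include <cassert>
#include <cmath>

// Hypothetical scalar reference (an assumption for illustration): a direct
// 4-point DCT-II with the scaling implied by aDctCoefs, i.e.
// out[k] = 0.5 * sum over n of in[n] * cos(pi/4 * (n + 0.5) * k).
inline void ScalarDct4(const float in[4], float out[4])
{
    const double kPi = 3.14159265358979323846;
    for (int k = 0; k < 4; ++k)
    {
        double sum = 0.0;
        for (int n = 0; n < 4; ++n)
            sum += in[n] * std::cos(kPi / 4.0 * (n + 0.5) * k);
        out[k] = static_cast<float>(0.5 * sum);
    }
}

// Largest absolute difference between two 4-element results.
inline float MaxAbsDiff4(const float a[4], const float b[4])
{
    assert(a != nullptr && b != nullptr);
    float worst = 0.0f;
    for (int i = 0; i < 4; ++i)
        worst = std::fmax(worst, std::fabs(a[i] - b[i]));
    return worst;
}
```

Feeding the unit impulse {1, 0, 0, 0} through `ScalarDct4` should reproduce the four entries of `aDctCoefs` to within float rounding. To check the SSE version, run the same four signals through both paths, un-stripe the SSE output, and compare each signal with `MaxAbsDiff4` against a small tolerance such as 1e-6.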