Monday, May 29, 2017

float bilinear interpolation using SSE2

Bilinear interpolation is an easy way to smooth out sampled data.  The sampled value is known at the corners of the unit square.  Interpolation is performed linearly in each dimension.

Basically this boils down to evaluating the function:



void bilinearInterpolate_ANSI(__in_ecount(2) const float * const pXY, __in_ecount(2) const float *const pF0, __in_ecount(2) const float *const pF1, __out float *const pResult)
{
    const float &x = pXY[0];
    const float &y = pXY[1];

    *pResult = pF0[0] * (1 - x)*(1 - y) + pF0[1] * x*(1 - y) + pF1[0] * (1 - x)*y + pF1[1] * x*y;
}

That addition is either going to require a horizontal add, or will require something like a transpose for the functional values.  I've chosen to go with the horizontal addition route.

For the moment, let's look at everything up to multiplying by the f values.  It's another case where it's close to the same cost to evaluate two as to evaluate one.

void bilinearInterpolateInner_SSE(const __m128 &mXY, __out __m128 &mLeft, __out __m128 &mRight)
{
    __m128 m1111 = _mm_set1_ps(1);
    __m128 mOmXY = _mm_sub_ps(m1111, mXY);

    __m128 mYLeft = _mm_shuffle_ps(mOmXY, mXY, _MM_SHUFFLE(1, 1, 1, 1));
    __m128 mYRight = _mm_shuffle_ps(mOmXY, mXY, _MM_SHUFFLE(3, 3, 3, 3));

    __m128 mXYOmXYLeft = _mm_unpacklo_ps(mXY, mOmXY);
    __m128 mXYOmXYRight = _mm_unpackhi_ps(mXY, mOmXY);
    __m128 mXLeft = _mm_shuffle_ps(mXYOmXYLeft, mXYOmXYLeft, _MM_SHUFFLE(0, 1, 0, 1));
    __m128 mXRight = _mm_shuffle_ps(mXYOmXYRight, mXYOmXYRight, _MM_SHUFFLE(0, 1, 0, 1));

    mLeft = _mm_mul_ps(mXLeft, mYLeft);
    mRight = _mm_mul_ps(mXRight, mYRight);
}

If we have loaded the values for f into two SSE registers, this can be evaluated as follows:

__forceinline __m128 _mm_hadd_ps_SSE2(const __m128 mLeft, const __m128 mRight)
{
    __m128 mTop = _mm_shuffle_ps(mLeft, mRight, _MM_SHUFFLE(0, 2, 0, 2));
    __m128 mBottom = _mm_shuffle_ps(mLeft, mRight, _MM_SHUFFLE(1, 3, 1, 3));
    return _mm_add_ps(mTop, mBottom);
}

__forceinline void store2floats(const __m128 &mFloats, __out_ecount(2) float * const pfResults)
{
    _mm_store_sd(reinterpret_cast<double *>(pfResults), _mm_castps_pd(mFloats));
}

void bilinearInterpolateMiddle_SSE(const __m128 &mXYLeftRight, const __m128 &mFLeft, const __m128 &mFRight, __out_ecount(2) float * const pfResults)
{
    __m128 mLeft, mRight;
    bilinearInterpolateInner_SSE(mXYLeftRight, mLeft, mRight);
    __m128 mInt = _mm_hadd_ps_SSE2(_mm_mul_ps(mRight,mFRight),_mm_mul_ps(mLeft,mFLeft));
    __m128 mRet = _mm_hadd_ps_SSE2(mInt, mInt);
    store2floats(mRet, pfResults);
}

This does the final multiply against the f values.  And then does the horizontal sum.  The two horizontal sums are performed simultaneously.  The SSE3 _mm_hadd_ps instruction would be helpful, but it is sumulated using SSE2 in the above example.

No comments:

Post a Comment