Sunday, December 25, 2016

different ways to blend

SSE2 provides the intrinsic _mm_andnot_si128 (the PANDN instruction), which computes (NOT a) AND b and makes bitwise blend operations possible:

#include <emmintrin.h>  // SSE2

// Per bit: take b where mask is 1, a where mask is 0.
__forceinline __m128i _mm_bitwise_blend_v1_si128(__m128i a, __m128i b, __m128i mask)
{
    return _mm_or_si128(_mm_and_si128(mask, b), _mm_andnot_si128(mask, a));
}


Bit by bit, the result takes the bit from a where the corresponding mask bit is 0, and from b where it is 1.
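
To make the bit-level selection concrete, here is a minimal sketch with fixed bit patterns. The names bitwise_blend_v1, blend_low_lane_demo and check_blend_demo are hypothetical, chosen for this illustration, and plain "static inline" stands in for __forceinline on the assumption that the code should also build outside MSVC.

```c
#include <assert.h>
#include <stdint.h>
#include <emmintrin.h>  /* SSE2 */

/* Same expression as the first blend in the post (hypothetical name). */
static inline __m128i bitwise_blend_v1(__m128i a, __m128i b, __m128i mask)
{
    return _mm_or_si128(_mm_and_si128(mask, b), _mm_andnot_si128(mask, a));
}

/* Blend two constant patterns and return the low 32-bit lane. */
static uint32_t blend_low_lane_demo(void)
{
    __m128i a    = _mm_set1_epi32((int)0xAAAAAAAAu);
    __m128i b    = _mm_set1_epi32(0x55555555);
    __m128i mask = _mm_set1_epi32(0x0000FFFF);  /* 1 bits pick b in the low 16 bits */
    __m128i r    = bitwise_blend_v1(a, b, mask);
    return (uint32_t)_mm_cvtsi128_si32(r);
}

/* The high half comes from a (0xAAAA), the low half from b (0x5555). */
static void check_blend_demo(void)
{
    assert(blend_low_lane_demo() == 0xAAAA5555u);
}
```

Because the selection is per bit, the mask does not have to be aligned to lanes or bytes; any bit pattern works.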

This lets us compute things like minimum and maximum even without the _mm_min_epi32 and _mm_max_epi32 intrinsics introduced with SSE4.1:

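// All-ones in each 32-bit lane where a < b (signed compare), zero elsewhere.
__m128i m_a_lt_b = _mm_cmplt_epi32(a, b);
__m128i m_min = _mm_bitwise_blend_v1_si128(b, a, m_a_lt_b);  // a where a < b, else b
__m128i m_max = _mm_bitwise_blend_v1_si128(a, b, m_a_lt_b);  // b where a < b, else a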
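
As a self-contained illustration of the minimum and maximum above, the sketch below wraps the compare-and-blend pattern in two helpers and runs them on sample values. The helper names (min_epi32_sse2, max_epi32_sse2, minmax_demo, minmax_demo_ok) and the inputs are assumptions made for this example, not part of any library.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 */

/* Same expression as the first blend in the post (hypothetical name). */
static inline __m128i bitwise_blend_v1(__m128i a, __m128i b, __m128i mask)
{
    return _mm_or_si128(_mm_and_si128(mask, b), _mm_andnot_si128(mask, a));
}

/* Signed 32-bit minimum using only SSE2. */
static inline __m128i min_epi32_sse2(__m128i a, __m128i b)
{
    __m128i a_lt_b = _mm_cmplt_epi32(a, b);  /* all-ones where a < b */
    return bitwise_blend_v1(b, a, a_lt_b);   /* a where a < b, else b */
}

/* Signed 32-bit maximum using only SSE2. */
static inline __m128i max_epi32_sse2(__m128i a, __m128i b)
{
    __m128i a_lt_b = _mm_cmplt_epi32(a, b);
    return bitwise_blend_v1(a, b, a_lt_b);   /* b where a < b, else a */
}

/* Store both results to plain arrays so they can be inspected. */
static void minmax_demo(int out_min[4], int out_max[4])
{
    __m128i a = _mm_setr_epi32(1, -5, 7, 100);
    __m128i b = _mm_setr_epi32(2, -9, 7, -100);
    _mm_storeu_si128((__m128i *)out_min, min_epi32_sse2(a, b));
    _mm_storeu_si128((__m128i *)out_max, max_epi32_sse2(a, b));
}

/* Returns 1 when every lane holds the expected minimum and maximum. */
static int minmax_demo_ok(void)
{
    int mn[4], mx[4];
    minmax_demo(mn, mx);
    assert(mn[0] == 1 && mn[1] == -9 && mn[2] == 7 && mn[3] == -100);
    assert(mx[0] == 2 && mx[1] == -5 && mx[2] == 7 && mx[3] == 100);
    return 1;
}
```

Equal lanes (7 and 7 here) compare as "not less than", so both helpers return the shared value, whichever operand the blend happens to pick.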


There are other ways of performing the same blend operation:

// a ^ ((a ^ b) & mask): where mask is 1 the a bits cancel and leave b; where mask is 0 the result is a.
__forceinline __m128i _mm_bitwise_blend_v2_si128(__m128i a, __m128i b, __m128i mask)
{
    return _mm_xor_si128(a, _mm_and_si128(_mm_xor_si128(a, b), mask));
}


This is sometimes better than the first version. The generated assembly for the first version finishes with mask still in a register, while the generated assembly for the second version has the benefit of leaving a in a register afterwards.
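
The two formulations can be checked against each other. Per bit, a mask bit of 0 gives a ^ 0 = a, and a mask bit of 1 gives a ^ (a ^ b) = b, which is exactly what the AND/ANDNOT/OR form produces. The following is a rough sketch of such a check; the function names and test patterns are hypothetical and chosen only for illustration.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 */

/* AND / ANDNOT / OR form (hypothetical name). */
static inline __m128i bitwise_blend_v1(__m128i a, __m128i b, __m128i mask)
{
    return _mm_or_si128(_mm_and_si128(mask, b), _mm_andnot_si128(mask, a));
}

/* XOR / AND / XOR form (hypothetical name). */
static inline __m128i bitwise_blend_v2(__m128i a, __m128i b, __m128i mask)
{
    return _mm_xor_si128(a, _mm_and_si128(_mm_xor_si128(a, b), mask));
}

/* Returns 1 when both forms produce identical 128-bit results. */
static int blends_agree(__m128i a, __m128i b, __m128i mask)
{
    __m128i r1 = bitwise_blend_v1(a, b, mask);
    __m128i r2 = bitwise_blend_v2(a, b, mask);
    /* cmpeq gives 0xFF per equal byte; movemask collects 16 sign bits. */
    return _mm_movemask_epi8(_mm_cmpeq_epi8(r1, r2)) == 0xFFFF;
}

/* Try a few masks, including ones that are not aligned to lanes. */
static int blends_agree_demo(void)
{
    __m128i a  = _mm_setr_epi32(0x01234567, (int)0x89ABCDEFu, 0, -1);
    __m128i b  = _mm_setr_epi32((int)0xFEDCBA98u, 0x76543210, -1, 0);
    __m128i m1 = _mm_setr_epi32(0x0F0F0F0F, (int)0xFFFF0000u, 0, -1);
    __m128i m2 = _mm_set1_epi32(0x13579BDF);
    return blends_agree(a, b, m1) && blends_agree(a, b, m2);
}

static void check_blends_agree(void)
{
    assert(blends_agree_demo());
}
```

A handful of patterns is not a proof, but the per-bit argument above covers all four combinations of a and b bits for each mask value.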

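The benefit of blending was formalized in the instruction set with the SSE4.1 intrinsic _mm_blendv_epi8 and others. Note that _mm_blendv_epi8 selects whole bytes using only the top bit of each mask byte, so it matches the bitwise blend when every mask byte is all ones or all zeros, as is the case for comparison results. When the processor supports SSE4.1, this is much better: one blend replaces the three logical operations above.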
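
As a hedged sketch of how the SSE4.1 form relates to the bitwise one, the code below feeds the same comparison mask to both and checks that the results match. The TARGET_SSE41 macro and the function names are assumptions for this example: the macro uses the GCC/Clang target attribute so the file builds without a global -msse4.1 flag, and the code must only be run on a CPU that actually supports SSE4.1.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 */
#include <smmintrin.h>  /* SSE4.1 */

/* Assumption: GCC/Clang need the target enabled per function; MSVC does not. */
#if defined(__GNUC__) || defined(__clang__)
#define TARGET_SSE41 __attribute__((target("sse4.1")))
#else
#define TARGET_SSE41
#endif

/* AND / ANDNOT / OR form from the post (hypothetical name). */
static inline __m128i bitwise_blend_v1(__m128i a, __m128i b, __m128i mask)
{
    return _mm_or_si128(_mm_and_si128(mask, b), _mm_andnot_si128(mask, a));
}

/* Returns 1 when _mm_blendv_epi8 and the bitwise blend agree for a
   comparison mask (each byte is 0x00 or 0xFF). */
TARGET_SSE41
static int blendv_matches_bitwise_demo(void)
{
    __m128i a      = _mm_setr_epi32(1, -5, 7, 100);
    __m128i b      = _mm_setr_epi32(2, -9, 7, -100);
    __m128i a_lt_b = _mm_cmplt_epi32(a, b);

    __m128i r_bitwise = bitwise_blend_v1(a, b, a_lt_b);
    /* Picks the byte from b where the top bit of the mask byte is set. */
    __m128i r_blendv  = _mm_blendv_epi8(a, b, a_lt_b);

    return _mm_movemask_epi8(_mm_cmpeq_epi8(r_bitwise, r_blendv)) == 0xFFFF;
}

static void check_blendv_demo(void)
{
    assert(blendv_matches_bitwise_demo());
}
```

With a mask such as 0x0F in every byte the two would differ, since _mm_blendv_epi8 would see a clear top bit and take the whole byte from a.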
