SSE2 provides the intrinsic _mm_andnot_si128 (the PANDN instruction), which computes (~a) & b and makes a bitwise blend straightforward to write:
__forceinline __m128i _mm_bitwise_blend_v1_si128(__m128i a, __m128i b, __m128i mask)
{
    return _mm_or_si128(_mm_and_si128(mask, b), _mm_andnot_si128(mask, a));
}
For each bit position, the result takes the bit from a where the mask bit is 0 and the bit from b where the mask bit is 1.
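Because the SSE2 comparisons produce lanes of all ones or all zeros, the blend lets us compute things like minimum and maximum even without the _mm_min_epi32 and _mm_max_epi32 intrinsics, which were only introduced with SSE4.1: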
__m128i m_a_lt_b = _mm_cmplt_epi32(a, b);
__m128i m_min = _mm_bitwise_blend_v1_si128(b, a, m_a_lt_b);
__m128i m_max = _mm_bitwise_blend_v1_si128(a, b, m_a_lt_b);
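As a rough, self-contained sketch of the min/max trick above, the snippet below wraps it in small helpers. The names blend_si128, min_epi32_sse2, max_epi32_sse2 and min_max_4 are made up for this illustration and are not part of any library. Portable static inline is used in place of __forceinline.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Bitwise blend: bits from b where mask is 1, bits from a where mask is 0. */
static inline __m128i blend_si128(__m128i a, __m128i b, __m128i mask)
{
    return _mm_or_si128(_mm_and_si128(mask, b), _mm_andnot_si128(mask, a));
}

/* Hypothetical helper: lane-wise signed 32-bit minimum using only SSE2. */
static inline __m128i min_epi32_sse2(__m128i a, __m128i b)
{
    __m128i a_lt_b = _mm_cmplt_epi32(a, b);  /* all ones in lanes where a < b */
    return blend_si128(b, a, a_lt_b);        /* a where a < b, otherwise b */
}

/* Hypothetical helper: lane-wise signed 32-bit maximum using only SSE2. */
static inline __m128i max_epi32_sse2(__m128i a, __m128i b)
{
    __m128i a_lt_b = _mm_cmplt_epi32(a, b);
    return blend_si128(a, b, a_lt_b);        /* b where a < b, otherwise a */
}

/* Hypothetical wrapper for trying the helpers on plain 4-element arrays. */
static void min_max_4(const int32_t *x, const int32_t *y,
                      int32_t *out_min, int32_t *out_max)
{
    assert(x && y && out_min && out_max);
    __m128i a = _mm_loadu_si128((const __m128i *)x);
    __m128i b = _mm_loadu_si128((const __m128i *)y);
    _mm_storeu_si128((__m128i *)out_min, min_epi32_sse2(a, b));
    _mm_storeu_si128((__m128i *)out_max, max_epi32_sse2(a, b));
}
```

Calling min_max_4 with x = {1, -5, 7, 100} and y = {2, -9, 7, -100} should give {1, -9, 7, -100} as the minimum and {2, -5, 7, 100} as the maximum.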
There are other ways of performing the same blend operation:
__forceinline __m128i _mm_bitwise_blend_v2_si128(__m128i a, __m128i b, __m128i mask)
{
    return _mm_xor_si128(a, _mm_and_si128(_mm_xor_si128(a, b), mask));
}
This second version is sometimes preferable to the first. The two compute the same result and differ in which input survives in a register: the assembly generated for the first version finishes with mask still in a register, whereas the second version leaves a in a register, which helps when a is needed again afterwards.
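A quick way to convince yourself that the two formulations agree is to run both on the same inputs and compare. The sketch below does that for 32-bit patterns broadcast to every lane. The names blend_v1, blend_v2 and blends_agree are invented for this example.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* AND / ANDNOT / OR formulation. */
static inline __m128i blend_v1(__m128i a, __m128i b, __m128i mask)
{
    return _mm_or_si128(_mm_and_si128(mask, b), _mm_andnot_si128(mask, a));
}

/* XOR / AND / XOR formulation. */
static inline __m128i blend_v2(__m128i a, __m128i b, __m128i mask)
{
    return _mm_xor_si128(a, _mm_and_si128(_mm_xor_si128(a, b), mask));
}

/* Hypothetical check: broadcasts the three bit patterns to all lanes,
   blends them both ways, stores the low 32 bits of the first result in
   *out, and returns 1 when both results are identical in all 128 bits. */
static int blends_agree(uint32_t a_bits, uint32_t b_bits, uint32_t mask_bits,
                        uint32_t *out)
{
    assert(out);
    __m128i a = _mm_set1_epi32((int)a_bits);
    __m128i b = _mm_set1_epi32((int)b_bits);
    __m128i m = _mm_set1_epi32((int)mask_bits);
    __m128i r1 = blend_v1(a, b, m);
    __m128i r2 = blend_v2(a, b, m);
    *out = (uint32_t)_mm_cvtsi128_si32(r1);
    /* cmpeq sets a byte to 0xFF where equal; movemask collects 16 top bits. */
    return _mm_movemask_epi8(_mm_cmpeq_epi8(r1, r2)) == 0xFFFF;
}
```

With a = 0xAAAAAAAA, b = 0x55555555 and mask = 0x0F0F0F0F, both versions should yield 0xA5A5A5A5: the low nibble of every byte comes from b and the high nibble from a, which also shows that the selection really is bit by bit.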
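The blend was later formalized in the instruction set: SSE4.1 adds _mm_blendv_epi8, along with related intrinsics such as _mm_blendv_ps and _mm_blendv_pd. Note that _mm_blendv_epi8 selects whole bytes according to the top bit of each mask byte rather than selecting bit by bit, so it matches the bitwise blend only when every mask byte is all ones or all zeros, as is the case for comparison results. When the processor supports SSE4.1, this is the better choice, since the whole blend becomes a single instruction.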
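For comparison, here is a minimal sketch of the SSE4.1 route. It assumes a GCC or Clang style compiler, where the target attribute enables SSE4.1 for individual functions without a global compiler flag. Other compilers get an empty macro. The names TARGET_SSE41, max_4_sse41 and blendv_low32 are hypothetical.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <smmintrin.h>  /* SSE4.1 intrinsics, including _mm_blendv_epi8 */
#include <stdint.h>

/* Assumption: GCC/Clang need SSE4.1 enabled per function (or via -msse4.1). */
#if defined(__GNUC__) || defined(__clang__)
#define TARGET_SSE41 __attribute__((target("sse4.1")))
#else
#define TARGET_SSE41
#endif

/* Hypothetical helper: lane-wise signed 32-bit maximum via one blend. */
TARGET_SSE41
static void max_4_sse41(const int32_t *x, const int32_t *y, int32_t *out)
{
    assert(x && y && out);
    __m128i a = _mm_loadu_si128((const __m128i *)x);
    __m128i b = _mm_loadu_si128((const __m128i *)y);
    __m128i a_lt_b = _mm_cmplt_epi32(a, b);  /* all ones where a < b */
    /* Takes bytes from b where the mask byte's top bit is set, else from a. */
    _mm_storeu_si128((__m128i *)out, _mm_blendv_epi8(a, b, a_lt_b));
}

/* Hypothetical helper: low 32 bits of _mm_blendv_epi8 on broadcast patterns,
   useful for seeing that selection is per byte, not per bit. */
TARGET_SSE41
static uint32_t blendv_low32(uint32_t a_bits, uint32_t b_bits, uint32_t mask_bits)
{
    __m128i a = _mm_set1_epi32((int)a_bits);
    __m128i b = _mm_set1_epi32((int)b_bits);
    __m128i m = _mm_set1_epi32((int)mask_bits);
    return (uint32_t)_mm_cvtsi128_si32(_mm_blendv_epi8(a, b, m));
}
```

With a = 0xAAAAAAAA and b = 0x55555555, a mask of 0x0F0F0F0F selects nothing from b here, because no mask byte has its top bit set, while a mask of 0x80008000 picks whole bytes from b and gives 0x55AA55AA. This code needs a CPU with SSE4.1 at run time, so a real program would typically check for support before calling it.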