Multiplying complex numbers together has the problem that the necessary negation is complicated in the SSE2 instruction set.
One thing that happens often with SSE2 programming is that sometimes it's better to make the problem more complicated. In this case it is often better to multiply two complex numbers by two other complex numbers at the same time.
__forceinline __m128 _mm_complexMultiplyTwo(const __m128 &ar_ai_cr_ci, const __m128 &br_bi_dr_di)
{
// this is a complex multiply that complex multiplies two sets of complex numbers at the same time.
// this would be lots easier with the SSE3 instructions.
// The number 0x80000000 reinterpreted as a float is negative zero.
// xor'ing against this constant will negate two numbers.
__declspec(align(16)) static const UINT32 neg02[4] = { 0x80000000, 0, 0x80000000 ,0 };
__m128 mneg02 = _mm_load_ps(reinterpret_cast<const float*>(&neg02[0]));
// if we call the input vectors (ar, ai, cr, ci) and (br, bi, dr, di)
// we are trying to get (ar*br - ai*bi, ai*br + ar*bi, cr*dr - ci*di, ci*dr + cr*di)
// so we need four multipies
__m128 ar_ar_cr_cr = _mm_shuffle_ps(ar_ai_cr_ci, ar_ai_cr_ci, _MM_SHUFFLE(2, 2, 0, 0));
__m128 ai_ai_ci_ci = _mm_shuffle_ps(ar_ai_cr_ci, ar_ai_cr_ci, _MM_SHUFFLE(3, 3, 1, 1));
__m128 bi_br_di_dr = _mm_shuffle_ps(br_bi_dr_di, br_bi_dr_di, _MM_SHUFFLE(2, 3, 0, 1));
__m128 arbr_arbi_crdr_crdi = _mm_mul_ps(ar_ar_cr_cr, br_bi_dr_di);
__m128 aibi_aibr_cidi_cidr = _mm_mul_ps(ai_ai_ci_ci, bi_br_di_dr);
__m128 Naibi_aibr_Ncidi_cidr = _mm_xor_ps(aibi_aibr_cidi_cidr, mneg1);
__m128 arbrMaibi_arbiPaibr_crdrMcidi_crciPcidr = _mm_add_ps(arbr_arbi_crdr_crdi, Naibi_aibr_Ncidi_cidr);
return arbrMaibi_arbiPaibr_crdrMcidi_crciPcidr;
}
This compiles to:
movaps xmm0,xmm3
movaps xmm1,xmm2
shufps xmm0,xmm3,0B1h
shufps xmm1,xmm2,0F5h
mulps xmm1,xmm0
shufps xmm2,xmm2,0A0h
mulps xmm2,xmm3
xorps xmm1,xmmword ptr [`_mm_complexMultiplyTwo'::`2'::neg1 (010D2100h)]
addps xmm1,xmm2
This looks pretty good. In particular it uses very few registers. The one problem is that load from a constant. But most of the time we are performing this in a loop, and the compiler is good enough to factor that out of the loop for us and use one of the other registers.
It takes two moves, three shuffles, two multiplies, an add, and an xor against a memory read. Really that's only one additional multiply from the previously discussed multiplying together just one set of complex numbers.
No comments:
Post a Comment