Thursday, December 15, 2016

Multiplying two complex numbers by two other complex numbers

Multiplying complex numbers together has the problem that the necessary negation is complicated in the SSE2 instruction set.

One thing that happens often with SSE2 programming is that sometimes it's better to make the problem more complicated.  In this case it is often better to multiply two complex numbers by two other complex numbers at the same time.

__forceinline __m128 _mm_complexMultiplyTwo(const __m128 &ar_ai_cr_ci, const __m128 &br_bi_dr_di)
{
    // this is a complex multiply that complex multiplies two sets of complex numbers at the same time.
    // this would be lots easier with the SSE3 instructions.

    // The number 0x80000000 reinterpreted as a float is negative zero.
    // xor'ing against this constant will negate two numbers.
    __declspec(align(16)) static const UINT32 neg02[4] = { 0x80000000, 0, 0x80000000 ,0 };
    __m128 mneg02 = _mm_load_ps(reinterpret_cast<const float*>(&neg02[0]));

    // if we call the input vectors (ar, ai, cr, ci) and (br, bi, dr, di)
    // we are trying to get (ar*br - ai*bi, ai*br + ar*bi, cr*dr - ci*di, ci*dr + cr*di)
    // so we need four multipies

    __m128 ar_ar_cr_cr = _mm_shuffle_ps(ar_ai_cr_ci, ar_ai_cr_ci, _MM_SHUFFLE(2, 2, 0, 0));
    __m128 ai_ai_ci_ci = _mm_shuffle_ps(ar_ai_cr_ci, ar_ai_cr_ci, _MM_SHUFFLE(3, 3, 1, 1));
    __m128 bi_br_di_dr = _mm_shuffle_ps(br_bi_dr_di, br_bi_dr_di, _MM_SHUFFLE(2, 3, 0, 1));

    __m128 arbr_arbi_crdr_crdi = _mm_mul_ps(ar_ar_cr_cr, br_bi_dr_di);
    __m128 aibi_aibr_cidi_cidr = _mm_mul_ps(ai_ai_ci_ci, bi_br_di_dr);

    __m128 Naibi_aibr_Ncidi_cidr = _mm_xor_ps(aibi_aibr_cidi_cidr, mneg1);
    __m128 arbrMaibi_arbiPaibr_crdrMcidi_crciPcidr = _mm_add_ps(arbr_arbi_crdr_crdi, Naibi_aibr_Ncidi_cidr);
    return arbrMaibi_arbiPaibr_crdrMcidi_crciPcidr;
}


This compiles to:

movaps      xmm0,xmm3 
movaps      xmm1,xmm2 
shufps      xmm0,xmm3,0B1h 
shufps      xmm1,xmm2,0F5h 
mulps       xmm1,xmm0 
shufps      xmm2,xmm2,0A0h 
mulps       xmm2,xmm3 
xorps       xmm1,xmmword ptr [`_mm_complexMultiplyTwo'::`2'::neg1 (010D2100h)] 
addps       xmm1,xmm2 

This looks pretty good.  In particular it uses very few registers.  The one problem is that load from a constant.  But most of the time we are performing this in a loop, and the compiler is good enough to factor that out of the loop for us and use one of the other registers.

It takes two moves, three shuffles, two multiplies, an add, and an xor against a memory read. Really that's only one additional multiply from the previously discussed multiplying together just one set of complex numbers.

No comments:

Post a Comment