Thursday, December 15, 2016

complex multiplies using SSE2

The _mm_mul_ps allows us to parallel multiply two sets of four numbers as a single instruction.  I remember something about something similar from a high-school math class.

It takes four real multiplies to perform a single complex multiply.  Let's figure out how to make this work more.



The problem we have is that horrible minus sign on the right hand side of the equals.  Until SSE3 there wasn't an add instruction that did both add and subtract. 

And we need to get all of the results lined up at the end.  For the purpose of discussion I'm going to provide a solution:

 __m128 _mm_complexMultiply(__m128 ari, __m128 bri)
{
    static const UINT uiNegativeZero = 0x80000000;
    __m128 Ar_Ai_Ai_Ar = _mm_shuffle_ps(ari, ari, _MM_SHUFFLE(0, 1, 1, 0));
    __m128 z_z_N_z= _mm_setr_ps(0,0,*(reinterpret_cast<const float *>(&uiNegativeZero)),0);
    __m128 Ar_Ai_NAi_Ar = _mm_xor_ps(Ar_Ai_Ai_Ar, z_z_N_z);


    __m128 Br_Br_Bi_Bi = _mm_shuffle_ps(bri, bri, _MM_SHUFFLE(1, 1, 0, 0));
    __m128 mMul = _mm_mul_ps(Br_Br_Bi_Bi, Ar_Ai_NAi_Ar);
    __m128 ArBrmAiBi_AiBrpArBi_x_x= _mm_add_ps(mMul, _mm_movehl_ps(Br_Br_Bi_Bi, mMul));

    return ArBrmAiBi_AiBrpArBi_x_x;
}

Looks pretty good once compiled:
00DC1176  movq        xmm0,mmword ptr [edx] 
00DC117A  movq        xmm1,mmword ptr [ecx] 
00DC117E  mov         eax,dword ptr [vri] 
00DC1181  shufps      xmm0,xmm0,14h 
00DC1185  xorps       xmm0,xmmword ptr [__xmm@00000000800000000000000000000000 (0DC2110h)] 
00DC118C  shufps      xmm1,xmm1,50h 
00DC1190  mulps       xmm0,xmm1 
00DC1193  movhlps     xmm1,xmm0 
00DC1196  addps       xmm1,xmm0 
00DC1199  movq        mmword ptr [eax],xmm1 

Yup - that still looks okay.

The above has two shuffles but is designed to minimize the number of operations after the multiply.  It seems like there should be a way of reducing latency until there.

While there seem to be ways of reducing latency for a single multiply, those ways seem to preclude better things.  I ask you to trust me on this.

Okay news: The above does leave meaningful garbage in the upper half of the XMM register.  But that's easy to zero out or ignore during the copy to memory.

Good news: The above gets it down to two shuffles, one xor, one multiply, and one movehl, and one add.  But much of the time we will be multiplying a single vector against a large bunch of other runtime vectors (As is the case with the implementation of FFT as described in the book "Numerical Recipes in C") .  (some of this is described in an Intel blog post that I remember but can no longer find despite deep google searches).  This allows easy optimization of computing the vector for A and using many times with different B vectors.

Okay news: This is pretty efficient. The place where we can improve is in that horrible negative.  While yes, a complex multiplication could be performed in three multiplies, the cost of additions, subtractions, and shuffles with the SSE2 instruction set would overwhelm this strategy with a register width of 128bits.  We are pretty much never going to do better than one 4 element multiply per complex multiplication.

Good news: Within the SSE2 instruction set we can however spread the overhead cost in a couple ways.  One way is to factor the A shuffle and xor out, by multiplying a single complex number against a multitude of other complex numbers.

Good news if we can make it work: Another way to address this problem is to spread the negation over several multiplications.  There are already enough shuffles floating around to make it possible to negate several complex multiplications in a single xor.

We do lose a cycle for each transition between float and int.  The observant reader will note three of these in the above code.  But I think it's worth it.

We will address performing multiple complex multiplications concurrently in a future post.

Yup.  This is a cliffhanger.  I'm betting that it is worth a couple machine cycles to read the followup. :)

No comments:

Post a Comment