The _mm_mul_ps intrinsic lets us multiply two sets of four floats in parallel as a single instruction. And I remember something relevant from a high-school math class: it takes four real multiplies to perform a single complex multiply. Let's figure out how to put the two together.
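To see where those four multiplies come from, here's a scalar refresher (the struct and function are my own, purely for illustration): with A = Ar + i·Ai and B = Br + i·Bi, the product is (ArBr - AiBi) + i·(ArBi + AiBr), that is, four real multiplies, one subtract, and one add.

```cpp
struct Complexf { float re, im; };

// Four real multiplies, one subtract, one add.
static Complexf complexMultiply(Complexf a, Complexf b)
{
    Complexf r;
    r.re = a.re * b.re - a.im * b.im; // the minus sign is the troublemaker
    r.im = a.re * b.im + a.im * b.re;
    return r;
}
// e.g. (1 + 2i)(3 + 4i) = -5 + 10i
```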
The problem we have is that horrible minus sign on the right-hand side of the equals: the real part is ArBr - AiBi. Until SSE3 there wasn't an instruction that did both an add and a subtract. And we need to get all of the results lined up at the end. For the purposes of discussion I'm going to provide a solution:
#include <xmmintrin.h> // SSE intrinsics

__m128 _mm_complexMultiply(__m128 ari, __m128 bri)
{
    // 0x80000000 is an IEEE-754 float with only the sign bit set: negative zero.
    static const unsigned int uiNegativeZero = 0x80000000;
    __m128 Ar_Ai_Ai_Ar = _mm_shuffle_ps(ari, ari, _MM_SHUFFLE(0, 1, 1, 0));
    __m128 z_z_N_z = _mm_setr_ps(0, 0, *(reinterpret_cast<const float *>(&uiNegativeZero)), 0);
    // Flip the sign bit of lane 2: Ar, Ai, -Ai, Ar
    __m128 Ar_Ai_NAi_Ar = _mm_xor_ps(Ar_Ai_Ai_Ar, z_z_N_z);
    __m128 Br_Br_Bi_Bi = _mm_shuffle_ps(bri, bri, _MM_SHUFFLE(1, 1, 0, 0));
    // ArBr, AiBr, -AiBi, ArBi
    __m128 mMul = _mm_mul_ps(Br_Br_Bi_Bi, Ar_Ai_NAi_Ar);
    // Fold the high pair onto the low pair: ArBr - AiBi, AiBr + ArBi, x, x
    __m128 ArBrmAiBi_AiBrpArBi_x_x = _mm_add_ps(mMul, _mm_movehl_ps(Br_Br_Bi_Bi, mMul));
    return ArBrmAiBi_AiBrpArBi_x_x;
}
Looks pretty good once compiled:
00DC1176 movq xmm0,mmword ptr [edx]
00DC117A movq xmm1,mmword ptr [ecx]
00DC117E mov eax,dword ptr [vri]
00DC1181 shufps xmm0,xmm0,14h
00DC1185 xorps xmm0,xmmword ptr [__xmm@00000000800000000000000000000000 (0DC2110h)]
00DC118C shufps xmm1,xmm1,50h
00DC1190 mulps xmm0,xmm1
00DC1193 movhlps xmm1,xmm0
00DC1196 addps xmm1,xmm0
00DC1199 movq mmword ptr [eax],xmm1
Yup - that still looks okay.
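As a quick sanity check (the routine is repeated here so the snippet compiles on its own, and the check_example helper is mine): (1 + 2i)(3 + 4i) should come out as -5 + 10i in the low two lanes.

```cpp
#include <xmmintrin.h>

__m128 _mm_complexMultiply(__m128 ari, __m128 bri)
{
    static const unsigned int uiNegativeZero = 0x80000000;
    __m128 Ar_Ai_Ai_Ar = _mm_shuffle_ps(ari, ari, _MM_SHUFFLE(0, 1, 1, 0));
    __m128 z_z_N_z = _mm_setr_ps(0, 0, *(reinterpret_cast<const float *>(&uiNegativeZero)), 0);
    __m128 Ar_Ai_NAi_Ar = _mm_xor_ps(Ar_Ai_Ai_Ar, z_z_N_z);
    __m128 Br_Br_Bi_Bi = _mm_shuffle_ps(bri, bri, _MM_SHUFFLE(1, 1, 0, 0));
    __m128 mMul = _mm_mul_ps(Br_Br_Bi_Bi, Ar_Ai_NAi_Ar);
    return _mm_add_ps(mMul, _mm_movehl_ps(Br_Br_Bi_Bi, mMul));
}

// Usage: the low two lanes of the result hold the real and imaginary parts.
inline void check_example(float out[4])
{
    __m128 a = _mm_setr_ps(1.0f, 2.0f, 0.0f, 0.0f); // 1 + 2i
    __m128 b = _mm_setr_ps(3.0f, 4.0f, 0.0f, 0.0f); // 3 + 4i
    _mm_storeu_ps(out, _mm_complexMultiply(a, b));  // out[0] = -5, out[1] = 10
}
```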
The above has two shuffles before the multiply but is designed to minimize the number of operations after the multiply. It seems like there should be a way of reducing the latency leading up to the multiply. While there are ways of reducing latency for a single complex multiply, those ways preclude the better optimizations that follow. I ask you to trust me on this.
Okay news: The above does leave garbage in the upper half of the XMM register. But that's easy to zero out, or to simply ignore during the copy to memory.
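For the "ignore it" option, a movq-style store writes only the low eight bytes, which is exactly what the compiler generated in the listing above; a small sketch (the function name is mine):

```cpp
#include <xmmintrin.h>

// Store only the low two floats (the real/imag pair) and never touch
// the garbage in the upper half of the register.
inline void storeComplex(float *dst, __m128 reImXX)
{
    _mm_storel_pi(reinterpret_cast<__m64 *>(dst), reImXX);
}
```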
Good news: The above gets it down to two shuffles, one xor, one multiply, one movehl, and one add. Better still, much of the time we will be multiplying a single vector against a large batch of other runtime vectors, as is the case with the implementation of the FFT described in the book "Numerical Recipes in C". (Some of this is described in an Intel blog post that I remember but can no longer find despite deep Google searches.) That usage pattern makes it easy to compute the shuffled-and-negated vector for A once and reuse it many times with different B vectors.
Okay news: This is pretty efficient already. The place where we can improve is that horrible negative. While, yes, a complex multiplication can be performed with only three real multiplies, the cost of the extra additions, subtractions, and shuffles in the SSE2 instruction set would overwhelm that strategy at a register width of 128 bits. We are pretty much never going to do better than one four-element multiply per complex multiplication.
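As an aside: once SSE3 is available, _mm_addsub_ps folds that horrible negative into the add itself, since it subtracts in the even lanes and adds in the odd ones. A hedged sketch of the well-known SSE3 pattern (the function name is mine; this is separate from the SSE2 discussion here):

```cpp
#include <pmmintrin.h> // SSE3

#if defined(__GNUC__) && !defined(__SSE3__)
__attribute__((target("sse3"))) // allow SSE3 intrinsics without -msse3
#endif
__m128 complexMultiplySSE3(__m128 ari, __m128 bri) // [Ar, Ai, x, x] * [Br, Bi, x, x]
{
    __m128 Br_Br = _mm_moveldup_ps(bri); // Br, Br, ...
    __m128 Bi_Bi = _mm_movehdup_ps(bri); // Bi, Bi, ...
    __m128 Ai_Ar = _mm_shuffle_ps(ari, ari, _MM_SHUFFLE(2, 3, 0, 1)); // Ai, Ar, ...
    // Even lane subtracts, odd lane adds: ArBr - AiBi, AiBr + ArBi, ...
    return _mm_addsub_ps(_mm_mul_ps(ari, Br_Br), _mm_mul_ps(Ai_Ar, Bi_Bi));
}
```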
Good news: Within the SSE2 instruction set we can, however, spread the overhead cost in a couple of ways. One way is to factor out the A shuffle and xor by multiplying a single complex number against a multitude of other complex numbers.
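A hedged sketch of that factoring (the function names and the interleaved re/im memory layout are my own assumptions): do the A-side shuffle and sign flip once, after which each B costs only one shuffle, one multiply, one movehl, and one add.

```cpp
#include <xmmintrin.h>
#include <cstddef>

// One-time work: Ar, Ai, -Ai, Ar (xor with -0.0f flips the sign of lane 2).
inline __m128 prepareA(__m128 ari)
{
    __m128 Ar_Ai_Ai_Ar = _mm_shuffle_ps(ari, ari, _MM_SHUFFLE(0, 1, 1, 0));
    return _mm_xor_ps(Ar_Ai_Ai_Ar, _mm_setr_ps(0.0f, 0.0f, -0.0f, 0.0f));
}

// Per-B work only: one shuffle, one multiply, one movehl, one add.
inline __m128 mulPrepared(__m128 aPrep, __m128 bri)
{
    __m128 Br_Br_Bi_Bi = _mm_shuffle_ps(bri, bri, _MM_SHUFFLE(1, 1, 0, 0));
    __m128 mMul = _mm_mul_ps(Br_Br_Bi_Bi, aPrep);
    return _mm_add_ps(mMul, _mm_movehl_ps(mMul, mMul)); // re, im, x, x
}

// Multiply one complex A against a batch of interleaved re/im pairs.
void complexScale(const float *b, float *out, std::size_t count, __m128 ari)
{
    __m128 aPrep = prepareA(ari);
    for (std::size_t i = 0; i < count; ++i) {
        __m128 bri = _mm_setr_ps(b[2 * i], b[2 * i + 1], 0.0f, 0.0f);
        __m128 r = mulPrepared(aPrep, bri);
        _mm_storel_pi(reinterpret_cast<__m64 *>(out + 2 * i), r);
    }
}
```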
Good news if we can make it work: Another way to address this problem is to spread the negation over several multiplications. There are already enough shuffles floating around to make it possible to negate several complex multiplications in a single xor.
We do lose a cycle of bypass delay for each transition between the float and integer domains. The observant reader will note three of these in the above code. But I think it's worth it.
We will address performing multiple complex multiplications concurrently in a future post.
Yup. This is a cliffhanger. I'm betting that it is worth a couple machine cycles to read the followup. :)