02 Mar 2012, 17:55 UTC
IIR filters seem impossible to convert to SIMD code, since they have a tight loop containing an immediate dependency. This article will show you that it's not impossible, given the right maths.
The following code is adapted from ITU G729C's post_process() function:
// float xm2 = signal[-2]
// float xm1 = signal[-1]
// float ym2 = output[-2]
// float ym1 = output[-1]
// these come from a state block
for(i=0; i < L_FRAME; i++)
{
// new input sample
float x0 = signal[i];
// calculate output sample
float y0 = ym1*a1 + ym2*a2 + x0*b0 + xm1*b1 + xm2*b2;
// write output sample
output[i] = y0;
// update recurrence
ym2 = ym1;
ym1 = y0;
xm2 = xm1;
xm1 = x0;
}
// write xm1,xm2,ym1,ym2 to state block
Note that y0
when i=1 (let's notate this y[1]
) depends on the value of y0
when i=0
(y[0]
), which is carried into the loop iteration in ym1
. So it doesn't seem as if you
can calculate y[0]
and y[1]
simultaneously. Hence, the code does not look like it
can be SIMDified.
However, the code is based on linear maths. Let's ignoring floating-point errors for a moment, and see how we can SIMDify this code.
First, note that x
* and y
* just refer to different places in the input and output streams.
That is xm1[i]
= x[i-1]
and xm2[i]
= x[i-2]
. Similarly, ym1[i]
= y[i-1]
and ym2[i]
= y[i-2]
.
With this in mind we can write an equation for y[i]
in terms of just x[]
and y[]
as follows:
y[i] = y[i-1]*a1 + y[i-2]*a2 + x[i]*b0 + x[i-1]*b1 + x[i-2]*b2
For SIMDification we want to calculate y[i]
,y[i+1]
,y[i+2]
and y[i+3]
at once (for 4-way SIMD).
Let's write these all out longhand now:
y[i] = y[i-1]*a1 + y[i-2]*a2 + x[i]*b0 + x[i-1]*b1 + x[i-2]*b2
y[i+1] = y[i]*a1 + y[i-1]*a2 + x[i+1]*b0 + x[i]*b1 + x[i-1]*b2
y[i+2] = y[i+1]*a1 + y[i]*a2 + x[i+2]*b0 + x[i+1]*b1 + x[i]*b2
y[i+3] = y[i+2]*a1 + y[i+1]*a2 + x[i+3]*b0 + x[i+2]*b1 + x[i+1]*b2
Now we can write this out as a matrix:
x[i+3] x[i+2] x[i+1] x[i] x[i-1] x[i-2] y[i+2] y[i+1] y[i] y[i-1] y[i-2]
y[i] = 0 0 0 b0 b1 b2 0 0 0 a1 a2
y[i+1] = 0 0 b0 b1 b2 0 0 0 a1 a2 0
y[i+2] = 0 b0 b1 b2 0 0 0 a1 a2 0 0
y[i+3] = b0 b1 b2 0 0 0 a1 a2 0 0 0
To calculate y[i...i+3]
we need to essentially "solve" this matrix. To do this, we perform
a technique which is really just Gaussian elimination. Pull out the columns of y[i+2]
,y[i+1]
and y[i]
, which
are not known in advance, as follows:
x[i+3] x[i+2] x[i+1] x[i] x[i-1] x[i-2] y[i-1] y[i-2]
y[i] = 0 0 0 b0 b1 b2 a1 a2
y[i+1] = 0 0 b0 b1 b2 0 a2 0 + a1*y[i]
y[i+2] = 0 b0 b1 b2 0 0 0 0 + a1*y[i+1] + a1*y[i]
y[i+3] = b0 b1 b2 0 0 0 0 0 + a1*y[i+2] + a2*y[i+1]
Now, since we know the coefficients of y[i]
only using the input data, we can calculate coefficients of y[i+1]
which only use
the input data, by adding a1
* row(y[i]
) to row(y[i+1]
). By proceeding to y[i+2]
and y[i+3]
we can eliminate all the
dependence on y values we don't yet have.
I won't write the solved matrix here - it is easiest to do this numerically using code like the following:
double coeffs[4][8] =
{
{ 0, 0, 0, b0, b1, b2, a1, a2 },
{ 0, 0, b0, b1, b2, 0, a2, 0 },
{ 0, b0, b1, b2, 0, 0, 0, 0 },
{ b0, b1, b2, 0, 0, 0, 0, 0 },
};
for (int ii = 0; ii < 8; ii++)
{
// add a1*row[0] to row[1]
coeffs[1][ii] += a1 * coeffs[0][ii];
// add a1*row[1] + a2*row[0] to row 2
coeffs[2][ii] += a1 * coeffs[1][ii] + a2 * coeffs[0][ii];
// add a1*row[2] + a2*row[1] to row 3
coeffs[3][ii] += a1 * coeffs[2][ii] + a2 * coeffs[1][ii];
}
The original coefficients are floats, from working code. We use double precision for these calculations so that the final results are correct to 1ulp in floating-point precision.
Now we have this coefficient matrix, we can generate SIMD float coefficients. Each is a coefficient of a single input value, so we just transpose the array as follows:
vector float coeff_xp3 = { coeffs[0][0], coeffs[1][0], coeffs[2][0], coeffs[3][0] };
vector float coeff_xp2 = { coeffs[0][1], coeffs[1][1], coeffs[2][1], coeffs[3][1] };
vector float coeff_xp1 = { coeffs[0][2], coeffs[1][2], coeffs[2][2], coeffs[3][2] };
vector float coeff_x0 = { coeffs[0][3], coeffs[1][3], coeffs[2][3], coeffs[3][3] };
vector float coeff_xm1 = { coeffs[0][4], coeffs[1][4], coeffs[2][4], coeffs[3][4] };
vector float coeff_xm2 = { coeffs[0][5], coeffs[1][5], coeffs[2][5], coeffs[3][5] };
vector float coeff_ym1 = { coeffs[0][6], coeffs[1][6], coeffs[2][6], coeffs[3][6] };
vector float coeff_ym2 = { coeffs[0][7], coeffs[1][7], coeffs[2][7], coeffs[3][7] };
These are now the coefficients of a SIMD IIR, which has the same structure as the scalar IIR but uses SIMD code and calculates four outputs at once. The code for this looks something like:
// convert from scalar state block
vector float value_xm2 = broadcast(x2);
vector float value_xm1 = broadcast(x1);
vector float value_ym2 = broadcast(y2);
vector float value_ym1 = broadcast(y1);
for (int ii = 0; ii < L_FRAME/4; ii++)
{
// new input samples
vector float x0123 = *ptr++; // load four floats
vector float value_x0 = broadcast(x0123[0]);
vector float value_xp1 = broadcast(x0123[1]);
vector float value_xp2 = broadcast(x0123[2]);
vector float value_xp3 = broadcast(x0123[3]);
// calculate output samples
vector float y0123 = 0;
y0123 += coeff_xp3 * value_xp3;
y0123 += coeff_xp2 * value_xp2;
y0123 += coeff_xp1 * value_xp1;
y0123 += coeff_x0 * value_x0;
y0123 += coeff_xm1 * value_xm1;
y0123 += coeff_xm2 * value_xm2;
y0123 += coeff_ym1 * value_ym1;
y0123 += coeff_ym2 * value_ym2;
// write output samples
*out++ = y0123;
// update recurrence
value_xm2 = value_xp2;
value_xm1 = value_xp3;
value_ym2 = broadcast(y0123[2]);
value_ym1 = broadcast(y0123[3]);
}
// convert to scalar state block
x2 = value_xm2[0];
x1 = value_xm1[0];
y2 = value_ym2[0];
y1 = value_ym1[0];
Note that the state block for the IIR is unchanged - it still holds four scalar values.
Although this code is correct it can be made faster by removing the tight recurrences from the calculation of
y0123
. I use this formulation:
vector float y0123a = coeff_xp3 * value_xp3
+ coeff_xp2 * value_xp2
+ coeff_xp1 * value_xp1
+ coeff_x0 * value_x0;
vector float y0123b = coeff_xm1 * value_xm1
+ coeff_xm2 * value_xm2
+ coeff_ym1 * value_ym1
+ coeff_ym2 * value_ym2;
vector float y0123 = y0123a + y0123b;
You can experiment with various numbers of parallel branches, although more parallel branches means more addition at the end, so overall speed may decrease.
If the compiler generates (or you use intrinsics for) fused multiply-add, numerical precision (and speed) is increased.
The stability of the resultant IIRs is not guaranteed - however in practice, using double-precision to calculate
the coefficients, and using fused multiply-add to generate y0123
, gives results which are very close to the results of the original IIR. In
some cases the results can be better (more accurate). More study is needed to figure out if the SIMD IIR can be unstable when the scalar IIR
is not.
The following properties of the coefficients should be noted:
So for constant filter coefficients, we can precompute the SIMD coefficients once for all time. For variable filter coefficients, but fixed structure, we add some code per frame to generate SIMD coefficients. For variable filter structure we must write a more general matrix solver.
Yale Zhang commented:
Eddie, that's a clever SIMDization strategy and a good, clear writeup. I only thought of the naive method, which is to compute 1 output/iteration, but the problem would be that if the filter size < SIMDwidth, then there's waste and that the sum calculation can't be completely parallelized. Clearly, your SIMDization method works for any filter size. But the downside is that it's trading more parallelism for more redundancy since in the scalar code, you only need 5 multiplies and 4 adds per element, while in the SIMD version, there's 32/4 multiplies and 32/4 adds per element. Fair enough.
I'm interested in speeding up Inkscape's Gaussian blurs which are really slowing down my workflow. It's implemented as a 2D separable filter, meaning 2 IIR passes. Since there are 2 dimensions to work with, I will 1st try to apply SIMD to the outer dimension for which every element is independent. The cost will be that I need to transpose the input for 1 of the passes and to be cache friendly, I will need to do it a block at a time.
on 10 Mar 2013, 12:49 UTC