 # SIMD Optimization of IIR Filters

02 Mar 2012, 17:55 UTC

IIR filters seem impossible to convert to SIMD code, since they have a tight loop containing an immediate dependency. This article will show you that it's not impossible, given the right maths.

The following code is adapted from ITU G729C's post_process() function:

``````// float xm2 = signal[-2]
// float xm1 = signal[-1]
// float ym2 = output[-2]
// float ym1 = output[-1]
// these come from a state block
for(i=0; i < L_FRAME; i++)
{
// new input sample
float x0 = signal[i];

// calculate output sample
float y0 = ym1*a1 + ym2*a2 + x0*b0 + xm1*b1 + xm2*b2;

// write output sample
output[i] = y0;

// update recurrence
ym2 = ym1;
ym1 = y0;
xm2 = xm1;
xm1 = x0;
}
// write xm1,xm2,ym1,ym2 to state block
``````

Note that `y0` when i=1 (let's notate this `y`) depends on the value of `y0` when i=0 (`y`), which is carried into the loop iteration in `ym1`. So it doesn't seem as if you can calculate `y` and `y` simultaneously. Hence, the code does not look like it can be SIMDified.

However, the code is based on linear maths. Let's ignoring floating-point errors for a moment, and see how we can SIMDify this code.

First, note that `x`* and `y`* just refer to different places in the input and output streams. That is `xm1[i]` = `x[i-1]` and `xm2[i]` = `x[i-2]`. Similarly, `ym1[i]` = `y[i-1]` and `ym2[i]` = `y[i-2]`. With this in mind we can write an equation for `y[i]` in terms of just `x[]` and `y[]` as follows:

``````y[i] = y[i-1]*a1 + y[i-2]*a2 + x[i]*b0 + x[i-1]*b1 + x[i-2]*b2
``````

For SIMDification we want to calculate `y[i]`,`y[i+1]`,`y[i+2]` and `y[i+3]` at once (for 4-way SIMD). Let's write these all out longhand now:

``````y[i] = y[i-1]*a1 + y[i-2]*a2 + x[i]*b0 + x[i-1]*b1 + x[i-2]*b2
y[i+1] = y[i]*a1 + y[i-1]*a2 + x[i+1]*b0 + x[i]*b1 + x[i-1]*b2
y[i+2] = y[i+1]*a1 + y[i]*a2 + x[i+2]*b0 + x[i+1]*b1 + x[i]*b2
y[i+3] = y[i+2]*a1 + y[i+1]*a2 + x[i+3]*b0 + x[i+2]*b1 + x[i+1]*b2
``````

Now we can write this out as a matrix:

``````          x[i+3] x[i+2] x[i+1] x[i] x[i-1] x[i-2] y[i+2] y[i+1] y[i] y[i-1] y[i-2]
y[i]   =    0      0      0     b0    b1     b2     0      0     0     a1     a2
y[i+1] =    0      0      b0    b1    b2     0      0      0     a1    a2     0
y[i+2] =    0      b0     b1    b2    0      0      0      a1    a2    0      0
y[i+3] =    b0     b1     b2    0     0      0      a1     a2    0     0      0
``````

To calculate `y[i...i+3]` we need to essentially "solve" this matrix. To do this, we perform a technique which is really just Gaussian elimination. Pull out the columns of `y[i+2]`,`y[i+1]` and `y[i]`, which are not known in advance, as follows:

``````          x[i+3] x[i+2] x[i+1] x[i] x[i-1] x[i-2] y[i-1] y[i-2]
y[i]   =    0      0      0     b0    b1     b2     a1     a2
y[i+1] =    0      0      b0    b1    b2     0      a2     0	+ a1*y[i]
y[i+2] =    0      b0     b1    b2    0      0      0      0	+ a1*y[i+1] + a1*y[i]
y[i+3] =    b0     b1     b2    0     0      0      0      0	+ a1*y[i+2] + a2*y[i+1]
``````

Now, since we know the coefficients of `y[i]` only using the input data, we can calculate coefficients of `y[i+1]` which only use the input data, by adding `a1` * row(`y[i]`) to row(`y[i+1]`). By proceeding to `y[i+2]` and `y[i+3]` we can eliminate all the dependence on y values we don't yet have.

I won't write the solved matrix here - it is easiest to do this numerically using code like the following:

``````double	coeffs =
{
{ 0,  0,  0,  b0, b1, b2, a1, a2 },
{ 0,  0,  b0, b1, b2, 0,  a2, 0  },
{ 0,  b0, b1, b2, 0,  0,  0,  0  },
{ b0, b1, b2, 0,  0,  0,  0,  0  },
};

for (int ii = 0; ii < 8; ii++)
{
// add a1*row to row
coeffs[ii] += a1 * coeffs[ii];
// add a1*row + a2*row to row 2
coeffs[ii] += a1 * coeffs[ii] + a2 * coeffs[ii];
// add a1*row + a2*row to row 3
coeffs[ii] += a1 * coeffs[ii] + a2 * coeffs[ii];
}
``````

The original coefficients are floats, from working code. We use double precision for these calculations so that the final results are correct to 1ulp in floating-point precision.

Now we have this coefficient matrix, we can generate SIMD float coefficients. Each is a coefficient of a single input value, so we just transpose the array as follows:

``````vector float	coeff_xp3 = { coeffs, coeffs, coeffs, coeffs };
vector float	coeff_xp2 = { coeffs, coeffs, coeffs, coeffs };
vector float	coeff_xp1 = { coeffs, coeffs, coeffs, coeffs };
vector float	coeff_x0  = { coeffs, coeffs, coeffs, coeffs };
vector float	coeff_xm1 = { coeffs, coeffs, coeffs, coeffs };
vector float	coeff_xm2 = { coeffs, coeffs, coeffs, coeffs };
vector float	coeff_ym1 = { coeffs, coeffs, coeffs, coeffs };
vector float	coeff_ym2 = { coeffs, coeffs, coeffs, coeffs };
``````

These are now the coefficients of a SIMD IIR, which has the same structure as the scalar IIR but uses SIMD code and calculates four outputs at once. The code for this looks something like:

``````// convert from scalar state block
vector float	value_xm2 = broadcast(x2);
vector float	value_xm1 = broadcast(x1);
vector float	value_ym2 = broadcast(y2);
vector float	value_ym1 = broadcast(y1);

for (int ii = 0; ii < L_FRAME/4; ii++)
{
// new input samples
vector float	x0123 = *ptr++;	// load four floats
vector float	value_x0 = broadcast(x0123);
vector float	value_xp1 = broadcast(x0123);
vector float	value_xp2 = broadcast(x0123);
vector float	value_xp3 = broadcast(x0123);

// calculate output samples
vector float	y0123 = 0;
y0123 += coeff_xp3 * value_xp3;
y0123 += coeff_xp2 * value_xp2;
y0123 += coeff_xp1 * value_xp1;
y0123 += coeff_x0 * value_x0;
y0123 += coeff_xm1 * value_xm1;
y0123 += coeff_xm2 * value_xm2;
y0123 += coeff_ym1 * value_ym1;
y0123 += coeff_ym2 * value_ym2;

// write output samples
*out++ = y0123;

// update recurrence
value_xm2 = value_xp2;
value_xm1 = value_xp3;
}

// convert to scalar state block
x2 = value_xm2;
x1 = value_xm1;
y2 = value_ym2;
y1 = value_ym1;
``````

Note that the state block for the IIR is unchanged - it still holds four scalar values.

Although this code is correct it can be made faster by removing the tight recurrences from the calculation of `y0123`. I use this formulation:

``````vector float  y0123a = coeff_xp3 * value_xp3
+ coeff_xp2 * value_xp2
+ coeff_xp1 * value_xp1
+ coeff_x0 * value_x0;

vector float  y0123b = coeff_xm1 * value_xm1
+ coeff_xm2 * value_xm2
+ coeff_ym1 * value_ym1
+ coeff_ym2 * value_ym2;

vector float  y0123 = y0123a + y0123b;
``````

You can experiment with various numbers of parallel branches, although more parallel branches means more addition at the end, so overall speed may decrease.

If the compiler generates (or you use intrinsics for) fused multiply-add, numerical precision (and speed) is increased.

The stability of the resultant IIRs is not guaranteed - however in practice, using double-precision to calculate the coefficients, and using fused multiply-add to generate `y0123`, gives results which are very close to the results of the original IIR. In some cases the results can be better (more accurate). More study is needed to figure out if the SIMD IIR can be unstable when the scalar IIR is not.

The following properties of the coefficients should be noted:

1. The coefficients only change if the filter coefficients change
2. The structure of the code to calculate the coefficients only changes if the filter structure changes

So for constant filter coefficients, we can precompute the SIMD coefficients once for all time. For variable filter coefficients, but fixed structure, we add some code per frame to generate SIMD coefficients. For variable filter structure we must write a more general matrix solver.