Reanimator Ltd

High-performance coding by Eddie Edwards

SIMD Optimization of IIR Filters

02 Mar 2012, 17:55 UTC

IIR filters seem impossible to convert to SIMD code, since they have a tight loop containing an immediate dependency. This article will show you that it's not impossible, given the right maths.

The following code is adapted from ITU G729C's post_process() function:

// float xm2 = signal[-2]
// float xm1 = signal[-1]
// float ym2 = output[-2]
// float ym1 = output[-1]
// these come from a state block
for(i=0; i < L_FRAME; i++)
	// new input sample
	float x0 = signal[i];

	// calculate output sample
	float y0 = ym1*a1 + ym2*a2 + x0*b0 + xm1*b1 + xm2*b2;

	// write output sample
	output[i] = y0;

	// update recurrence
	ym2 = ym1;
	ym1 = y0;
	xm2 = xm1;
	xm1 = x0;
// write xm1,xm2,ym1,ym2 to state block

Note that y0 when i=1 (let's notate this y[1]) depends on the value of y0 when i=0 (y[0]), which is carried into the loop iteration in ym1. So it doesn't seem as if you can calculate y[0] and y[1] simultaneously. Hence, the code does not look like it can be SIMDified.

However, the code is based on linear maths. Let's ignoring floating-point errors for a moment, and see how we can SIMDify this code.

First, note that x* and y* just refer to different places in the input and output streams. That is xm1[i] = x[i-1] and xm2[i] = x[i-2]. Similarly, ym1[i] = y[i-1] and ym2[i] = y[i-2]. With this in mind we can write an equation for y[i] in terms of just x[] and y[] as follows:

y[i] = y[i-1]*a1 + y[i-2]*a2 + x[i]*b0 + x[i-1]*b1 + x[i-2]*b2

For SIMDification we want to calculate y[i],y[i+1],y[i+2] and y[i+3] at once (for 4-way SIMD). Let's write these all out longhand now:

y[i] = y[i-1]*a1 + y[i-2]*a2 + x[i]*b0 + x[i-1]*b1 + x[i-2]*b2
y[i+1] = y[i]*a1 + y[i-1]*a2 + x[i+1]*b0 + x[i]*b1 + x[i-1]*b2
y[i+2] = y[i+1]*a1 + y[i]*a2 + x[i+2]*b0 + x[i+1]*b1 + x[i]*b2
y[i+3] = y[i+2]*a1 + y[i+1]*a2 + x[i+3]*b0 + x[i+2]*b1 + x[i+1]*b2

Now we can write this out as a matrix:

          x[i+3] x[i+2] x[i+1] x[i] x[i-1] x[i-2] y[i+2] y[i+1] y[i] y[i-1] y[i-2]
y[i]   =    0      0      0     b0    b1     b2     0      0     0     a1     a2
y[i+1] =    0      0      b0    b1    b2     0      0      0     a1    a2     0
y[i+2] =    0      b0     b1    b2    0      0      0      a1    a2    0      0
y[i+3] =    b0     b1     b2    0     0      0      a1     a2    0     0      0

To calculate y[i...i+3] we need to essentially "solve" this matrix. To do this, we perform a technique which is really just Gaussian elimination. Pull out the columns of y[i+2],y[i+1] and y[i], which are not known in advance, as follows:

          x[i+3] x[i+2] x[i+1] x[i] x[i-1] x[i-2] y[i-1] y[i-2]
y[i]   =    0      0      0     b0    b1     b2     a1     a2
y[i+1] =    0      0      b0    b1    b2     0      a2     0	+ a1*y[i]
y[i+2] =    0      b0     b1    b2    0      0      0      0	+ a1*y[i+1] + a1*y[i]
y[i+3] =    b0     b1     b2    0     0      0      0      0	+ a1*y[i+2] + a2*y[i+1]

Now, since we know the coefficients of y[i] only using the input data, we can calculate coefficients of y[i+1] which only use the input data, by adding a1 * row(y[i]) to row(y[i+1]). By proceeding to y[i+2] and y[i+3] we can eliminate all the dependence on y values we don't yet have.

I won't write the solved matrix here - it is easiest to do this numerically using code like the following:

double	coeffs[4][8] =
	{ 0,  0,  0,  b0, b1, b2, a1, a2 },
	{ 0,  0,  b0, b1, b2, 0,  a2, 0  },
	{ 0,  b0, b1, b2, 0,  0,  0,  0  },
	{ b0, b1, b2, 0,  0,  0,  0,  0  },

for (int ii = 0; ii < 8; ii++)
	// add a1*row[0] to row[1]
	coeffs[1][ii] += a1 * coeffs[0][ii];
	// add a1*row[1] + a2*row[0] to row 2
	coeffs[2][ii] += a1 * coeffs[1][ii] + a2 * coeffs[0][ii];
	// add a1*row[2] + a2*row[1] to row 3
	coeffs[3][ii] += a1 * coeffs[2][ii] + a2 * coeffs[1][ii];

The original coefficients are floats, from working code. We use double precision for these calculations so that the final results are correct to 1ulp in floating-point precision.

Now we have this coefficient matrix, we can generate SIMD float coefficients. Each is a coefficient of a single input value, so we just transpose the array as follows:

vector float	coeff_xp3 = { coeffs[0][0], coeffs[1][0], coeffs[2][0], coeffs[3][0] };
vector float	coeff_xp2 = { coeffs[0][1], coeffs[1][1], coeffs[2][1], coeffs[3][1] };
vector float	coeff_xp1 = { coeffs[0][2], coeffs[1][2], coeffs[2][2], coeffs[3][2] };
vector float	coeff_x0  = { coeffs[0][3], coeffs[1][3], coeffs[2][3], coeffs[3][3] };
vector float	coeff_xm1 = { coeffs[0][4], coeffs[1][4], coeffs[2][4], coeffs[3][4] };
vector float	coeff_xm2 = { coeffs[0][5], coeffs[1][5], coeffs[2][5], coeffs[3][5] };
vector float	coeff_ym1 = { coeffs[0][6], coeffs[1][6], coeffs[2][6], coeffs[3][6] };
vector float	coeff_ym2 = { coeffs[0][7], coeffs[1][7], coeffs[2][7], coeffs[3][7] };

These are now the coefficients of a SIMD IIR, which has the same structure as the scalar IIR but uses SIMD code and calculates four outputs at once. The code for this looks something like:

// convert from scalar state block
vector float	value_xm2 = broadcast(x2);
vector float	value_xm1 = broadcast(x1);
vector float	value_ym2 = broadcast(y2);
vector float	value_ym1 = broadcast(y1);

for (int ii = 0; ii < L_FRAME/4; ii++)
	// new input samples
	vector float	x0123 = *ptr++;	// load four floats
	vector float	value_x0 = broadcast(x0123[0]);
	vector float	value_xp1 = broadcast(x0123[1]);
	vector float	value_xp2 = broadcast(x0123[2]);
	vector float	value_xp3 = broadcast(x0123[3]);

	// calculate output samples
	vector float	y0123 = 0;
	y0123 += coeff_xp3 * value_xp3;
	y0123 += coeff_xp2 * value_xp2;
	y0123 += coeff_xp1 * value_xp1;
	y0123 += coeff_x0 * value_x0;
	y0123 += coeff_xm1 * value_xm1;
	y0123 += coeff_xm2 * value_xm2;
	y0123 += coeff_ym1 * value_ym1;
	y0123 += coeff_ym2 * value_ym2;

	// write output samples
	*out++ = y0123;

	// update recurrence
	value_xm2 = value_xp2;
	value_xm1 = value_xp3;
	value_ym2 = broadcast(y0123[2]);
	value_ym1 = broadcast(y0123[3]);

// convert to scalar state block
x2 = value_xm2[0];
x1 = value_xm1[0];
y2 = value_ym2[0];
y1 = value_ym1[0];

Note that the state block for the IIR is unchanged - it still holds four scalar values.

Although this code is correct it can be made faster by removing the tight recurrences from the calculation of y0123. I use this formulation:

vector float  y0123a = coeff_xp3 * value_xp3
                     + coeff_xp2 * value_xp2
                     + coeff_xp1 * value_xp1
                     + coeff_x0 * value_x0;

vector float  y0123b = coeff_xm1 * value_xm1
                     + coeff_xm2 * value_xm2
                     + coeff_ym1 * value_ym1
                     + coeff_ym2 * value_ym2;

vector float  y0123 = y0123a + y0123b;

You can experiment with various numbers of parallel branches, although more parallel branches means more addition at the end, so overall speed may decrease.

If the compiler generates (or you use intrinsics for) fused multiply-add, numerical precision (and speed) is increased.

The stability of the resultant IIRs is not guaranteed - however in practice, using double-precision to calculate the coefficients, and using fused multiply-add to generate y0123, gives results which are very close to the results of the original IIR. In some cases the results can be better (more accurate). More study is needed to figure out if the SIMD IIR can be unstable when the scalar IIR is not.

The following properties of the coefficients should be noted:

  1. The coefficients only change if the filter coefficients change
  2. The structure of the code to calculate the coefficients only changes if the filter structure changes

So for constant filter coefficients, we can precompute the SIMD coefficients once for all time. For variable filter coefficients, but fixed structure, we add some code per frame to generate SIMD coefficients. For variable filter structure we must write a more general matrix solver.

New comments are disabled for this page

Yale Zhang commented:

Eddie, that's a clever SIMDization strategy and a good, clear writeup. I only thought of the naive method, which is to compute 1 output/iteration, but the problem would be that if the filter size < SIMDwidth, then there's waste and that the sum calculation can't be completely parallelized. Clearly, your SIMDization method works for any filter size. But the downside is that it's trading more parallelism for more redundancy since in the scalar code, you only need 5 multiplies and 4 adds per element, while in the SIMD version, there's 32/4 multiplies and 32/4 adds per element. Fair enough.

I'm interested in speeding up Inkscape's Gaussian blurs which are really slowing down my workflow. It's implemented as a 2D separable filter, meaning 2 IIR passes. Since there are 2 dimensions to work with, I will 1st try to apply SIMD to the outer dimension for which every element is independent. The cost will be that I need to transpose the input for 1 of the passes and to be cache friendly, I will need to do it a block at a time.

on 10 Mar 2013, 12:49 UTC