10 May 2012, 16:31 UTC
Recently, I had a bug which only manifested around half the time. It was, perhaps unsurprisingly, a threading bug. It can be hard to tell what causes threading bugs since the "fault" often occurs in a different thread to the "symptom".
A good fallback if you can't find a bug is to use binary search on version control to isolate the version in which the bug was introduced. But this fails for non-deterministic bugs: if a bugged version happens not to crash, you step forward (possibly by a large step), moving further away from the true answer with no way back. Eventually you reach a version that, according to your search so far, should not have the bug, but it crashes anyway (or vice versa), and you have to start again.
To combat this it's possible to run each version, say, 4 times. If the bug manifests on half of all runs, this gives you a 1/16 chance of a "false negative" - that is, 4 runs without a crash when the bug is in fact present. Over the 5 or 6 bisection steps of a typical search, though, this 1/16 chance compounds to around 1/3, and is likely to trip you up. So you could run each version 8 times instead, giving a 1/256 chance of a false negative per version. This works better, but it's quite inefficient, especially if the test takes a while to execute.
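As a quick sanity check on those numbers, here is a throwaway calculation (assuming, as in my case, the bug manifests on half of all runs):

#include <math.h>
#include <stdio.h>

int main(void)
{
    // Chance that a bugged version survives 4 runs without crashing,
    // when the bug manifests with probability 0.5 on each run:
    double falseNegative = pow(0.5, 4);               // 1/16 = 0.0625
    // Chance of at least one false negative over 6 bisection steps:
    double tripped = 1 - pow(1 - falseNegative, 6);   // ~0.32, about 1/3
    printf("per-version false negative: %f\n", falseNegative);
    printf("chance of being tripped up in 6 steps: %f\n", tripped);
    return 0;
}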
So what is the best way to approach this problem?
After some discussions on Facebook (thanks Dave & Sergey!) I hit upon the idea of using Bayesian inference to zero in on the correct revision. The idea was something like this:
- keep a table giving, for each version, the probability that it is the one that introduced the bug;
- pick a version to test, run the test, and use Bayes' Theorem to update every entry in the table according to the result;
- repeat until one version is almost certainly the culprit.
It turns out that this approach works very well, and you get a remarkably straightforward algorithm.
There are two difficult steps in this idea - choosing a version to test, and working out the revised probabilities given the results of a specific test.
Let's do the probability analysis first.
In elementary probability we were taught that, if A and B are independent random variables, then p(A & B) = p(A) * p(B).
Conditional probability deals with the case where A and B are not independent.
For instance, if A is the random variable that represents the program crashing (so A is true if the program crashes and false if the program does not crash) and B is the random variable that represents whether or not the program has a crash bug in it (so B is true if the program contains a bug, and false otherwise), then A and B are not independent. Specifically, if B is false then A is false, since the program will not crash if it does not contain a bug. So p(A & B) != p(A) * p(B).
To deal with this we can introduce the notion of conditional probability: for instance, what is the probability that the program crashes, given that the bug is in fact present? We can answer one such question immediately - the probability that the program crashes given that the bug is not present is zero. We can write this as follows:
p(A | !B) = 0
Conditional probability is calculated as follows:
p(A | B) = p(A & B) / p(B)
We can justify this formula using a truth table, listing all possible combinations of values for A and B:
A B
0 0 p(!A & !B)
0 1 p(!A & B)
1 0 p(A & !B)
1 1 p(A & B)
Note that the probabilities on the right sum to 1, since one of the combinations must occur.
If we know B occurs then we can scratch out all entries with B=0:
A B
0 1 p(!A & B)
1 1 p(A & B)
But one of these events must happen, so the probabilities must sum to 1. Currently, they sum to p(!A & B) + p(A & B) = p(B). So to renormalize our truth table we should divide by p(B):
A B
0 1 p(!A & B) / p(B)
1 1 p(A & B) / p(B)
Hence,
p(A | B) = p(A & B) / p(B)
p(!A | B) = p(!A & B) / p(B)
Conditional probability is all about knowledge. Without knowing that B occurred, you only have p(A) that A will occur. But if you know B occurred then you can say that A is now more or less likely as a result. You have more knowledge, although you might now have less certainty about A.
Notice that if A & B are independent, then p(A & B) = p(A) * p(B), so:
p(A | B) = p(A) * p(B) / p(B) = p(A)
Which states that for independent random variables, the probability of A is unaffected by the fact that B occurred - which is what we'd expect.
The nice thing about conditional probability is you can turn questions on their head. Say you know how likely A is to occur given that B occurred, and you know how likely A and B are generally, then you can work out how likely B is to have occurred given that you saw A occur.
For instance, suppose A is "program does not crash" and B is "program contains bug". By running a program containing the bug many times you can figure out p(A | B), which is a measure of how frequently the bug manifests when it's present. Then, using Bayes' Theorem, you can figure out p(B | A) which is the likelihood that the program contains the bug given that the program did not crash.
Thus, you can go from knowing something about what measurements to expect under different circumstances, to knowing what different circumstances can be inferred from different measurements. This is the core concept of Bayesian inference.
Bayes' Theorem is easy to prove:
p(A | B) = p(A & B) / p(B)
p(B | A) = p(A & B) / p(A)
therefore,
p(B | A) = p(A | B) * p(B) / p(A)
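To make this concrete (with numbers chosen purely for illustration): let A be "the program does not crash" and B be "the program contains the bug". Suppose p(B) = 0.5, the bug manifests on half of all runs so p(A | B) = 0.5, and a bug-free program never crashes so p(A | !B) = 1. Then:

p(A) = p(A | B) * p(B) + p(A | !B) * p(!B) = 0.5 * 0.5 + 1 * 0.5 = 0.75
p(B | A) = p(A | B) * p(B) / p(A) = 0.5 * 0.5 / 0.75 = 1/3

A single clean run has dropped the probability that the bug is present from 1/2 to 1/3.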
Note that the theorem requires the input of p(A) and p(B). In our example, these are the likelihood that the program does not crash, and the likelihood that a specific version contains the bug.
OK, so if an event A occurs we can calculate p(B | A). What significance should we attach to this value?
Well, before A occurred, we had a likelihood for B of p(B). Now A has occurred, we have a likelihood for B of p(B | A). So, we can roll this new information (the observation of A) into our current estimate of p(B). That is, we set p(B) = p(B | A) in code. This is Bayesian refinement - we are refining our beliefs about p(B) using our observation of A. If we like Latin we might say that the original value of p(B) was the prior probability and the updated value of p(B) is the posterior or post-hoc probability. Then, by making further observations, we can refine p(B) until we are pretty sure we know whether or not B is in fact true. We can never be absolutely certain, but we can require that p(B) reach any desired level before we accept the hypothesis that B is true (or false, if p(B) becomes almost zero).
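Continuing the numeric example above: after one clean run we set p(B) = 1/3. A second clean run then gives p(A) = 0.5 * (1/3) + 1 * (2/3) = 5/6, so p(B | A) = (0.5 * 1/3) / (5/6) = 1/5. Each successive clean run pushes our belief in the bug further down - 1/2, 1/3, 1/5, 1/9, ... - without ever quite reaching zero.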
But how do we calculate p(B) to begin with? Well, given no prior information, we should just set this so that it's equally likely that each version introduced the bug. Or, we can use prior information - for instance, we could attach a low probability for a crash bug to some Javascript code written by Donald Knuth, and a high probability to some multi-threaded assembly code written by a 12 year-old who only learned to program last week.
Let's define a formal model for the situation we're dealing with in this debugging task.
We have N versions of software. One version introduces a bug. If version n contains the bug, then all versions n+1, n+2, n+3, ... also contain the bug. We define a prior probability p(B[n]), which is the probability that version n introduced the bug. p(B[n]) = 1/N for all n to begin with.
We are going to run a test on version W of the software. This test will either run or crash. We will consider the result of the test to be a random variable A[W], where A[W] is true if the program crashes, and false if the program does not crash.
Now, for each version V of the software, we want to update the probability that it introduces the bug. To use Bayesian inference, we will observe A[W] (i.e. we will run version W and see whether it crashes). If A[W] is true we will update p(B[V]) <- p(B[V] | A[W]) and if A[W] is false we will update p(B[V]) <- p(B[V] | !A[W]).
What we have in terms of information is the probability that the program will crash if the bug is present. We call this probability p. To handle a more general case we will also assume that the program may crash anyway, even if the bug is not present. We will call this probability q. In my original problem, p is about 0.5 and q is 0. For p=1 & q=0, we have traditional deterministic binary search, so it will be interesting to see if we retrieve that special case correctly.
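The code fragments below share a few globals. A minimal setup consistent with them might look like this (the names pB, numBins, p and q come from the code itself; MAX_BINS and InitTable are just mine):

#define MAX_BINS 1024

int   numBins = 64;    // N, the number of versions under suspicion
float pB[MAX_BINS];    // pB[V] is p(B[V]), our belief that version V introduced the bug
float p = 0.5f;        // chance of a crash when the bug is present
float q = 0.0f;        // chance of a crash when the bug is absent

void InitTable(void)
{
    // Maximum-entropy prior: every version equally suspect.
    for (int V = 0; V < numBins; V++)
        pB[V] = 1.0f / numBins;
}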
From this we can calculate p(A[W] | B[V]) - if V <= W, then p(A[W] | B[V]) = p (since if the bug appears in version V it affects version W) and if V > W, then p(A[W] | B[V]) = q (since if the bug appears in version V it does not affect version W). Thus we can write down p(A[W] | B[V]) for every combination of V and W.
p(A[W] | B[V]) = p when V <= W
= q when V > W
p(!A[W] | B[V]) = 1-p when V <= W
= 1-q when V > W
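That four-case table can be captured in a tiny helper (Likelihood is my name for it, not part of the derivation):

// Returns p(A[W] | B[V]) if crashed is true, p(!A[W] | B[V]) otherwise.
float Likelihood(int V, int W, bool crashed)
{
    float pCrash = (V <= W) ? p : q;   // a bug introduced at V affects W iff V <= W
    return crashed ? pCrash : 1 - pCrash;
}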
Finally, for input data, we also have p(B[V]) from our table.
Using Bayes' theorem, we can calculate our Bayesian update:
p(B[V] | A[W]) = p(A[W] | B[V]) * p(B[V]) / p(A[W])
p(B[V] | !A[W]) = p(!A[W] | B[V]) * p(B[V]) / p(!A[W])
The above relies on p(A[W]), which we haven't worked out yet. But exactly one version introduced the bug, so the B[V] are mutually exclusive and their probabilities sum to 1, and the law of total probability gives:
p(A[W]) = sum(V) p(A[W] | B[V]) * p(B[V])
p(!A[W]) = 1 - p(A[W])
Or, in code
float Calc_pA(int W)
{
    // p(A[W]) = sum over V of p(A[W] | B[V]) * p(B[V]).
    float pA = 0;
    for (int V = 0; V <= W; V++)          // version W contains the bug: crash probability p
    {
        pA += p * pB[V];
    }
    for (int V = W + 1; V < numBins; V++) // bug not yet introduced: crash probability q
    {
        pA += q * pB[V];
    }
    return pA;
}
So we can get the update formula for B[V]:
if A[W] is true:
    calculate p(A[W])
    for earlier versions V <= W : p(B[V]) -> p * p(B[V]) / p(A[W])
    for later versions V > W : p(B[V]) -> q * p(B[V]) / p(A[W])
if A[W] is false:
    calculate p(!A[W])
    for earlier versions V <= W : p(B[V]) -> (1-p) * p(B[V]) / p(!A[W])
    for later versions V > W : p(B[V]) -> (1-q) * p(B[V]) / p(!A[W])
Or in code:
void UpdateTable(int W, bool didCrash)
{
    // Bayesian update of the whole table, given the observed result of testing version W.
    float pEvent = didCrash ? Calc_pA(W) : 1 - Calc_pA(W);   // p(A[W]) or p(!A[W])
    float pLeft  = didCrash ? p : 1 - p;                     // likelihood for V <= W
    float pRight = didCrash ? q : 1 - q;                     // likelihood for V > W
    for (int V = 0; V <= W; V++)
    {
        pB[V] = pB[V] * pLeft / pEvent;
    }
    for (int V = W + 1; V < numBins; V++)
    {
        pB[V] = pB[V] * pRight / pEvent;
    }
}
The initial state of the probability table has maximum entropy. Entropy is a measure of information content, or in this case, the amount of information we do not have.
Entropy is defined as:
H = -sum(p[i] * log2(p[i]))
(contribution is 0 if p[i] = 0)
This is a weighted sum, over each possible answer, of the amount of additional information you would receive if someone told you that answer was correct. Each term with a low probability has a high information content, but a low weight; each term with a high probability has a low information content, but a high weight. Thus the terms all tend to zero as we become sure of an answer. Conversely, when all the p[i] are equal to 1/N the entropy attains its maximum value, H = log2(N) (so, for instance, if someone told you the answer right at the start, and there were 64 possibilities, you'd receive 6 bits of information).
We want to minimize this value at each stage - the nearer H is to zero, the closer we are to solving the problem.
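The expected-entropy code below calls a per-term helper, Entropy(x); a minimal version consistent with that usage (one term of the sum above, with the convention that 0 * log2(0) = 0) is:

#include <math.h>

float Entropy(float x)
{
    // One term of H: -x * log2(x), defined to be 0 when x = 0.
    return (x > 0) ? -x * log2f(x) : 0.0f;
}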
Suppose we are going to test version W. We can figure out the new table B[V] assuming A[W] is true, and figure out its entropy in that case. We can then figure out the new table assuming A[W] is false, and figure out its entropy too. Now, by multiplying each entropy by our probability that A[W] is true or false, we can calculate the expected entropy resulting from making measurement A[W]. This tells us how much information we expect to get from making this measurement.
It then seems reasonable to make the measurement that we expect to give us the most information.
Expected entropy is calculated as follows:
float ExpectedEntropy(int W)
{
    // Entropy of the updated table if version W crashes, and if it doesn't,
    // weighted by the probability of each outcome.
    float pA = Calc_pA(W);
    float Htrue = 0;
    float Hfalse = 0;
    for (int V = 0; V <= W; V++)
    {
        Htrue += Entropy(pB[V] * p / pA);
        Hfalse += Entropy(pB[V] * (1-p) / (1-pA));
    }
    for (int V = W + 1; V < numBins; V++)
    {
        Htrue += Entropy(pB[V] * q / pA);
        Hfalse += Entropy(pB[V] * (1-q) / (1-pA));
    }
    if (pA == 0) Htrue = 0;       // outcome impossible; discard the division-by-zero junk
    if ((1-pA) == 0) Hfalse = 0;  // likewise
    return pA * Htrue + (1 - pA) * Hfalse;
}
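Choosing the version to test is then just an argmin over W - a sketch (ChooseW is my name for it):

#include <float.h>

int ChooseW(void)
{
    // Test the version whose measurement is expected to leave the least
    // entropy behind, i.e. to yield the most information.
    int bestW = 0;
    float bestH = FLT_MAX;
    for (int W = 0; W < numBins; W++)
    {
        float H = ExpectedEntropy(W);
        if (H < bestH) { bestH = H; bestW = W; }
    }
    return bestW;
}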
The total algorithm is just the following steps:
1. Initialize p(B[V]) = 1/N for every version V (the maximum-entropy prior).
2. Choose the version W whose test has the lowest expected entropy.
3. Run the test on version W and observe whether it crashes.
4. Update the whole table using the Bayesian update above.
5. Repeat from step 2 until some p(B[V]) exceeds a chosen confidence threshold.
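Put together, the driver loop might look like this (RunTest, which syncs, builds and runs version W, is hypothetical, as is the 0.999 confidence threshold):

extern bool RunTest(int W);   // hypothetical: build and run version W; returns true on a crash

int FindBug(void)
{
    InitTable();   // uniform prior, as above
    for (;;)
    {
        // Stop once some version is almost certainly the culprit.
        int best = 0;
        for (int V = 1; V < numBins; V++)
            if (pB[V] > pB[best]) best = V;
        if (pB[best] > 0.999f) return best;

        int W = ChooseW();            // the most informative test
        UpdateTable(W, RunTest(W));   // fold the observation into the table
    }
}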
The following is a printout for p=1, q=0, N=64, bug at version 46. This is classical binary search:
Test version 31 did not crash; best guess 32 with p=0.031250; entropy 5.000000
Test version 47 ++DID++ crash; best guess 32 with p=0.062500; entropy 4.000000
Test version 39 did not crash; best guess 40 with p=0.125000; entropy 3.000000
Test version 43 did not crash; best guess 44 with p=0.250000; entropy 2.000000
Test version 45 did not crash; best guess 46 with p=0.500000; entropy 1.000000
Test version 46 ++DID++ crash; best guess 46 with p=1.000000; entropy 0.000000
SOLVED. Bug is introduced in revision 46 (correct answer 46). Probability of error = 0.000000. Number of tests = 6.
As you can see, we perform the same tests as the classical binary search algorithm (with no off-by-one errors!). The probability update reduces to setting entire segments of the array to 0, and the minimum entropy selection "magically" chooses the correct values for W at each stage, precisely reproducing the classical algorithm. Note also that the entropy reduces by exactly one bit per iteration.
If you write the code yourself it is actually slightly scary when you see this very tiny program derive an important algorithm without any other help - it's almost AI-like.
The following are results for the original problem: p=0.5, q=0, N=64, bug at version 46:
Test version 25 did not crash; best guess 26 with p=0.019608; entropy 5.927327
Test version 32 did not crash; best guess 33 with p=0.024390; entropy 5.759991
Test version 38 did not crash; best guess 39 with p=0.030303; entropy 5.536818
Test version 43 did not crash; best guess 44 with p=0.037736; entropy 5.279807
Test version 47 ++DID++ crash; best guess 44 with p=0.095238; entropy 4.785175
Test version 38 did not crash; best guess 44 with p=0.117647; entropy 4.330110
Test version 41 did not crash; best guess 44 with p=0.148148; entropy 3.870628
Test version 43 did not crash; best guess 44 with p=0.186047; entropy 3.382660
Test version 44 did not crash; best guess 45 with p=0.238806; entropy 2.963477
Test version 45 did not crash; best guess 46 with p=0.323232; entropy 2.590215
Test version 45 did not crash; best guess 46 with p=0.392638; entropy 2.147041
Test version 46 did not crash; best guess 47 with p=0.563877; entropy 1.835910
Test version 46 ++DID++ crash; best guess 46 with p=0.646465; entropy 1.943751
Test version 45 did not crash; best guess 46 with p=0.785276; entropy 1.361765
Test version 45 did not crash; best guess 46 with p=0.879725; entropy 0.872590
Test version 45 did not crash; best guess 46 with p=0.936015; entropy 0.525242
Test version 45 did not crash; best guess 46 with p=0.966950; entropy 0.303562
Test version 45 did not crash; best guess 46 with p=0.983197; entropy 0.170931
Test version 45 did not crash; best guess 46 with p=0.991527; entropy 0.094610
Test version 45 did not crash; best guess 46 with p=0.995746; entropy 0.051748
Test version 45 did not crash; best guess 46 with p=0.997868; entropy 0.028057
Test version 45 did not crash; best guess 46 with p=0.998933; entropy 0.015110
Test version 45 did not crash; best guess 46 with p=0.999466; entropy 0.008092
SOLVED. Bug is introduced in revision 46 (correct answer 46). Probability of error = 0.000534. Number of tests = 23.
As you can see, the algorithm makes its first choice at bin 25 (of 64 bins). It is hedging - it knows that if it sees the bug in an earlier revision it can eliminate huge numbers of bins, but it is balancing this against how unlikely it is that the bug lies in an earlier version.
The results here are also quite AI-like - we can infer "reasoning" behind the scenes but in fact it's just churning through a simple array using a simple Bayesian update. But that update contains all the known information about the system, so it is impossible to beat with more conventional "thought" on the matter; and any "thought" we have on the matter, if correct, is necessarily reflected in the output of our algorithm.
So imagine what more complicated Bayesian systems can do ...
In reality, it takes longer to get a version than to not get a version, and the further away the version is from the current one, the longer it takes. Also, it takes longer to compile a new version than to not compile a new version. Overall, it is much cheaper to measure A[W] repeatedly for the same W than to keep changing versions and testing (especially if you are running the tests by hand!), and it is cheaper to go to A[W+1] than to A[W+10].
So, if we could quantify this, and estimate a cost per W, then we could perhaps divide the entropy reduction by the cost to give us a figure of "entropy reduction per buck". Then we would pick the best "value" choice for W. It would be interesting to see what decisions the algorithm makes in this case.
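To sketch how that might look (Cost and CurrentEntropy are hypothetical - Cost(W) would estimate the sync, build and test time for version W, and CurrentEntropy the entropy of the table as it stands):

extern float Cost(int W);          // hypothetical: estimated cost of testing version W
extern float CurrentEntropy(void); // hypothetical: entropy of the current pB table

int ChooseW_ByValue(void)
{
    // Maximize expected entropy reduction per unit cost.
    int bestW = 0;
    float bestValue = -FLT_MAX;
    float H = CurrentEntropy();
    for (int W = 0; W < numBins; W++)
    {
        float value = (H - ExpectedEntropy(W)) / Cost(W);
        if (value > bestValue) { bestValue = value; bestW = W; }
    }
    return bestW;
}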
Another possible improvement is this: what if p and q are not known? By looking at the experiments it should be possible to derive estimates for p and q as you go along. I'm not sure how best to approach this : by having "bins" for p and q with a probability per bin, or by having a central estimate and variance for p and q which is updated as time progresses. Any thoughts?
This was my first Bayesian algorithm, and it was a fun journey. It seems that such algorithms have great potential for problems involving uncertainty. I should note that this algorithm is not novel - I have reinvented "Bayesian Binary Search" - Google that term to find other references.
Although I've never previously found statistics to be that fascinating, I must say my interest is now developing. I'm curious to know what else Bayes' Theorem can do!