In the covid origins debate, Bayesian computations featured prominently as both sides argued their cases at least in part using the language of Bayes factors and Bayesian updating.
The most important question I asked at the debate was whether Bayesian reasoning is a valid approach for resolving such questions: specifically, whether it is possible to reach the wrong conclusion at the end of a Bayesian argument even when all the numbers that went into the computation are correct. (For more on this topic, see section 3 of my report (pdf).)
However before we can understand the potential weaknesses of Bayesian reasoning or how it can go wrong, we need to understand what it is and how to compute Bayesian updates, which is the goal of this article; while we are here, we also discuss the closely related and famously confusing method of hypothesis testing and p-values. There is a common sentiment that Bayesian reasoning is the “right” way to make deductions or that Bayesian reasoning is “better” than hypothesis testing; this is the same kind of category error as a statement like “cheese is tastier than food”. We will discuss how they relate to each other and why primary science research almost always uses hypothesis testing.
(If you are uninterested in hypothesis testing, you can skip to the third section.)
We are all familiar with the most basic step in logical deduction, modus ponens: if $A$ is true, and $A$ implies $B$, then $B$ is true. [Footnote 1: Taking what is intuitively obvious and dressing it up in formal language may seem counterproductive when working with toy examples, but it has the advantage of making the statements amenable to algebraic manipulation. Algebraic manipulation of formal symbols is much faster and more reliable when the task at hand is too large to hold in one's head.] Logical deduction is useful because it allows for chaining: if $A$ implies $B$, and $B$ implies $C$, then $A$ implies $C$.
The next most basic form of logical deduction is that "$A$ implies $B$" is equivalent to its contrapositive, "$\neg B$ (i.e., "not $B$") implies $\neg A$". I suspect that some people find this less
intuitive because it does not work especially well under uncertainty in
the real world. Suppose, for example, we want to test the claim that all ravens are
black. We could build evidence for this claim by observing ravens,
and verifying that they are black. However that sounds like a lot of
work, so how about an easier plan: observe that “all ravens are black”
is logically equivalent to “all non-black things are not ravens”. We
could build evidence for this completely equivalent claim by observing
non-black things, and verifying they are not ravens! And, we can do that
without even having to go outside and risk encountering an actual raven
(hence this is sometimes referred to as “indoor ornithology”).
This is only the beginning of our difficulties as we stray further from pure mathematics. In the real world, we can never be fully certain of any statement, so the preconditions of modus ponens will never apply. The most common resolution is to augment each statement with a probability, representing our confidence in its truth. What if we try logical deduction now?
Attempting to perform deduction of probably-true statements in the same way that we did above immediately falls on its face. The root issue is that it is impossible to perform inference along chains: if $A$ implies probably-$B$, and $B$ implies probably-$C$, we cannot conclude that $A$ implies probably-$C$! (Similarly, probably-$A$ and "$A$ implies probably-$B$" does not imply probably-$B$.) [Footnote 2: In the terminology of functional programming, this is to say "probably-" is not a monad, because for a monad $M$ you can combine $M\,A$ and $A \to M\,B$ to make $M\,B$ (this is the "bind" operation), and you can combine $A \to M\,B$ and $B \to M\,C$ to make $A \to M\,C$ (this is lifted bind). Equivalently, if "probably-" were a monad, then "join" would say that probably-probably-$A$ implies probably-$A$, which is false. The failure of the monadic laws is what makes probabilistic inference like this useless.] If we were to chain together probably-true inferences, our uncertainty compounds until we are left with nothing.
To make discussion easier, let us write $\Box A$ to mean "$A$ is probably true" (think of $\Box A$ as saying that $A$ is true-ish), and $A \to B$ to mean "$A$ implies $B$". As usual, $A \lor B$ means "$A$ or $B$", $A \land B$ means "$A$ and $B$", and $\neg A$ means "not $A$". For example, one can verify each of the following: [Footnote 3: The monadic laws for $\Box$ would say $(\Box A \land (A \to \Box B)) \to \Box B$ and $\Box\Box A \to \Box A$, both of which are false.]

1. $(A \to B) = (\neg A \lor B)$
2. $(A \to B) = (\neg B \to \neg A)$
3. $A \to (A \lor B)$
4. if $A \to B$ holds in all circumstances, then $\Box A \to \Box B$
5. $(\Box A \lor \Box B) \to \Box(A \lor B)$

Note that the first statement above can be thought of as the definition of $\to$, the second is the contrapositive, and the third is one of the axioms of $\lor$. Before we move on, let us prove the last two statements. To show the fourth statement: when $A \to B$ holds in all circumstances we have
$$B = (A \land B) \lor (B \land \neg A) = A \lor (B \land \neg A),$$
so $P(B) \ge P(A)$, where we have used the definition of $\to$ and basic properties of $\land$ and $\lor$; hence if $A$ is probably true then so is $B$.

For the fifth statement, we have $A \to (A \lor B)$, and likewise $B \to (A \lor B)$, so by the fourth statement the conclusion follows. Note that the converse (i.e., flip the direction of $\to$) is false: it is possible for $A \lor B$ to be probably-true, but neither $A$ nor $B$ is probably-true.
Let us return again to the contrapositive, which says that if $A \to B$ then $\neg B \to \neg A$. We would like to use $\Box$ to make a similar statement, that if $A \to \Box B$ then $\neg B \to \Box\neg A$. This is false (e.g., if $A$ and $\Box B$ are true, but not $B$ nor $\Box\neg A$, then the antecedent is true but the conclusion is false). [Footnote 4: You may recall that in ordinary logic we can test the validity of statements using truth-tables: that is, tabulate every combination of possible values of true and false for each variable, and then verify the claim simplifies to "true" for each combination. We could do something similar, but now each variable has three possible values: true, probably-true but false, and definitely false. Indeed this technique successfully disproved the statement here. However I am not sure this could be used in general, because of the existence of expressions like $\Box(A \land B)$ and $\Box(A \lor B)$ which cannot be simplified into expressions depending only on the values of $A$ and $B$.] However, a weaker form of the contrapositive is true: if $A \to \Box B$, then $\Box(\neg B \to \neg A)$. Let us prove that. First, since $B$ implies $B \lor \neg A$, therefore $\Box B$ implies $\Box(B \lor \neg A)$. Therefore
$$A \to \Box B \to \Box(B \lor \neg A) = \Box(\neg B \to \neg A),$$
and likewise
$$\neg A \to (B \lor \neg A) \to \Box(B \lor \neg A) = \Box(\neg B \to \neg A)$$
(a statement that is actually true is certainly probably-true), so $\Box(\neg B \to \neg A)$ holds in either case, where we have used the various properties from the list of examples above.
How is this useful? Suppose we have some claim $H$ whose truth is of interest, but we cannot directly observe $H$. Instead $H$ has some consequence, $E$, which is observable. If we observe $\neg E$, then we can conclude $\neg H$.

Of course the real world is rarely so generous as to have certainty, so rather than $H \to E$ we only know the weaker consequence $H \to \Box E$. Now from $\neg E$ we cannot conclude $\neg H$, but we can probably conclude $\neg H$, as we have $\Box(\neg E \to \neg H)$. This is not to be confused with definitely concluding $\Box\neg H$, which would be $\neg E \to \Box\neg H$.
This is exactly the formula used in hypothesis testing. We have some unobservable claim, $H$, called the null hypothesis. To test $H$, we build a statement of the form $H \to \Box E$; this process will be explained in more detail shortly. We then perform an experiment to measure $E$. If we observe $\neg E$, then we can probably conclude $\neg H$: we reject the null hypothesis at some confidence level, representing a false negative rate.
We see from this the two main sources of confusion that arise in hypothesis testing. First, to make this deduction we must observe $\neg E$. If we observe $E$ then we can make no deduction (we "fail to reject the null hypothesis"), just as knowing $A \to B$ and $B$ gives us no information about $A$. Second, even if we do observe $\neg E$, we do not conclude that $H$ is probably false; instead we probably conclude that $H$ is false. These sound very similar but this is the distinction between $\Box(\neg E \to \neg H)$, which is true, and $\neg E \to \Box\neg H$, which is false. It is common for people to mistakenly interpret the p-value as the probability that $H$ is true; but instead it is the probability of a false negative if $H$ is true.
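To see numerically that these are different quantities, here is a small Python simulation. The prior (the null is true 90% of the time) and the distribution of p-values when the null is false (the square of a uniform variable, skewed toward 0) are both arbitrary assumptions of mine, made purely for illustration; the point is only that the fraction of rejections that are mistaken need not be anywhere near the 5% threshold.

```python
import random

random.seed(0)
trials = 1_000_000
rejected = 0          # experiments with p < 0.05
rejected_h_true = 0   # ... of which H was actually true

for _ in range(trials):
    h_true = random.random() < 0.9   # assumed prior: H is usually true
    # Under H the p-value is uniform on [0, 1]; under not-H we assume,
    # arbitrarily, a distribution skewed toward 0.
    p = random.random() if h_true else random.random() ** 2
    if p < 0.05:
        rejected += 1
        rejected_h_true += h_true

print("P(p < 0.05 | H) = 0.05 by construction")
print("P(H | p < 0.05) =", rejected_h_true / rejected)   # about 0.67, not 0.05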
Time for the messy bit: how do we construct a statement of the form $H \to \Box E$? That is, we need to make a choice for $E$. As an example, suppose we pick a random number from 0 to 1, and say that $E$ is the observation that the random number is bigger than 0.05. Then $H \to \Box E$, where $\Box$ means "with 95% confidence". If we observe $\neg E$, that the random number is smaller than 0.05, we can reject $H$ at a 95% confidence level. [Footnote 5: Obviously you should choose the desired confidence level for your application, and not blindly follow the traditional choice of 5% error rate, but for the purposes of our discussion it is fine.]
As an example, let us say that $H$ is the hypothesis that a certain population has a mean height of 2m and standard deviation of 0.1m. Using the choice for $E$ above, then $H \to \Box E$ is true, where $\Box$ means 95% confidence. If we perform the experiment we either observe $E$ or $\neg E$. If the former, we deduce nothing; if the latter, we have $\neg H$ at 95% confidence, which we express as "rejecting" the null hypothesis $H$.
Obviously this is not a great choice of $E$, as it has no relation to the underlying claim $H$ that we want to test. Our approach has a 5% false negative rate: there is a 5% chance of us rejecting $H$ even if it is true. But we also have only a 5% true negative rate: a 5% chance of correctly rejecting $H$ if it is false. Our goal then is two-fold: we want to choose an $E$ such that $H \to \Box E$ holds at some desired false negative rate, and, conditional on maintaining that false negative rate, maximize the probability of observing $\neg E$ if $H$ is false.
The task of choosing such an $E$ is usually broken up into three steps: choosing a test statistic, calculating a p-value, and choosing a false negative rate (the threshold for the p-value). Frequently there is an obvious choice for the test statistic that is the mathematically best way to distinguish the possibilities $H$ and $\neg H$, for a certain collection of observations; other times it is less obvious what the optimal choice is, or if there even is one, and we just have to pick something that seems good. Different choices for the test statistic correspond to different statistical tests; one is not more correct than another, but one may have greater statistical power for the same significance threshold, and therefore be more likely to reach a useful conclusion for a given dataset.
Using again that $H$ is the claim that a population has a mean height of 2m and standard deviation of 0.1m, we will make a more sensible choice for $E$. To have any chance of progress, we need observations that relate to $H$ in some way, so let us suppose we have access to a random sample of members of that population, and can measure their heights. Thus we have as data some collection of numbers $x_1, \ldots, x_n$ representing these heights; these $x_i$ are random variables, [Footnote 6: Formally a "random variable" means "a value randomly sampled from a probability distribution".] in that each time we perform the experiment we may get different values for the $x_i$, and we suppose they are iid. Depending on our application, we might have much more complicated observations, including non-numerical data.

A test statistic is any real-valued function of the observations. [Footnote 7: Well, technically that is the definition of a statistic. A test statistic is a statistic that is useful in hypothesis testing. Also it doesn't really have to be real-valued; it just needs some ordering.] We want to choose a test statistic that is informative about whether $H$ is true or false. In this case, the obviously best test statistic is the sample mean:
$$\bar{x} = \frac{x_1 + \cdots + x_n}{n}.$$
Next, we calculate a p-value. A test statistic is a random variable (because it is a function of random variables) and therefore follows some distribution. We want to choose our test statistic in such a way that if $H$ is true, we know the distribution the test statistic follows: in this case, $\bar{x}$ is approximately normally distributed with mean 2m and standard deviation $0.1/\sqrt{n}$ m. The p-value is found by normalizing the test statistic to be uniformly distributed in the range 0 to 1, assuming $H$ is true. We do this by computing the cdf of the test statistic and applying it to the observed value of the test statistic; that is, we compute the percentile of the test statistic in its distribution. (We will discuss one-tailed vs two-tailed tests later.)
Finally, we choose a threshold such as 0.05, and $E$ is the observation that the p-value is greater than 0.05. Thus, if $H$ is true, the p-value is a random variable uniformly distributed in the interval 0 to 1, and therefore there is a 95% chance of observing $E$. That is, $P(E \mid H) = 0.95$, from which we deduce $H \to \Box E$, as desired: our test has a 5% false negative rate. However now we have improved on our previous choice of $E$. If $H$ is false, the distribution of $\bar{x}$ is not what we calculated above, and therefore the p-value is not uniformly distributed from 0 to 1. Hopefully the p-value strongly favors values below 0.05 – if so, we will have a high true negative rate (and so low false positive rate). [Footnote 8: You are welcome to, say, suppose the population has a height mean of 1.5m and calculate the distribution of $\bar{x}$ and the p-value under that assumption; if $n$ is large enough, the p-value will be strongly weighted towards 0.]
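As a concrete sketch of the whole procedure in Python: the sample heights below are numbers I made up for illustration, and the normalization is the one-tailed cdf variant just described (see the discussion of tails later).

```python
from statistics import NormalDist, mean

# H: the population has mean height 2.0 m and standard deviation 0.1 m.
def p_value(heights, mu=2.0, sigma=0.1):
    n = len(heights)
    xbar = mean(heights)   # the test statistic: the sample mean
    # Under H the sample mean is approximately N(mu, sigma / sqrt(n)),
    # so applying its cdf gives a value uniform on [0, 1] when H is true.
    return NormalDist(mu, sigma / n ** 0.5).cdf(xbar)

sample = [1.93, 2.04, 1.98, 1.87, 1.95, 2.01, 1.91, 1.89]
print(p_value(sample))   # p = 0.069: here we fail to reject at the 5% threshold
```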
While historically hypothesis testing was intended to give a reject/fail-to-reject binary outcome, as described above, in practice one usually reports p-values directly.
So far we have been staying close to propositional logic, with a bit of uncertainty mixed in. As we change course to look at Bayesian reasoning, we will fully accept the probabilistic nature of our computations. From now on we use the letters $A$, $B$, …, to represent events, with the symbol $P(A)$ meaning the probability of the event $A$.
Suppose you have two exclusive events $A$ and $B$ you wish to distinguish. [Footnote 9: In probability theory, an event is a statement which has a probability of being true. We assume events follow certain rules, such as: if $A$ and $B$ are events then there is an event called $A \land B$, representing that both $A$ and $B$ are true, satisfying the usual axioms of probability.] Unfortunately we cannot observe them directly; instead we make some sequence of other observations $E_1, E_2, \ldots, E_n$.
If one of our observations were, say, incompatible with $A$, then we would be done; by simple deductive reasoning we could say "$E_1$ implies $\neg A$; $E_1$; therefore $\neg A$". Instead each of our observations is consistent with either $A$ or $B$ – but, critically, not equally so. We measure the degree of consistency with the conditional probabilities [Footnote 10: The symbol $P(A \mid B)$, read "(the conditional) probability of $A$ given $B$", is defined as the ratio $P(A \land B) / P(B)$.] $P(E_1 \mid A)$ and $P(E_1 \mid B)$. From this information it is a simple application of Bayes' Theorem to compute $P(A \mid E_1)$, and repeated application to get $P(A \mid E_1 \land \cdots \land E_n)$.
There is nothing deeper to performing Bayesian computation than repeated use of Bayes' Theorem, but with a little organization we can gain better understanding of the process and be less likely to make mistakes. First, we can build a table of the information we know:

|  | $A$ | $B$ |
|---|---|---|
| prior | $P(A)$ | $P(B)$ |
| observation 1 | $P(E_1 \mid A)$ | $P(E_1 \mid B)$ |
| observation 2 | $P(E_2 \mid E_1 \land A)$ | $P(E_2 \mid E_1 \land B)$ |
| observation 3 | $P(E_3 \mid E_1 \land E_2 \land A)$ | $P(E_3 \mid E_1 \land E_2 \land B)$ |
Because we will be interested in the event that all of the observations are jointly true, $E_1 \land E_2 \land E_3$, rather than only one or another of them being true, we have to use the probability of each observation conditioned on the previous ones also happening. If our observations are independent of each other then this is unnecessary, as $P(E_2 \mid E_1 \land A) = P(E_2 \mid A)$ in that case.
Next, let us use the definition of conditional probability:
$$P(E_1 \land A) = P(A)\,P(E_1 \mid A), \qquad P(E_1 \land E_2 \land A) = P(A)\,P(E_1 \mid A)\,P(E_2 \mid E_1 \land A), \quad \ldots$$
Thus, the cumulative products we get by just multiplying down the columns of the previous table are the joint probabilities:
|  | $A$ | $B$ |
|---|---|---|
| – | 1 | 1 |
| prior | $P(A)$ | $P(B)$ |
| observation 1 | $P(E_1 \land A)$ | $P(E_1 \land B)$ |
| observation 2 | $P(E_1 \land E_2 \land A)$ | $P(E_1 \land E_2 \land B)$ |
| observation 3 | $P(E_1 \land E_2 \land E_3 \land A)$ | $P(E_1 \land E_2 \land E_3 \land B)$ |
(Note that mathematically there is nothing distinguishing the prior probability from any of our observations; the choice of what information to call “prior” versus “observation” is just a matter of convention. You can think of prior as the “null” or trivial observation: how much you should update your probabilities based on not making any observations at all.)
We should interpret each row as telling us the relative probability of either of the hypotheticals jointly with the cumulative observations. Initially, before making any observations, the prior probabilities $P(A)$ and $P(B)$ sum to 1, but as we apply successive observations the sum of each row will decrease and become smaller than 1. This residual probability (i.e., the amount by which each row sums to less than 1) represents the chance of some hypothetical alternative in which at least one of the observations didn't happen.
Ok, what good has this done us? Well, our goal is to find the conditional probability $P(A \mid E_1 \land \cdots \land E_n)$. This is just equal to the fraction of the $n$th row that is in the first column:
$$P(A \mid E_1 \land \cdots \land E_n) = \frac{P(E_1 \land \cdots \land E_n \land A)}{P(E_1 \land \cdots \land E_n \land A) + P(E_1 \land \cdots \land E_n \land B)}.$$
Indeed, all that matters is the ratio of the two columns:
$$\frac{P(A \mid E_1 \land \cdots \land E_n)}{P(B \mid E_1 \land \cdots \land E_n)} = \frac{P(E_1 \land \cdots \land E_n \land A)}{P(E_1 \land \cdots \land E_n \land B)}.$$
Since the columns of the second table were found by just multiplying down the columns of the first table, the ratio of the columns of the second table is just the product of the ratios of the columns of the first table:
$$\frac{P(A \mid E_1 \land \cdots \land E_n)}{P(B \mid E_1 \land \cdots \land E_n)} = \frac{P(A)}{P(B)} \cdot \frac{P(E_1 \mid A)}{P(E_1 \mid B)} \cdot \frac{P(E_2 \mid E_1 \land A)}{P(E_2 \mid E_1 \land B)} \cdots$$
Thus we began with eight [Footnote 11: Actually seven, because we knew $P(A) + P(B) = 1$, so the first row only had one data point.] pieces of data in the first table but it turns out that all we needed was four pieces of data, supposing we can directly measure these ratios.
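As a sanity check, here is a minimal Python sketch of the two tables above; the priors and conditional probabilities are invented numbers, not from any real problem.

```python
# Each column lists the prior followed by P(E_i | earlier E's and hypothesis).
col_A = [0.5, 0.8, 0.3, 0.9]
col_B = [0.5, 0.4, 0.6, 0.2]

joint_A = joint_B = 1.0
for a, b in zip(col_A, col_B):
    joint_A *= a   # cumulative product: P(E_1 ... E_i and A)
    joint_B *= b

# P(A | E_1 and E_2 and E_3): the fraction of the last row in column A.
print(joint_A / (joint_A + joint_B))   # 0.108 / (0.108 + 0.024), about 0.82
```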
Indeed, frequently the ratio $P(E \mid A)/P(E \mid B)$, called the Bayes factor, is easier to measure than either $P(E \mid A)$ or $P(E \mid B)$ separately; in messy, real-world scenarios the probability of the event $E$ might be wrapped up with many uncertain factors that have nothing to do with either $A$ or $B$. Estimating $P(E \mid A)$ requires assessing these irrelevant factors, but estimating the Bayes factor does not.
In extremes where probabilities get very close to 0 or 1 we can simplify matters further by taking logarithms everywhere – why multiply when instead you can add? We then have logarithmic Bayes factors, where
$$I_1 = \log \frac{P(E_1 \mid A)}{P(E_1 \mid B)}$$
is the information contained in the first observation, with positive numbers informing us in favor of event $A$, and negative numbers informing us against it. If the logarithm is base 2, then $I_1$ has units of bits. Adding up each of our pieces of information gives
$$\log \frac{P(A \mid E_1 \land \cdots \land E_n)}{P(B \mid E_1 \land \cdots \land E_n)} = \log \frac{P(A)}{P(B)} + I_1 + I_2 + \cdots + I_n.$$
(Given a new piece of information $I_{n+1}$, we can simply add it to the sum we have so far; this is called Bayesian updating.)
This result is also called the (conditional) log-odds of $A$. [Footnote 12: The log-odds of $A$ is defined as $\log\big(P(A)/P(\neg A)\big)$, but here we have conditioned on the observations $E_1, \ldots, E_n$.] Log-odds can in some situations be more intuitive than ordinary probability, especially for extreme probabilities. A log-odds of 0 means a probability of 50%, and positive log-odds means an event that is more likely to happen than not. From the (conditional) log-odds $L$ of $A$ we can compute the ordinary (conditional) probability:
$$P = \frac{1}{1 + e^{-L}}$$
(with $e$ replaced by 2 if the logarithms are base 2).
How does this work in practice? Suppose some event $A$ of interest has some probability of being true; start by computing the unconditional (i.e., prior) log-odds of $A$. We make a series of observations, and assess for each observation how much information it provides in favor of $A$ versus against it. We update our log-odds of $A$ by adding to it the information (positive or negative) from each observation. The result is the updated (i.e., conditional on the observations) log-odds for $A$. Alternatively, if we don't want to work with logarithms, instead of adding up log-odds we can directly multiply the odds by the Bayes factors.
Let us work an example. Suppose you are hiring an engineer, and you want to know their competency at a particular skill (event $A$; event $B$ is that they are not competent at it). You make three independent observations: they did adequately on an interview assessing that skill, they have an obscure certificate for that skill, and they went to University of Example which has a good engineering program. Most candidates, whether competent or not, do not have that certificate nor went to that university, so it is hard to assess the probability of those observations; but the relative probabilities, or Bayes factors, are easier to guess at. We have

|  | $P(E \mid A)$ | $P(E \mid B)$ | Bayes factor | log-odds |
|---|---|---|---|---|
| prior | 0.1 | 0.9 | 0.111 | -2.2 |
| interview | 0.4 | 0.1 | 4 | 1.39 |
| certificate | – | – | 1.2 | 0.18 |
| UoE grad | – | – | 2 | 0.69 |
| total | – | – | 1.0667 | 0.065 |
Adding up the last column we get a log-odds of 0.065, or a probability of 51.6%, just barely above even odds that the candidate has this particular skill.
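The same numbers can be reproduced in a few lines of Python (natural logarithms; the prior and Bayes factors are the guesses from the table above):

```python
import math

prior_odds = 0.1 / 0.9
bayes_factors = {"interview": 4.0, "certificate": 1.2, "UoE grad": 2.0}

log_odds = math.log(prior_odds)      # about -2.20
for bf in bayes_factors.values():
    log_odds += math.log(bf)         # Bayesian updating: just add

probability = 1 / (1 + math.exp(-log_odds))
print(f"log-odds = {log_odds:.3f}, probability = {probability:.1%}")
# log-odds = 0.065, probability = 51.6%
```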
This example also illustrates a second important principle to understand with Bayesian reasoning: it can be applied to any situation, and always gives an answer, regardless of how appropriate the technique is for the application. I certainly hope no one involved in hiring candidates is using a calculation of this nature to help make that decision, or at least not with the same lack of care as I did above. One must pay close attention to potential problems: did you account for all the available evidence? how accurate are your probabilities? how robust is your result to changes in the data? The messier and more "real-world" your situation, the easier it is to run afoul of these problems.
The short answer to how hypothesis testing relates to Bayesian reasoning is very simple: the conditional probability $P(E \mid H)$ of making an observation $E$ conditional on the null hypothesis $H$ is the p-value of that observation. [Footnote 13: Following a conversation with Michael Weissman in which he disagreed with this statement, I should clarify that this is only true for a binary observation. More generally one should say that the conditional probability $P(E \mid H)$ is related to the p-value, with the nature of that relationship depending on the definitions chosen for a particular application.]
In Bayesian reasoning we start with a prior probability $P(H)$, a conditional probability $P(E \mid H)$, and the complementary pieces of information $P(\neg H)$ and $P(E \mid \neg H)$, [Footnote 14: The first of which is of course redundant with $P(H)$, since $H$ and $\neg H$ are complementary and $P(H) + P(\neg H) = 1$.] and use these to compute an update:
$$P(H \mid E) = \frac{P(H)\,P(E \mid H)}{P(H)\,P(E \mid H) + P(\neg H)\,P(E \mid \neg H)}.$$
In hypothesis testing, the only piece of data we have is the p-value $P(E \mid H)$. We can't compute $P(H \mid E)$ because we don't have the other required pieces of data.

If we have all three pieces of information then it makes sense to go ahead and compute the updated probability $P(H \mid E)$; however if we do not have access to those last two numbers, or their accuracy is very low, it can be more useful to directly report the p-value $P(E \mid H)$ and not attempt to compute $P(H \mid E)$.
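To make the dependence concrete, here is a tiny sketch of that update; the specific numbers are arbitrary assumptions, chosen only to show that the same p-value can lead to very different posteriors depending on the prior and on $P(E \mid \neg H)$.

```python
# Bayes' theorem: combine a p-value P(E|H) with a prior P(H) and the
# alternative's likelihood P(E|not-H); the latter two must be supplied.
def posterior(p_h, p_e_given_h, p_e_given_not_h):
    joint_h = p_h * p_e_given_h
    joint_not_h = (1 - p_h) * p_e_given_not_h
    return joint_h / (joint_h + joint_not_h)

# The same p-value of 0.03 under two different priors:
print(posterior(0.5, 0.03, 0.5))    # about 0.057
print(posterior(0.99, 0.03, 0.5))   # about 0.856
```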
In primary scientific research, those two numbers are often inaccessible due to two features of the null hypothesis: it may be non-probabilistic, and it may be very narrow and asymmetric.
Suppose $H$ is a statement like "it will rain tomorrow", and we want to estimate the probability of tomorrow's weather based on observing today's weather. It makes a lot of sense to start with a prior probability $P(H)$ (based on historic rainfall frequency) and update it appropriately. But if $H$ is a statement like "neutrinos are massless" or "fracking does not influence earthquakes" then it is not meaningful to speak of probabilities like $P(H)$ or $P(H \mid E)$. [Footnote 15: Note that $P(E \mid H)$ is still meaningful, so long as the observation is probabilistic in nature. We could interpret $P(H)$ to mean our level of confidence in $H$, i.e., as information content, but this often has more to do with the mental state of the researcher than with $H$.]
Second, Bayes' theorem is fully symmetric between $H$ and $\neg H$, but frequently in scientific research these hypotheses are not symmetric. The classic example is testing if a coin is fair: suppose we observe 30 heads in 100 coin flips (iid), and we want to test the null hypothesis $H$ that the coin is fair. Here $H$ is a very narrow and specific claim that lets us easily compute a p-value $P(E \mid H)$. The negative $\neg H$ is unspecific: the coin has some bias, but it could be any nonzero amount. The probability $P(E \mid \neg H)$ depends on how biased the coin is, so we need additional information like a probability distribution for the amount of bias. This is a lot to ask for when we don't even know if the coin is biased at all!
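To see how much heavy lifting that extra information does, here is a sketch that simply assumes a uniform distribution over the unknown bias (an arbitrary choice on my part, which is exactly the problem) and averages the binomial likelihood over it:

```python
from math import comb

n, k = 100, 30
p_e_given_h = comb(n, k) * 0.5 ** n   # fair coin: about 2.3e-5

# P(E | not-H) under an *assumed* uniform prior on the bias q, found by
# numerically averaging the binomial likelihood over q in (0, 1).
steps = 10_000
p_e_given_not_h = sum(
    comb(n, k) * q**k * (1 - q) ** (n - k)
    for q in (i / steps for i in range(1, steps))
) / (steps - 1)

print(p_e_given_h, p_e_given_not_h)   # about 2.3e-5 vs about 0.0099
```

A different assumed distribution over the bias would give a different $P(E \mid \neg H)$, and nothing in the problem tells us which distribution to use.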
Finally, even if we had this extra data and could compute $P(H \mid E_1 \land \cdots \land E_n)$, frequently that is less useful than reporting raw p-values. Suppose you are doing secondary research, and want to estimate $P(H \mid E_1 \land \cdots \land E_n)$; you find that primary research into the effects of $H$ has identified a series of unrelated observations $E_1, \ldots, E_n$ that give information about $H$. Each of these observations has been made by a different research team with its own speciality. If each team reported their own estimate for $P(H \mid E_i)$, it would be a troublesome and error-prone process to combine these estimates into a single value for $P(H \mid E_1 \land \cdots \land E_n)$: each team will have a different estimate for the prior $P(H)$ with different assumptions; each team would incorporate different evidence (perhaps the researchers investigating the $E_1$ phenomenon were unaware of the existence of $E_2$ and omitted it entirely; or the researchers incorporated some other evidence $E'$ that was later found to be unreliable). For you to compute $P(H \mid E_1 \land \cdots \land E_n)$ would require first undoing all the computations that each team did to reconstruct the underlying p-values $P(E_i \mid H)$; much simpler if they just reported these p-values directly.
The reason primary researchers report p-values is that this is
usually the natural end point of their research; synthesizing the
p-values of many different observations into a single posterior
probability is the job of secondary research. Each probability $P(E_i \mid H)$ might involve a completely different physical process and speciality, so it is most suitable to have each term investigated separately by experts in the appropriate field.
(Todo; might add some explanatory text on how to convert between Bayes factors and p-values)
Recall from the first section that the hypothesis testing method involves three steps: choosing a test statistic, calculating a p-value, and choosing a threshold for the p-value. We slightly glossed over the second step, giving one way to normalize the test statistic by applying its cdf. Choosing the normalization method is every bit as important as choosing the test statistic (though usually obvious once the test statistic has been chosen); as with the choice of test statistic, any choice is valid so long as the result is uniformly distributed in the range [0, 1] under the null hypothesis, but not every choice has the same statistical power (i.e., false positive rate).
Recall that we want the p-value to be as low as possible when conditioned on $\neg H$; this maximizes the chance of a correct rejection of $H$, since we reject when the p-value is below the threshold. Therefore when normalizing, we first sort all possible values of the test statistic by how likely they are under the condition of $\neg H$. Thus, the most likely outcomes will have the lowest p-values.

Slightly more carefully, what we are sorting by is how informative each possible value $t$ is in favor of $\neg H$ over $H$; that is, we are sorting by the Bayes factors $P(t \mid \neg H)/P(t \mid H)$.
For example, suppose our null hypothesis $H$ is that a coin is fair, and we observe that the coin had 30 heads out of 100 flips. Our test statistic is the number of heads, which when conditioned on $H$ is approximately normally distributed with a mean of 50 and a standard deviation of 5. We can normalize a normal distribution into a uniform distribution by applying the cdf; then 0 heads gives a p-value of 0, 50 heads gives a p-value of 0.5, and 100 heads gives a p-value of 1. What is the p-value of 30 heads? This is 4 standard deviations below the mean, which we can look up in a standard normal table as a percentile of 0.00003; that is our p-value, and we can feel confident in rejecting the null hypothesis that the coin is fair.
While this worked okay, depending on the application this was not the best choice of normalization. If instead we had observed 70 heads out of 100, we would have gotten a p-value of 0.99997, and we would fail to reject the null hypothesis even though we know intuitively that the observation contains enough information to do so. We would have done better if we had sorted the possible observations by how informative they are in rejecting the null hypothesis. Here finding 0 or 100 heads is the most informative, so they get the lowest p-values, followed by 1 or 99 heads, then 2 or 98 heads, and so on, ending with 50 heads getting assigned a p-value of 1. As before, this is done so that the result is uniformly distributed in the range 0 to 1. Now if we observe 70 heads, this has a higher p-value than any observation of 0 to 29 or 71 to 100 heads, but a lower p-value than 31 to 69 heads; therefore it gets a p-value of 0.00006, [Footnote 16: The total probability of seeing 0 to 29 or 71 to 100 heads, conditioned on $H$, is 0.00006.] again comfortably rejecting the null hypothesis that the coin is fair.
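In code, using the same normal approximation as above (mean 50, standard deviation 5 under $H$), a sketch of the two normalizations looks like this:

```python
from statistics import NormalDist

null = NormalDist(50, 5)   # the number of heads, conditioned on H

def one_tailed(heads):     # sorted by how few heads we saw
    return null.cdf(heads)

def two_tailed(heads):     # sorted by distance from 50, either side
    return 2 * null.cdf(50 - abs(heads - 50))

print(one_tailed(30), two_tailed(30))   # about 0.00003, about 0.00006
print(one_tailed(70), two_tailed(70))   # about 0.99997, about 0.00006
```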
As we are sorting the observations by their Bayes factors $P(t \mid \neg H)/P(t \mid H)$, this sorting depends on the choice of alternate hypothesis $\neg H$. If $\neg H$ is that the coin has any nonzero bias, then the sorting method we just used is appropriate, and is called a two-tailed test; but if $\neg H$ is more specifically that the coin is biased in favor of tails, then the observation of 70 heads out of 100 does not significantly favor either the null or the alternative hypothesis, and so gets the p-value of 0.99997 we had originally calculated. This is the one-tailed test. For example, when testing a cancer medication in rats, our null hypothesis is that it has no effect, and our alternate is that it reduces cancer rate, so a one-tailed test is appropriate in that we would not draw any conclusions from an observation of it increasing cancer rates. [Footnote 17: And also in that rats, unlike certain biased coins, have one tail.] This was an actual question I got from a cancer researcher who knew how to perform the calculations for a one-tailed and two-tailed t-test but not which one was appropriate, and only one of the results was "significant". [Footnote 18: Apparently the group's statistician was on vacation at the time; why the researcher thought to ask a 19-year-old kid from a foreign country is unclear. Probably the better answer would have been that the magical 5% confidence threshold is arbitrary and no great import should be assigned to whether your results fall above or below that line.]
(A little subtlety: how do we sort by the Bayes factors if we can't compute them without choosing a specific bias for $\neg H$? For the one-tailed coin test, it does not matter; the sorting is the same for any possible bias even though the actual values are not. For the two-tailed test, we have to assume that bias in favor of heads is equally likely as bias in favor of tails, and maybe some further assumptions.)
While frequently choosing how to normalize the test statistic amounts to simply choosing between a one-tailed or two-tailed test, in principle it could be any possible normalization scheme: maybe the two tails could be weighted differently, or the sorting goes from inside out, or even numbers come before odd numbers, etc etc. The test statistic doesn’t even have to be a number – all that is required is that it can be sorted by the Bayes factors.
(Addendum. I couldn’t find a decent splash image for this post online, so I asked a bot to draw a picture of “hypothesis testing” and ended up with this mess.)