The Centrality of Bayes’s Theorem for Machine Learning

It’s hard to overstate just how important Bayes’s Theorem — something that Thomas Bayes, English minister and mathematician, came up with in the 1700s — is for making sense of machine learning. But understanding the theorem itself is non-trivial and challenges our intuitions. Here’s a brief introduction to why we need Bayes and its import for ML.

Consider the following problem:

  • You take a test for a disease that occurs in about 1 in 1,000 people. Let’s say the test is 90% accurate: positive 9 out of 10 times when a person has the disease (the sensitivity of the test) and negative 9 out of 10 times when a person doesn’t have the disease (the specificity of the test).

  • The test is positive. What’s the chance you have the disease, assuming you have been picked at random from the population?

Intuitively, most of us will say: well, the test is 90% accurate, so there’s a 90% chance I have the disease. We’d be wrong. We need to take into account some prior knowledge, such as the prevalence of the disease in the general population. We need Bayes’s theorem.

We have to calculate the probability of the hypothesis H (you have the disease), given the evidence E (the test is positive).

P(H | E) = (P(H) x P(E | H)) / P(E)

Where:

P(H): The probability of the hypothesis being true before seeing any evidence. This is the prior; here, it’s the prevalence of the disease, 1 in 1,000.
P(E | H): The probability of the evidence given the hypothesis, i.e. of the test being positive given that you have the disease. This is the sensitivity of the test.
P(E): The overall probability that the test is positive. By the law of total probability, this is P(H) x P(E | H) + P(not H) x P(E | not H), i.e. prevalence x sensitivity + (1 − prevalence) x (1 − specificity).

If you plug in all the numbers, P(H | E) = (0.001 x 0.9) / (0.001 x 0.9 + 0.999 x 0.1), you’ll get about 0.0089, meaning there is roughly a 0.89% chance you have the disease! That’s very, very different from the 90% our intuition suggested.
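To make the arithmetic concrete, here is a minimal Python sketch of the calculation; the posterior function and its argument names are illustrative, not from the book:

def posterior(prevalence, sensitivity, specificity):
    """P(H | E): probability of having the disease given a positive test, via Bayes's theorem."""
    p_e_given_h = sensitivity                       # P(E | H)
    p_e_given_not_h = 1.0 - specificity             # P(E | not H), the false-positive rate
    p_e = prevalence * p_e_given_h + (1.0 - prevalence) * p_e_given_not_h   # P(E)
    return prevalence * p_e_given_h / p_e           # P(H) x P(E | H) / P(E)

print(posterior(prevalence=0.001, sensitivity=0.9, specificity=0.9))
# prints roughly 0.0089, i.e. about a 0.89% chance of having the disease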

Tweak the prevalence rate, the sensitivity and the specificity of the test and this chance will increase or decrease.
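For example, reusing the illustrative posterior function from the sketch above:

print(posterior(prevalence=0.01, sensitivity=0.9, specificity=0.9))     # ~0.083: a more common disease
print(posterior(prevalence=0.001, sensitivity=0.99, specificity=0.99))  # ~0.090: a much more accurate test

Even a 99%-accurate test for a 1-in-1,000 disease leaves you with only about a 9% chance of actually having it.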

Why is Bayes’s Theorem important for Machine Learning? Let’s say you are trying to classify a patient as being at-risk or not-at-risk, given some data, x.

We need to calculate the probability of y = “at-risk” given x, and the probability of y = “not-at-risk” given x.

P (y = “at-risk” | x) and P (y = “not-at-risk” | x)

Calculating these probabilities requires Bayes’s theorem (given some assumptions about the underlying distribution of the data). The classifier will pick the class with the higher probability. Of course, because these are probabilities, the prediction can be wrong, so there’s a certain risk of making an error. It turns out that this is the lowest possible risk of error: the classifier, called the Bayes Optimal Classifier, is the best an ML algorithm can do. Any other method can only come close to this risk; it can never improve upon it.
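As a concrete illustration, here is a minimal Python sketch of such a classifier. It assumes Gaussian class-conditional distributions with known means and standard deviations; those distributional assumptions and all the parameter values are illustrative, not from the book:

import math

def gaussian_pdf(x, mean, std):
    """Density of a normal distribution, used here as the assumed P(x | y)."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))

def classify(x, priors, class_params):
    """Pick the class y with the larger P(y | x), which is proportional to P(x | y) x P(y)."""
    scores = {y: gaussian_pdf(x, *class_params[y]) * priors[y] for y in priors}
    return max(scores, key=scores.get)

priors = {"at-risk": 0.1, "not-at-risk": 0.9}                            # P(y), assumed
class_params = {"at-risk": (140.0, 15.0), "not-at-risk": (110.0, 15.0)}  # (mean, std) of x given y, assumed

print(classify(125.0, priors, class_params))   # prints the class with the higher posterior

When the true priors and class-conditional distributions are known exactly, this decision rule attains the Bayes-optimal error rate; in practice, they have to be estimated from data.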

Understanding this viscerally is key to appreciating the power of Probability (and Statistics, of course) in Machine Learning. All of this and more is explained in WHY MACHINES LEARN.
