Bayes’ Theorem was conceived by the Reverend Thomas Bayes, an 18th-century British statistician and philosopher who attempted to articulate how to make predictions based on our changing beliefs. Bayes’ Theorem is a machine that converts evidence in predictions. To understand it, we will need to be able to read its notation.
Here’s how to read Bayesian notation:
When you see P(A)
, that means “the probability that A is true.”
When you see P(A|B)
, that means “the probability that A is true given that B is true.”
In this case, it’s easiest to think of B as the symptom and A as the disease; i.e. B is a skin rash that includes tiny white spots, and A is the probability of the measles. So we use easily visible (or perceiveable) events to calculate the probability of hidden events. You see one thing, and want to predict what you can’t see.
It turns out that the the probabilities of A and B are related to each other in the following manner:
That is Bayes Theorem: that you can use the probability of one thing to predict the probability of another thing. But Bayes Theorem is not a static thing. It’s a machine that you crank to make better and better predictions as new evidence surfaces.
An interesting exercise is to twiddle the variables by assigning different speculative values to P(B) or P(A) and consider their logical impact on P(A | B). |
For example, if you increase the denominator P(B)
on the right, then P(A|B)
goes down. Concrete example: A runny nose is a symptom of the measles, but runny noses are far more common than skin rashes with tiny white spots. That is, if you choose P(B)
where B is a runny nose, then the frequency of runny noses in the general population decreases the chance that runny nose is a sign of measles. The probability of a measles diagnosis goes down with regard to symptoms that become increasingly common; those symptoms are not strong indicators.
Likewise, as measles become more common and P(A)
goes up in the numerator on the right, P(A|B)
goes up necessarily, because the measles are just generally more likely regardless of the symptom that you consider.
Maybe you’ve played a party game called Werewolf.^{1}
It’s what they call a “hidden role” game, because roles are assigned to you and your fellow players, but nobody knows what anyone else is.^{2} You take a group of, say, 10 players and divide them into two roles – werewolves and villagers. Everyone draws slips of paper from a hat, closes their eyes, and then the werewolves are instructed to wake up to know who’s on the werewolf team.
Since only the werewolves know that they are werewolves, everyone in the game claims to be a villager. As the game moves through artificial cycles of day and night, each night the werewolves wake up and kill one villager, and each day the villagers try to figure out who the werewolves are, and vote to lynch one player based on their suspicians. Good, clean fun.
So a great way to learn Bayes Theorem, and in particular the implementation called Naive Bayes, is to think about how you track down a werewolf. So A
is the variable you are trying to predict: It is a categorical variable that can be either A1 == werewolf
or A2 == villager
. Your job in this game is to act as a binomial classifier. B
would be all the symptoms of werewolfiness, which in this case mean the symptoms of deception, since the werewolves only win by hiding their identity from the villagers as they slowly kill those poor people off.
The moderator of the game tells you how many werewolves there are, that’s P(A1)
, but doesn’t tell you who. Over the course of your life, you’ve learned that people do certain things when they’re nervous, lying to you and afraid of getting caught.
So you start looking for symptoms you might use to diagnose a werewolf: Are they shifting in their seat? p(B1=Shifting|A1)
Are they avoiding eye contact? p(B2=AvoidEyeContact|A1)
Are they touching their face or hair with their fingers? p(B3=ActiveFingers|A1)
That is, instead of using the symptoms to diagnose the disease P(Measles|Rash)
, you’re flipping it around, using an assumed diagnosis to predict the probability of symptoms. This is the generative side of Bayes.
What is the probability that you exhibit a combination of these symptoms if you are a werewolf? It’s P(B1, B2, B3 | A1)
.
And so your prediction that someone is a werewolf or a villager would be: Prediction = argmax_A[P(B1, B2, B3| A) * P(A)]
for all A.
The only problem is that, as the number of features you analyze increases (i.e. B grows larger), P(B1, B2, B3,
B4 | A)
gets very hard to compute without a large sample. You would have to see each, exact combination of features many times to calculate their probilities. That’s where the “naive” part of Naive Bayes comes in: Naive Bayes assumes all inputs are conditionally independent.
Instead of trying to find the probability of B1, B2, B3 etc. occurring simultaneously for a certain label, you calculate the probability of each feature occurring individually given that label: P(B1, B2, B3, B4| A1) = P(B1|A1) * P(B2|A1) * P(B3|A1) * P(B4|A1)
We assume their combinations don’t matter. Each symptom has a relationship with the probability of werewolf, but combinations of those symptoms do not increase that probability. Why do we call that assumption “naive”?
The features avoiding eye contact
and usually friendly
on their own might not tell you a lot about whether the person in question is a werewolf. P(B1|A1) * P(B2|A1)
But if you know simultaneously that a person who is usually friendly has suddenly fallen quiet and is avoiding all eye contact, common sense dictates that your expectation of that person being a werewolf should increase. P(B1, B2|A1)
.
So Naive Bayes discards information in favor of computational efficiency, a tradeoff we’re forced to make with other algorithms like convolutional networks as well.
Bayesian networks are graphical models that use Bayesian inference to compute probability. They model conditional dependence and causation. In a Baysian Network, each edge represents a conditional dependency, while each node is a unique variable. Bayesian networks were invented by Judea Pearl in 1985. Bayesian Networks were a particularly popular approach to machine learning problems in the 1990s.
Bayesian Network can be used to model any number of causal relationships. Instead of tackling werewolves and the symptoms of deceipt, you could try to model all the possible causes of your front lawn being wet, which would include rain and sprinklers and leaky fire hydrants and kids with squirt guns (this scenario is, in fact, the hello world of Bayesian Networks).
1) Thanks to Andy Seow for this scenario.
2) It’s interesting to consider that medical doctors are forced to play a “hidden role” game with disease, which doesn’t announce itself by name but only indirectly by symptoms. And indeed, most of the most important things that happen in the world don’t happen when you’re there, but rather in your absence, distant in space and time, and you are left to diagnose them by their visible repercussions. As one wag put it: “Space is the reason why everything doesn’t happen to you.”