When to use Binomial versus Beta distribution?

And what is the point of probability distributions anyway?

Tarek Amr
Sep 6, 2022

A footballer is known to score 70% of the penalty kicks he shoots. In the next season we expect him to shoot 10 penalty kicks. How many of them will he score?

He will score 7 out of the 10 penalty kicks, obviously!

Actually, 10 penalty kicks is too small a number to draw a definite conclusion from. This obvious 7 could turn out to be 8 or 9 with a bit of luck, or he could miss a couple of unexpected penalties and the 7 turns out to be 5. Obviously, huh?

With such a small number, there are hardly any obvious answers. Instead, we need to express our belief in the form of a distribution.

And in this case, it is a binomial distribution that we need.

Image created by the author

This is what the above distribution represents:

Say we manage to convince this football player to shoot 10 penalty kicks, then ask him to shoot another 10, then another, up to 1,000 sets of 10 penalty kicks. Each time we calculate how many shots out of the 10 he scored, and create a histogram out of it. That’s what we have here.
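The thought experiment above can be sketched as a small simulation. This is only an illustration using Python's standard library: the function name `simulate_sets` and the fixed seed are my own choices, and each kick is assumed to be independent of the others with the same 70% chance of going in.

```python
import random
from collections import Counter

def simulate_sets(p=0.7, kicks_per_set=10, n_sets=1000, seed=42):
    """Count how many sets ended with 0, 1, ..., kicks_per_set goals."""
    rng = random.Random(seed)  # fixed seed (arbitrary) so reruns give the same counts
    goals_per_set = [
        # one set: each kick goes in with probability p, independently of the others
        sum(rng.random() < p for _ in range(kicks_per_set))
        for _ in range(n_sets)
    ]
    return Counter(goals_per_set)

counts = simulate_sets()
for goals in range(11):
    share = counts[goals] / 1000
    # crude text histogram: one '#' per percentage point
    print(f"{goals:2d} goals: {share:5.1%} {'#' * round(share * 100)}")
```

The bars should pile up around 7 goals and thin out on both sides, which is the shape the histogram describes.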

Keep in mind, the player can never score 7.7 penalty kicks, or 3.8 kicks. Only integers are allowed on the x-axis. That’s why the binomial distribution is a discrete probability distribution, and this histogram is called a probability mass function.

Rather than bothering you with confusing statistical terms like experiment, event and success, let’s use the vocabulary of our example to explain how the binomial distribution can be used.

When you know the probability (p) of a player scoring a penalty kick, you use the binomial distribution to express your belief in how likely they are to score (x) penalty kicks out of (k) kicks. As you can see, there are two parameters here, the probability (p) and the number of kicks the player will shoot (k), and from there you can plot the probability mass function in terms of the number of scored kicks (x).

As mentioned, because 10 is a very small number, we cannot be confident that the player will score 7 out of those 10 kicks. You can even see in the graph above that there is about a 10% chance that they will score only 5 kicks.
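To make the two parameters concrete, here is a minimal sketch of the probability mass function, written with the standard binomial formula C(k, x) · p^x · (1 − p)^(k − x). The function name `binomial_pmf` is just an illustrative choice, and the kicks are assumed to be independent.

```python
from math import comb

def binomial_pmf(x, k, p):
    """Probability of scoring exactly x out of k kicks with per-kick accuracy p."""
    return comb(k, x) * p**x * (1 - p)**(k - x)

k, p = 10, 0.7
pmf = [binomial_pmf(x, k, p) for x in range(k + 1)]
for x, prob in enumerate(pmf):
    print(f"P(x = {x:2d}) = {prob:.3f}")
```

The value for x = 5 comes out at roughly 0.10, matching the 10% read off the graph, and x = 7 is the single most likely outcome at roughly 0.27.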

But if k is a bigger number, say the player shoots 100 penalty kicks, don’t you think it is very unlikely that they will score only 50 out of them?

Image created by the author
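One way to check that intuition is to add up the binomial probabilities of scoring half of the kicks or fewer, once for k = 10 and once for k = 100. This is a sketch under the same independence assumption. The helper name `prob_at_most` is hypothetical, and "half or fewer" is my reading of "only 50".

```python
from math import comb

def prob_at_most(x_max, k, p):
    """Probability of scoring x_max kicks or fewer out of k."""
    return sum(comb(k, x) * p**x * (1 - p)**(k - x) for x in range(x_max + 1))

p = 0.7
for k in (10, 100):
    half = k // 2
    print(f"k = {k:3d}: P(score <= {half}) = {prob_at_most(half, k, p):.6f}")
```

With 10 kicks, scoring 5 or fewer happens about 15% of the time. With 100 kicks, scoring 50 or fewer becomes a tiny fraction of a percent.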

What’s the point of statistics anyway?

As you can see above, if a player shoots 10,000 penalty kicks in a season and we know that their accuracy is 70%, we probably don’t need to bother with all these distribution complications. We can safely assume that they will score about 7,000 of those 10,000 kicks, and we will very rarely be off by more than 1 or 2% (~ 100 / 7,000).

The main point of statistics is dealing with little data, where we cannot give definite answers because things may vary a lot, as in the case of the 10 kicks, where we may be off by 40%.
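The shrinking uncertainty can be put into numbers with two standard binomial facts: the expected number of goals is k · p and the standard deviation is sqrt(k · p · (1 − p)). The sketch below divides one by the other. Using the standard deviation as the yardstick for "how far off we may be" is my assumption rather than something the figures above state, and `relative_spread` is an illustrative name.

```python
from math import sqrt

def relative_spread(k, p):
    """Standard deviation of goals scored, as a fraction of the expected goals."""
    mean = k * p
    std = sqrt(k * p * (1 - p))
    return std / mean

for k in (10, 100, 10_000):
    print(f"k = {k:6,d}: expected {k * 0.7:7,.0f} goals, "
          f"one standard deviation is {relative_spread(k, 0.7):.1%} of that")
```

For 10 kicks one standard deviation is about a fifth of the expected 7 goals, while for 10,000 kicks it is well under 1%, so an error of 100 goals is already about two standard deviations out.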

The beta distribution

Now imagine asking a different question. This time we do not know the player’s accuracy, but we know that he took 10 kicks and scored 7 of them. What is the player’s accuracy?

His accuracy is 70%, obviously!

Come on! Didn’t we just agree that 10 is such a small number, and we shouldn’t jump to conclusions?!

Once again we need a distribution to represent our belief in the player’s accuracy and, yes, you guessed it, the beta distribution is the one we are looking for. It takes two parameters, a and b, which here are the number of successes (a = 7 penalties scored) and failures (b = 3 missed).

Image created by the author

Notice that probabilities can take any value between zero and one. So, unlike the binomial’s probability mass function, we can have fractions on the x-axis here. That’s why the beta distribution is a continuous probability distribution, and this graph is called a probability density function.
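As a rough sketch of what "density" means in practice, the snippet below evaluates the standard beta density with a = 7 and b = 3, then approximates the area under the curve between two accuracies. The helper names `beta_pdf` and `prob_between` are hypothetical, and the midpoint-rule integration is just a simple stand-in for a proper statistics library.

```python
from math import gamma

def beta_pdf(p, a, b):
    """Density of Beta(a, b) at accuracy p, for 0 < p < 1."""
    norm = gamma(a + b) / (gamma(a) * gamma(b))
    return norm * p**(a - 1) * (1 - p)**(b - 1)

def prob_between(lo, hi, a, b, steps=10_000):
    """Approximate area under the density between lo and hi (midpoint rule)."""
    width = (hi - lo) / steps
    return sum(beta_pdf(lo + (i + 0.5) * width, a, b) for i in range(steps)) * width

a, b = 7, 3  # 7 penalties scored, 3 missed
print(f"density at accuracy 0.7:   {beta_pdf(0.7, a, b):.2f}")
print(f"P(0.6 <= accuracy <= 0.8): {prob_between(0.6, 0.8, a, b):.2f}")
print(f"total area under curve:    {prob_between(0.0, 1.0, a, b):.2f}")
```

The density at 0.7 comes out above 1, which is fine for a density: only areas under the curve are probabilities. The area between 0.6 and 0.8 is roughly one half, so after 10 kicks we are still far from sure that the accuracy is close to 70%.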

Clearly, those two distributions are very useful. We can think of plenty of daily problems where we either need to express our belief in the number of successes, given that we know the success rate, or the other way round. Furthermore, the two distributions accompany each other in Bayesian inference, since the beta distribution happens to be the conjugate prior of the binomial, but that’s something for a future post.
