Marginal, Joint and Conditional Probabilities

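When it comes to probability, the two fundamental rules everyone needs to remember are 1) the sum rule and 2) the product rule. Mathematically, they can be written as

$$p(X) = \sum_{Y} p(X, Y) \qquad \text{(sum rule)}$$

$$p(X, Y) = p(Y \mid X)\, p(X) \qquad \text{(product rule)}$$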

These formulas involve three kinds of terms. $p(X, Y)$ is a joint probability, read as “the probability of X and Y”. Similarly, $p(Y \mid X)$ is a conditional probability, read as “the probability of Y given X”, whereas $p(X)$ is a marginal probability, simply “the probability of X”. These two simple rules form the basis for all of the probabilistic machinery that we encounter when studying machine learning models.

Once we derive these two rules, it becomes easy to tell joint and conditional probabilities apart. To derive them, let's take the figure given below:

We can derive the sum and product rules of probability by considering two random variables: $X$, which takes the values $\{x_i\}$ where $i = 1, \ldots, M$, and $Y$, which takes the values $\{y_j\}$ where $j = 1, \ldots, L$. In this illustration we have $M = 5$ and $L = 3$. If we consider a total number $N$ of instances of these variables, then we denote the number of instances where $X = x_i$ and $Y = y_j$ by $n_{ij}$, which is the number of points in the corresponding cell of the array. The number of points in column $i$, corresponding to $X = x_i$, is denoted by $c_i$, and the number of points in row $j$, corresponding to $Y = y_j$, is denoted by $r_j$.

The probability that $X$ will take the value $x_i$ and $Y$ will take the value $y_j$ is written $p(X = x_i, Y = y_j)$ and is called the joint probability of $X = x_i$ and $Y = y_j$. It is given by the number of points falling in cell $i,j$ as a fraction of the total number of points, and hence

$$p(X = x_i, Y = y_j) = \frac{n_{ij}}{N}$$

Similarly, the probability that $X$ takes the value $x_i$ irrespective of the value of $Y$ is written as $p(X = x_i)$ and is given by the fraction of the total number of points that fall in column $i$, so that

$$p(X = x_i) = \frac{c_i}{N}$$

Now, since the number of instances in column $i$ is just the sum of the number of instances in each cell of that column, we can write

$$c_i = \sum_{j=1}^{L} n_{ij}$$

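Substituting this expression for $c_i$ into $p(X = x_i) = c_i / N$, and using $p(X = x_i, Y = y_j) = n_{ij} / N$, we get

$$p(X = x_i) = \sum_{j=1}^{L} \frac{n_{ij}}{N} = \sum_{j=1}^{L} p(X = x_i, Y = y_j)$$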

which is the sum rule of probability. Note that $p(X = x_i)$ is sometimes called the marginal probability, because it is obtained by marginalizing, or summing out, the other variables (in this case $Y$).
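As a quick numerical check of the sum rule, here is a minimal sketch in Python. The table of counts is made up purely for illustration (it is not the data from the figure); it only matches the shape used in this article, with 5 values of X and 3 values of Y. The variable names are likewise my own.

```python
from fractions import Fraction

# Hypothetical counts n_ij, stored as counts[j][i]:
# rows are the L = 3 values of Y, columns are the M = 5 values of X.
counts = [
    [2, 4, 1, 0, 3],
    [1, 6, 5, 2, 1],
    [0, 3, 4, 6, 2],
]
L, M = len(counts), len(counts[0])
N = sum(sum(row) for row in counts)  # total number of instances

# Joint probability p(X = x_i, Y = y_j) = n_ij / N
joint = [[Fraction(counts[j][i], N) for i in range(M)] for j in range(L)]

# Column totals c_i and the marginal p(X = x_i) = c_i / N
col_totals = [sum(counts[j][i] for j in range(L)) for i in range(M)]
marginal_x = [Fraction(c, N) for c in col_totals]

# Sum rule: summing the joint over every value of Y gives the same marginal
marginal_from_joint = [sum(joint[j][i] for j in range(L)) for i in range(M)]

assert marginal_from_joint == marginal_x
assert sum(marginal_x) == 1  # a probability distribution sums to one
print([str(p) for p in marginal_x])
```

Using `Fraction` instead of floats keeps the arithmetic exact, so the equality checks hold without any rounding tolerance.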

If we consider only those instances for which $X = x_i$, then the fraction of such instances for which $Y = y_j$ is written $p(Y = y_j \mid X = x_i)$ and is called the conditional probability of $Y = y_j$ given $X = x_i$. It is obtained by finding the fraction of those points in column $i$ that fall in cell $i,j$ and hence is given by

$$p(Y = y_j \mid X = x_i) = \frac{n_{ij}}{c_i}$$

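Combining this with the earlier expressions for the joint probability, $n_{ij} / N$, and the marginal probability, $c_i / N$, we get

$$p(X = x_i, Y = y_j) = \frac{n_{ij}}{N} = \frac{n_{ij}}{c_i} \cdot \frac{c_i}{N} = p(Y = y_j \mid X = x_i)\, p(X = x_i)$$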

which is the product rule of probability.
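The product rule can be checked the same way. The sketch below again uses a made-up table of counts (illustrative only, not the figure's data) and assumes every column total is non-zero, since otherwise the conditional probability for that column is undefined.

```python
from fractions import Fraction

# Hypothetical counts n_ij, stored as counts[j][i]:
# rows are the L = 3 values of Y, columns are the M = 5 values of X.
counts = [
    [2, 4, 1, 0, 3],
    [1, 6, 5, 2, 1],
    [0, 3, 4, 6, 2],
]
L, M = len(counts), len(counts[0])
N = sum(sum(row) for row in counts)
col_totals = [sum(counts[j][i] for j in range(L)) for i in range(M)]

# Conditional probability p(Y = y_j | X = x_i) = n_ij / c_i
# (assumes c_i > 0 for every column, which holds for this table)
cond_y_given_x = [
    [Fraction(counts[j][i], col_totals[i]) for i in range(M)] for j in range(L)
]

# Product rule: p(X = x_i, Y = y_j) = p(Y = y_j | X = x_i) * p(X = x_i)
for i in range(M):
    p_x = Fraction(col_totals[i], N)
    for j in range(L):
        p_joint = Fraction(counts[j][i], N)
        assert p_joint == cond_y_given_x[j][i] * p_x

# For a fixed x_i, the conditional probabilities over Y sum to one
for i in range(M):
    assert sum(cond_y_given_x[j][i] for j in range(L)) == 1
```

Note that each column of `cond_y_given_x` is itself a probability distribution over Y, whereas the joint probabilities sum to one only over the whole table. That is the practical difference between the two quantities.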

A joint probability is the probability of X and Y, so it is symmetric: $p(X, Y) = p(Y, X)$. Applying the product rule to both sides of this symmetry gives $p(Y \mid X)\, p(X) = p(X \mid Y)\, p(Y)$, and dividing by $p(X)$ yields the following relationship between conditional probabilities

$$p(Y \mid X) = \frac{p(X \mid Y)\, p(Y)}{p(X)}$$

which is called Bayes’ theorem and plays a central role in pattern recognition and machine learning.
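To see Bayes' theorem at work on counts, the sketch below computes p(Y | X) in two ways: directly as the fraction of a column that falls in a given cell, and indirectly from p(X | Y), p(Y) and p(X). The table is again hypothetical, and all row and column totals are assumed to be non-zero.

```python
from fractions import Fraction

# Hypothetical counts n_ij, stored as counts[j][i]:
# rows are the L = 3 values of Y, columns are the M = 5 values of X.
counts = [
    [2, 4, 1, 0, 3],
    [1, 6, 5, 2, 1],
    [0, 3, 4, 6, 2],
]
L, M = len(counts), len(counts[0])
N = sum(sum(row) for row in counts)
row_totals = [sum(row) for row in counts]                             # r_j
col_totals = [sum(counts[j][i] for j in range(L)) for i in range(M)]  # c_i

def posterior_direct(j, i):
    """p(Y = y_j | X = x_i) read straight off the table as n_ij / c_i."""
    return Fraction(counts[j][i], col_totals[i])

def posterior_bayes(j, i):
    """p(Y = y_j | X = x_i) via Bayes' theorem: p(X | Y) p(Y) / p(X)."""
    p_x_given_y = Fraction(counts[j][i], row_totals[j])  # n_ij / r_j
    p_y = Fraction(row_totals[j], N)                     # r_j / N
    p_x = Fraction(col_totals[i], N)                     # c_i / N
    return p_x_given_y * p_y / p_x

# Both routes agree for every cell of the table
for j in range(L):
    for i in range(M):
        assert posterior_direct(j, i) == posterior_bayes(j, i)

print(posterior_bayes(1, 1))
```

With counts in hand the direct route is obviously simpler. Bayes' theorem earns its keep when only p(X | Y) and p(Y) are available, which is the usual situation in machine learning models.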

Further reading:

Although independence is not the subject of this article, I would strongly recommend learning about independence and conditional independence. They are very helpful when working through numerical problems, where they make the subtle differences between these probabilities much easier to see.

Artificial Intelligence at IISc