# Theoretical Machine Learning: Probabilistic and Statistical Math

A quick summary of probabilistic math used in machine learning.

- Quick Probability Overview
- Gaussian Distribution
- Why was all this important?
- Set Prior on the variables also

I will start with a quick overview of probability and then dive into details of Gaussian distribution. In the end I have provided links to some of the theoretical books that I have read for the concepts mentioned here.

Why is there a need for uncertainty (probability) in machine learning (ML)? Generally when we are given a dataset and we fit a model on that data what we want to do actually is capture the property of dataset i.e. the underlying regularity in order to generalize well to unseen data. But the individual observations are corrupted by random noise (due to sources of variability that are themselves unobserved). This is what is termed as **Polynomial Curve Fitting** in ML.

In curve fitting, we usually use the **maximum likelihood approach** which suffers from the problem of overfitting (as we can easily increase the complexity of the model to learn the train data).

To overcome overfitting we use **regularization techniques** by penalizing the parameters. We do not penalize the bias term as it’s inclusion in the regularization causes the result to depend on the choice of origin.

## Quick Probability Overview

Consider an event of tossing a coin. We use events to describe various possible states in our universe. So we represent the possible states as *X = Event that heads come and Y = Event that tails come*. Now P(X) = Probability of that particular event happening.

Takeaway, represent your probabilities as events of something happening.

In the case of ML, we have P(X = x) which means, the probability of observing the value x for our variable X. Note I changed from using the word event to a variable.

Next, we discuss **expectation**. One of the most important concepts. When we say E[X] (expectation of X ) we are saying, that we want to find the average of the variable X. As a quick test, what is the E[x] where x is some observed value?

Suppose x takes the value 10, the average of that value is the number itself.

Next, we discuss **variance**. Probably the most frustrating part of an ML project when we have to deal with large variances. Suppose we found the expectation (which we generally call mean) of some values. Then variance tells us the average distance of each value from the mean i.e. it tells how spread a distribution is.

Another important concept in probabilistic maths is the concept of **prior**, **posterior** and **likelihood**.

Prior = P(X). It is probability available before we observing an event\

Posterior = P(X|Y). It is the probability of X after event Y has happened.\

Likelihood = P(Y|X). It tells how probable it is for the event Y to happen given the current settings i.e. X\

Now when we use maximum likelihood we are adjusting the values of X to get to maximize the likelihood function P(Y|X).

**Tip:**Follow the resource list given at the end for further reading of the topic.

But before that let us discuss why we need to know about **probability distributions** in the first place?

When we are given a dataset and we want to make predictions about new data that has not been observed then what we essentially want is a formula in which we pass the input value and get the output values as output.

Let’s see how we get to that formula. Initially, we are given input data (X). Now, this data would also come from a formula, which I name probability distribution (the original distribution is although corrupted from random noises). So we set a prior for the input data. And this prior comes in the form of a probability distribution (Gaussian in most cases).

**Note:**From a purely theoretical perspective we are simply following Bayesian statistics and filling in values to the Baye’s Rule.

As a quick test, if we already know a probability distribution for the input data then why we need to make complex ML models?

After assuming an input distribution we then need to assume some complexity over the model that we want to fit on the input data. Why do we want to do this? Simply because there are infinitely many figures that we draw that would pass through the given two points. It is up to us to decide whether the figure is linear, polynomial.

Now in any ML model, we have weights that we have to finetune. We can either assume constant values for these weights or we can go even a bit deeper and assume a prior for these weights also. More on this later.

**Note:**In this post, I am approaching ML from a statistics point of view and not the practical way of using backpropagation to finetune the values of the weights.

Now I present a general way of approaching the problem of finding the values for the variables of a prior.

The Gaussian Equation is represented as follow

$$\mathcal{N}(x|\mu,\sigma^2) = \frac{1}{(2\pi\sigma^2)^{1/2}}exp\{-\frac{1}{2\sigma^2}(x-\mu)^2\}$$

Now when we assume a Gaussian prior for input data we have to deal with two variables namely the mean and variance of the distribution and it is a common hurdle when you assume some other distribution.

So how to get the value of these variables. This is where maximum likelihood comes into play. Say you observed **N** values as input. Now all these **N** values are i.i.d. (independently identically distributed), so the combined joint probability (likelihood function) can be represented as

$$p(x|\mu,\sigma^2)=\prod_{n=1}^{N}\mathcal{N}(x_n|\mu,\sigma^2)$$

After getting the likelihood function, we now want to maximize this function with respect to the variables one by one. To make our life easier we usually take the log of the above value as log is a monotonically increasing function.

$$\ln p(x|\mu,\sigma^2)=-\frac{1}{2\sigma^2}\sum_{n=1}^{N}(x_n-\mu)^2-\frac{N}{2}\ln \sigma^2-\frac{N}{2}\ln (2\pi)$$

Now take the derivative w.r.t the variables and get their values.

$$\mu_{ML}=\frac{1}{N}\prod_{n=1}^{N}x_n$$ $$\sigma^2_{ML}=\frac{1}{N}\prod_{n=1}{N}(x_n-\mu_{ML})^2$$

**Note:**In a fully Bayesian setting the above two values represent the prior for that variables and we can update them as we observe new data.

## Why was all this important?

All of you may have heard about the MSE (mean squared error) loss function, but you may not be able to use that loss function for every situation as that loss function is derived after assuming the Gaussian prior on the input data. Similarly, other loss functions like Softmax are also derived for that particular prior.

In cases where you have to take a prior like Poisson MSE would not be a good metric to consider.

Another design choice that you can make is by assuming that the variable values also follow a probability distribution. How to choose that distribution?

Ideally, you can choose any distribution. But practically you want to choose a **conjugate prior** for the variables. Suppose your prior for the input data is Gaussian. And now you want to select a prior for the mean. You must choose a prior such that after applying the Baye’s rule the resulting distribution for the input data is still Gaussian and same for variance also. These are called conjugate priors.

Just for a reference, conjugate prior for the mean is also Gaussian and for the variance and inverse gamma for the variance.

Congratulations if you made it to the end of the post. I rarely scratched the surface but I tried to present the material in a more interactive manner focused more on building intuition.

Here is the list of resources that I would highly recommend for learning ML:-

- CS229: Machine Learning by Stanford
- Pattern Recognition and Machine Learning by Christopher M. Bishop
- An Introduction to Statistical Learning: with Applications in R
- The Elements of Statistical Learning Data Mining, inference and Prediction