Things to Note
Maximum Likelihood Estimation (MLE) is a statistical method that estimates the parameters of a model or distribution by maximizing the likelihood function. The likelihood function is a function of the parameters that quantifies the probability of observing some fixed data given those parameters.
Parametric Family: A collection of distributions that share a common mathematical form and set of parameters, but differ in the values of those parameters.
Examples of parametric families include the Bernoulli distribution (parameterized by $\theta$) and the Gaussian distribution (parameterized by $\mu$ and $\sigma^2$), both of which appear later in this post.
Closed-Form Solution: Can be written explicitly in terms of known functions and operations.
Numerical Method: Computational technique used to approximate solutions.
Assume we have a biased coin whose probability of heads is unknown. We can define a Bernoulli distribution with parameter $\theta$ that models the probability of heads.
Problem: Suppose we flip the coin 5 times and observe 4 heads and 1 tail. We want to find the probability of observing this data $D$ given a parameter $\theta$, written $P(D \mid \theta)$. It may seem obvious that $\hat{\theta}$ should be $4/5$. However, we would like a more principled and general mathematical approach to deal with more complex models (other parametric families).
For example, which is more likely: $\theta = 0.5$ or $\theta = 0.8$?
If $\theta = 0.5$, then $P(D \mid \theta) = 0.5^4 \cdot 0.5 = 0.03125$.
If $\theta = 0.8$, then $P(D \mid \theta) = 0.8^4 \cdot 0.2 = 0.08192$.
We can do this for every value of $\theta$ and find the $\theta$ that maximizes $P(D \mid \theta)$. This is the maximum likelihood estimate.
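This grid search can be sketched in a few lines of Python (the data, 4 heads and 1 tail, and the grid resolution are illustrative choices):

```python
import numpy as np

def likelihood(theta, n_heads, n_tails):
    """P(D | theta) for i.i.d. coin flips: theta^heads * (1 - theta)^tails."""
    return theta**n_heads * (1 - theta)**n_tails

# Evaluate the likelihood on a grid of candidate theta values (4 heads, 1 tail).
thetas = np.linspace(0.01, 0.99, 99)
vals = likelihood(thetas, n_heads=4, n_tails=1)

theta_hat = thetas[np.argmax(vals)]
print(theta_hat)  # the grid point closest to 4/5
```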
Parameter estimation by MLE solves problems where we are given $n$ i.i.d. samples $D = \{x_1, \ldots, x_n\}$ drawn from a distribution $p(x \mid \theta)$ with an unknown parameter $\theta$.
The likelihood function is defined as:
$$L(\theta) = P(D \mid \theta) = \prod_{i=1}^{n} p(x_i \mid \theta)$$
Recall that the probabilities are multiplied together because the samples are i.i.d.
Now that we have a likelihood function, we can reduce our parameter estimation problem to an optimization problem:
$$\hat{\theta} = \arg\max_{\theta} L(\theta)$$
As our dataset grows, the product of conditional probabilities becomes very small. To avoid the numerical underflow associated with such small numbers, we can work with the log-likelihood instead. This works because both functions attain their maximum at the same location.
Log-likelihood is defined as:
$$\ell(\theta) = \log L(\theta) = \sum_{i=1}^{n} \log p(x_i \mid \theta)$$
Thus, our new optimization problem becomes:
$$\hat{\theta} = \arg\max_{\theta} \ell(\theta)$$
Note: The likelihood and log-likelihood functions have the same maximizer because the log function is strictly monotonically increasing: it preserves the ordering of function values, so it preserves the location of the maximum. Maximizing the log-likelihood is not only easier to compute because the product becomes a summation, but also far more numerically stable as the number of samples increases.
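A quick sketch of this stability issue (the sample size and coin bias here are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 0.8
flips = rng.random(2000) < theta   # 2000 simulated flips of a theta = 0.8 coin

# The raw likelihood underflows to 0.0 for large n ...
raw = np.prod(np.where(flips, theta, 1 - theta))

# ... while the log-likelihood stays a well-behaved finite number.
loglik = np.sum(np.where(flips, np.log(theta), np.log(1 - theta)))

print(raw, loglik)   # 0.0 and a finite negative number
```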
There are cases where we can find a closed-form solution to the optimization problem, but in most cases, we must use a numerical method to find the maximum.
So, going back to our biased coin example with $n_H$ heads and $n_T$ tails, we can find the maximum of the log-likelihood function:
$$\ell(\theta) = n_H \log \theta + n_T \log(1 - \theta)$$
We then find where the gradient is 0:
$$\frac{d\ell}{d\theta} = \frac{n_H}{\theta} - \frac{n_T}{1 - \theta} = 0 \implies \hat{\theta} = \frac{n_H}{n_H + n_T}$$
For our data, $\hat{\theta} = 4/5$.
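We can sanity-check the closed-form result $\hat{\theta} = n_H / (n_H + n_T)$ by simulation (the true bias 0.3 and the sample size are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
flips = rng.random(10_000) < 0.3   # simulate a coin with true theta = 0.3

# Closed-form Bernoulli MLE: the fraction of heads in the sample.
theta_hat = flips.mean()
print(theta_hat)
```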
Using MLE to estimate the parameters for problems with closed-form solutions like this works fine, but we can loosen the constraints of our optimization problem by estimating something called the log-odds instead.
First, why is estimating the log-odds useful? The parameter $\theta$ is constrained to $[0, 1]$, but the log-odds $w = \log\frac{\theta}{1 - \theta}$ can take any real value, turning a constrained optimization problem into an unconstrained one.
I will show you how to do this for the biased coin problem (or any Bernoulli distribution for that matter) but keep in mind that this can be applied to more complicated distributions.
Given $D = \{x_1, \ldots, x_n\}$ where $x_i \in \{0, 1\}$, i.i.d. samples drawn from a Bernoulli distribution:
$$p(x = 1 \mid \theta) = \theta, \qquad p(x = 0 \mid \theta) = 1 - \theta$$
We can actually re-write this using a single equation, which makes optimization even easier. Writing $\theta = \frac{e^w}{1 + e^w}$, where $w$ is the log-odds:
$$p(x \mid w) = \frac{e^{wx}}{1 + e^w}$$
We are mapping the log-odds $w$ (a set of weights in the general case) through $\frac{e^w}{1 + e^w}$ to the same range as $\theta$ and assume they are equal. As mentioned earlier, this is a nice trick (used in a lot of proofs) because we don't need to worry about the constraint $0 \le \theta \le 1$.
Notice that if $x = 1$, the equation is $\frac{e^w}{1 + e^w}$ and if $x = 0$, it is $\frac{1}{1 + e^w}$. These are the same expressions used in our distribution definition earlier.

Also notice that this is a valid probability distribution because $\frac{e^w}{1 + e^w}$ and $\frac{1}{1 + e^w}$ sum to 1 and are always positive.
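A minimal numeric check of this parameterization ($w = 1.5$ is an arbitrary log-odds value):

```python
import numpy as np

def sigmoid(w):
    """Map an unconstrained log-odds w to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-w))

def bernoulli_pmf(x, w):
    """The single-equation form p(x | w) = e^{w x} / (1 + e^w)."""
    return np.exp(w * x) / (1.0 + np.exp(w))

w = 1.5   # arbitrary log-odds value
p1, p0 = bernoulli_pmf(1, w), bernoulli_pmf(0, w)
print(p1, p0, p1 + p0)   # two positive probabilities summing to 1
```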
Now, let's try to estimate $w$. The log-likelihood is:
$$\ell(w) = \sum_{i=1}^{n} \log \frac{e^{w x_i}}{1 + e^w} = \sum_{i=1}^{n} \left[ w x_i - \log(1 + e^w) \right]$$
Now that we have the log-likelihood function, we calculate the gradient, set it to 0, and then solve for $w$:
$$\frac{d\ell}{dw} = \sum_{i=1}^{n} x_i - n \cdot \frac{e^w}{1 + e^w} = 0 \implies \frac{e^{\hat{w}}}{1 + e^{\hat{w}}} = \bar{x} \implies \hat{w} = \log \frac{\bar{x}}{1 - \bar{x}}$$
where $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$.
We can then use $\hat{w}$ to solve for $\hat{\theta}$ using the distribution definition from earlier: $\hat{\theta} = \frac{e^{\hat{w}}}{1 + e^{\hat{w}}} = \bar{x}$.
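Putting the last two steps together for our coin data (assuming the 4-heads, 1-tail sample from earlier):

```python
import numpy as np

x = np.array([1, 1, 1, 1, 0])        # 4 heads, 1 tail
xbar = x.mean()

# MLE of the log-odds ...
w_hat = np.log(xbar / (1 - xbar))

# ... which maps back through the sigmoid to theta_hat = xbar.
theta_hat = 1.0 / (1.0 + np.exp(-w_hat))
print(w_hat, theta_hat)
```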
Given $D = \{x_1, \ldots, x_n\}$, i.i.d. draws from a Gaussian distribution $\mathcal{N}(\mu, \sigma^2)$:
The probability density function (PDF) is:
$$p(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right)$$
where $\mu$ is the mean and $\sigma^2$ is the variance.
PDF Properties: the density is non-negative, $p(x \mid \mu, \sigma^2) \ge 0$, and it integrates to 1, $\int_{-\infty}^{\infty} p(x \mid \mu, \sigma^2)\, dx = 1$.
We estimate $\mu$ and $\sigma^2$ jointly by finding $\hat{\mu}$ first and then using $\hat{\mu}$ to calculate $\hat{\sigma}^2$. The log-likelihood is:
$$\ell(\mu, \sigma^2) = \sum_{i=1}^{n} \log p(x_i \mid \mu, \sigma^2) = -\frac{n}{2} \log(2\pi) - \frac{n}{2} \log \sigma^2 - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (x_i - \mu)^2$$
The only part of this equation that depends on $\mu$ is the last term. Because of this, the $\mu$ that minimizes $\sum_i (x_i - \mu)^2$ is the $\mu$ that maximizes the log-likelihood.
To find the optimal $\mu$, we calculate the gradient, set it to 0, and then solve for $\mu$:
$$\frac{\partial \ell}{\partial \mu} = \frac{1}{\sigma^2} \sum_{i=1}^{n} (x_i - \mu) = 0 \implies \hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} x_i$$
Plugging $\hat{\mu}$ back in and solving $\partial \ell / \partial \sigma^2 = 0$ gives $\hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \hat{\mu})^2$.
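A quick simulation of the Gaussian MLE (the true mean 5, standard deviation 2, and sample size are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(loc=5.0, scale=2.0, size=100_000)   # true mu = 5, sigma^2 = 4

mu_hat = x.mean()                       # MLE of the mean: the sample average
var_hat = np.mean((x - mu_hat) ** 2)    # MLE of the variance: divides by n
print(mu_hat, var_hat)
```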
We are given $n$ i.i.d. samples $D = \{(x_1, y_1), \ldots, (x_n, y_n)\}$ where each label is drawn from a Gaussian:
$$y_i \sim \mathcal{N}(\mu(x_i; w), \sigma^2)$$
where $\mu(x; w)$ is some function that estimates the mean and is parameterized by $w$.
Now, we must estimate $w$ and $\sigma^2$.
//todo: insert pic here
We solve this by maximizing the conditional likelihood $p(y \mid x; w, \sigma^2)$ instead of the marginal likelihood. The conditional log-likelihood is:
$$\ell(w, \sigma^2) = \sum_{i=1}^{n} \left[ -\frac{1}{2} \log(2\pi) - \frac{1}{2} \log \sigma^2 - \frac{(y_i - \mu(x_i; w))^2}{2\sigma^2} \right]$$
If we want to maximize the likelihood function above with respect to $w$, notice that the only term in the summation that depends on $w$ is the last. Also, notice that the last term is negated, so to maximize the log-likelihood, we should minimize this term. This means our optimization for $w$ reduces to:
$$\hat{w} = \arg\min_{w} \frac{1}{n} \sum_{i=1}^{n} (y_i - \mu(x_i; w))^2$$
This is the Mean Squared Error (MSE). This is another explanation for why MSE is used for coefficient estimation in regression tasks.
To estimate $\sigma^2$, we just need to observe the terms in the log-likelihood that depend on $\sigma^2$ and use those in a simplified optimization function. The second and third terms depend on $\sigma^2$ and are negated, so our optimization becomes (we find the optimal $\sigma^2$ after fixing $\hat{w}$, for simplicity's sake):
$$\hat{\sigma}^2 = \arg\min_{\sigma^2} \left[ \frac{n}{2} \log \sigma^2 + \frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - \mu(x_i; \hat{w}))^2 \right]$$
Setting the derivative with respect to $\sigma^2$ to 0 gives:
$$\hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} (y_i - \mu(x_i; \hat{w}))^2$$
This is the average square loss.
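Here is a sketch for a linear mean function $\mu(x; w) = w_1 x + w_0$ under the assumptions above (the true coefficients and noise level are made up for illustration); minimizing the MSE in this case is ordinary least squares:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(-3, 3, size=500)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=500)   # y ~ N(2x + 1, 0.25)

# Minimizing the MSE is the MLE for w under Gaussian noise; for a linear
# mean function mu(x; w) = w1 * x + w0 this is ordinary least squares.
X = np.column_stack([x, np.ones_like(x)])
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# The noise-variance MLE is the average squared residual.
var_hat = np.mean((y - X @ w_hat) ** 2)
print(w_hat, var_hat)
```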
Just like we can use MLE in regression to predict continuous labels, we can also use MLE for classification via logistic regression.
We are given $D = \{(x_1, y_1), \ldots, (x_n, y_n)\}$ where $y_i \in \{0, 1\}$ and
$$p(y = 1 \mid x; w) = \frac{e^{w^\top x}}{1 + e^{w^\top x}}$$
Estimate $w$.
It is impossible to come up with a closed-form solution for this, so we need to resort to numerical methods like gradient descent.
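A minimal gradient-based sketch (the true weights, learning rate, and iteration count are arbitrary choices; this maximizes the average log-likelihood by gradient ascent):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(1000, 2))
w_true = np.array([1.5, -2.0])
y = (rng.random(1000) < 1 / (1 + np.exp(-X @ w_true))).astype(float)

# Gradient ascent on the average log-likelihood of logistic regression.
w = np.zeros(2)
lr = 0.5
for _ in range(5000):
    p = 1 / (1 + np.exp(-X @ w))    # predicted P(y = 1 | x; w)
    grad = X.T @ (y - p) / len(y)   # gradient of the average log-likelihood
    w += lr * grad

print(w)   # approximately recovers w_true
```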
The Maximum Likelihood Estimator is a random variable, because it is a function of the random data:
$$\hat{\theta} = \hat{\theta}(X_1, \ldots, X_n)$$
where the samples $X_i$ are identically and independently distributed from the true distribution $p(x \mid \theta^*)$.
Essentially, from an unknown parameter $\theta^*$, observations are generated. Then, from those observations, the estimator $\hat{\theta}$ is generated.
Because $\hat{\theta}$ is a random variable, we can try to understand its statistical behavior, which relates to the analysis of evaluation metrics like bias, variance, and mean squared error (MSE).
Bias: the difference between the expectation of the MLE and the true parameter:
$$\text{Bias}(\hat{\theta}) = \mathbb{E}[\hat{\theta}] - \theta^*$$
If you don't understand the above equation, recall the definition of expectation:
$$\mathbb{E}[\hat{\theta}] = \int \hat{\theta}(D)\, p(D \mid \theta^*)\, dD$$
where $D = \{X_1, \ldots, X_n\}$ is the random dataset.
In simple cases, we can derive the bias in closed form, but for more complicated estimators we cannot.
Variance: the expected squared distance of the estimator from its expected value, i.e. it measures the spread of the estimator around its expected value:
$$\text{Var}(\hat{\theta}) = \mathbb{E}\left[ (\hat{\theta} - \mathbb{E}[\hat{\theta}])^2 \right]$$
which means
$$\text{Var}(\hat{\theta}) = \mathbb{E}[\hat{\theta}^2] - \mathbb{E}[\hat{\theta}]^2$$
Bias vs. Variance:
//TODO: Add images here
Mean Squared Error (MSE): the expected squared distance of the estimator from the true value.
MSE Equation:
$$\text{MSE}(\hat{\theta}) = \mathbb{E}\left[ (\hat{\theta} - \theta^*)^2 \right]$$
Bias-Variance Decomposition of MSE:
$$\text{MSE}(\hat{\theta}) = \text{Bias}(\hat{\theta})^2 + \text{Var}(\hat{\theta})$$
Proof of Bias-Variance Decomposition:
$$\mathbb{E}[(\hat{\theta} - \theta^*)^2] = \mathbb{E}\left[ \left( (\hat{\theta} - \mathbb{E}[\hat{\theta}]) + (\mathbb{E}[\hat{\theta}] - \theta^*) \right)^2 \right]$$
Expanding the square, the cross term $2\,\mathbb{E}[\hat{\theta} - \mathbb{E}[\hat{\theta}]]\,(\mathbb{E}[\hat{\theta}] - \theta^*)$ vanishes because $\mathbb{E}[\hat{\theta} - \mathbb{E}[\hat{\theta}]] = 0$. Thus, we can change our equation to
$$\mathbb{E}[(\hat{\theta} - \mathbb{E}[\hat{\theta}])^2] + (\mathbb{E}[\hat{\theta}] - \theta^*)^2$$
Thus,
$$\text{MSE}(\hat{\theta}) = \text{Var}(\hat{\theta}) + \text{Bias}(\hat{\theta})^2$$
And this is where the bias-variance trade-off comes from: for a fixed MSE, if the bias goes up, the variance must come down, and vice versa. We can find the best bias/variance trade-off point by minimizing the MSE.
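We can see the decomposition hold numerically by simulating an estimator many times; the code below uses the MLE variance estimator as an example (the sample size, true variance, and number of trials are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(5)
true_var = 4.0   # samples drawn with sigma = 2

# 100,000 trials, each with n = 10 samples; compute the MLE variance each time.
samples = rng.normal(scale=2.0, size=(100_000, 10))
estimates = ((samples - samples.mean(axis=1, keepdims=True)) ** 2).mean(axis=1)

bias = estimates.mean() - true_var   # negative: the MLE underestimates
var = estimates.var()
mse = np.mean((estimates - true_var) ** 2)

print(bias, var, mse, bias**2 + var)   # mse equals bias^2 + var
```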
If $\text{Bias}(\hat{\theta}) = 0$, we call our estimator "unbiased".
If $\hat{\theta} \to \theta^*$ in probability as $n \to \infty$, we call our estimator "consistent".
If $\lim_{n \to \infty} \text{Bias}(\hat{\theta}) \neq 0$, our estimator has an "asymptotic bias".
Relations: an estimator that is unbiased (or asymptotically unbiased) and whose variance goes to 0 as $n \to \infty$ is consistent.
For $\mu$, the MLE is $\hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} X_i$. Its expectation is $\mathbb{E}[\hat{\mu}] = \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}[X_i] = \mu$, so the estimator is unbiased. Its variance is:
$$\text{Var}(\hat{\mu}) = \frac{1}{n^2} \left[ \sum_{i=1}^{n} \text{Var}(X_i) + \sum_{i \neq j} \text{Cov}(X_i, X_j) \right]$$
Notice that the first term contains the variance inside the summation. The second summation is over products of two independent samples ($X_i$ and $X_j$, whose covariance is 0), so it will be 0. We can simplify the equation above further:
$$\text{Var}(\hat{\mu}) = \frac{1}{n^2} \cdot n\sigma^2 = \frac{\sigma^2}{n}$$
As $n \to \infty$, $\text{Var}(\hat{\mu}) \to 0$, so $\hat{\mu}$ is both unbiased and consistent.
Now let's quickly analyze $\hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} (X_i - \hat{\mu})^2$. Its expectation is $\mathbb{E}[\hat{\sigma}^2] = \frac{n-1}{n} \sigma^2 \neq \sigma^2$, so our estimator is biased. However, it is asymptotically unbiased because $\frac{n-1}{n} \to 1$ as $n \to \infty$. In addition, $\text{Var}(\hat{\sigma}^2) \to 0$ as $n \to \infty$, so it is consistent.
Side Note: It is possible to find an unbiased estimator for $\sigma^2$. Consider the following alternative estimator:
$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \hat{\mu})^2$$
It is a sample statistic that incorporates Bessel's correction ($n - 1$ instead of $n$) to compensate for the fact that $\hat{\mu}$ is an estimate from the sample and not the true mean. Without it, the sample variance would systematically underestimate the true variance:
$$\mathbb{E}[s^2] = \frac{n}{n-1} \cdot \frac{n-1}{n} \sigma^2 = \sigma^2$$
which makes it unbiased.
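A quick simulation contrasting the MLE variance estimate with Bessel's correction (true variance 9 and five samples per trial, chosen for illustration):

```python
import numpy as np

rng = np.random.default_rng(6)
samples = rng.normal(scale=3.0, size=(200_000, 5))   # true variance = 9, n = 5

# MLE variance (divide by n) vs. Bessel-corrected variance (divide by n - 1).
mle = ((samples - samples.mean(axis=1, keepdims=True)) ** 2).mean(axis=1)
bessel = mle * 5 / 4

print(mle.mean(), bessel.mean())   # ~7.2 (biased low) vs ~9.0 (unbiased)
```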
One thing to note is that Maximum Likelihood Estimation is almost always consistent. Why?
The short answer is that the distribution of the observations becomes more and more similar to the population distribution. However, we can give a more rigorous proof by relating MLE to Kullback-Leibler (KL) Divergence. If you don't know what this is, it is a measure used to compare 2 different distributions.
$$D_{KL}(p^* \,\|\, p) = \sum_{x} p^*(x) \log \frac{p^*(x)}{p(x)}$$
where $p^*$ is the true population distribution and $p$ is our model's distribution.
KL Divergence has the following properties: $D_{KL}(p^* \,\|\, p) \ge 0$ for any $p^*$ and $p$, and $D_{KL}(p^* \,\|\, p) = 0$ if and only if $p = p^*$.
Let's prove that $D_{KL}(p^* \,\|\, p) \ge 0$ for any $p^*$ and $p$.
First, recall Jensen's Inequality.
As a quick refresher of Jensen's Inequality, consider a case where we are given the distribution of a discrete random variable $X$ with only 2 possible points, $x_1$ and $x_2$.
If some function $f$ over this distribution is convex, then $f(\mathbb{E}[X]) \le \mathbb{E}[f(X)]$. This is known as Jensen's Inequality.

This can be generalized to a distribution with more than 2 points. We will use this in our proof.
Now, consider the KL divergence formula written as an expectation:
$$D_{KL}(p^* \,\|\, p) = \mathbb{E}_{p^*}\left[ \log \frac{p^*(x)}{p(x)} \right]$$
We know that the log function is concave, but we need a convex function so that we can apply Jensen's inequality, which will allow us to put a lower bound on the expression.
We can make the expression convex by negating the log and swapping the numerator and denominator:
$$D_{KL}(p^* \,\|\, p) = \mathbb{E}_{p^*}\left[ -\log \frac{p(x)}{p^*(x)} \right]$$
Now we can apply Jensen's Inequality, since $-\log$ is convex:
$$\mathbb{E}_{p^*}\left[ -\log \frac{p(x)}{p^*(x)} \right] \ge -\log \mathbb{E}_{p^*}\left[ \frac{p(x)}{p^*(x)} \right] = -\log \sum_{x} p^*(x) \frac{p(x)}{p^*(x)} = -\log \sum_{x} p(x) = -\log 1 = 0$$
Therefore, $D_{KL}(p^* \,\|\, p) \ge 0$.
Now we will prove that $D_{KL}(p^* \,\|\, p) = 0$ if and only if $p = p^*$.
The inequality in Jensen's inequality must be an equality for this to happen.
This would mean that $\mathbb{E}[f(X)] = f(\mathbb{E}[X])$, where $f$ is the convex function. For a strictly convex function like $-\log$, the only way for this to happen is if all the points in the distribution ($x_1$ and $x_2$ in the graph above, where the distribution only has 2 points) don't have a gap between them, so that the chord between them isn't above the function.
If $x_1$ and $x_2$ are equal in our earlier example (or any corresponding points between the distributions), we can conclude that the ratio $\frac{p(x)}{p^*(x)}$ is a constant. Since both distributions sum to 1, this implies $p = p^*$.
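Both properties are easy to check numerically for small discrete distributions (the two distributions below are arbitrary):

```python
import numpy as np

def kl(p_star, p):
    """KL(p* || p) = sum_x p*(x) log(p*(x) / p(x)) for discrete distributions."""
    return float(np.sum(p_star * np.log(p_star / p)))

p_star = np.array([0.5, 0.3, 0.2])
p = np.array([0.2, 0.5, 0.3])

print(kl(p_star, p))       # positive: the distributions differ
print(kl(p_star, p_star))  # exactly 0: identical distributions
```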
Relating KL-Divergence to MLE
Assume $p^*$ is the data distribution given $D$ and $p$ is the "model". Consider if $p$ were part of some parametric family $\{p(x \mid \theta)\}$.
Our learning problem becomes:
$$\hat{\theta} = \arg\min_{\theta} D_{KL}(p^* \,\|\, p_\theta) = \arg\min_{\theta} \left[ \mathbb{E}_{p^*}[\log p^*(x)] - \mathbb{E}_{p^*}[\log p(x \mid \theta)] \right]$$
Notice that $\mathbb{E}_{p^*}[\log p^*(x)]$ is not dependent on $\theta$, so our learning problem is actually:
$$\hat{\theta} = \arg\max_{\theta} \mathbb{E}_{p^*}[\log p(x \mid \theta)]$$
This is the same thing as the average log-likelihood: when $p^*$ is the empirical distribution of the data, the expectation becomes $\frac{1}{n} \sum_{i=1}^{n} \log p(x_i \mid \theta)$.
Maximizing the average log-likelihood is the same thing as finding the maximum likelihood estimate.
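We can verify this equivalence on the coin data (assuming the 4-heads, 1-tail sample from earlier): the $\theta$ that minimizes the KL divergence to the empirical distribution is the same $\theta$ that maximizes the average log-likelihood.

```python
import numpy as np

x = np.array([1, 1, 1, 1, 0])                        # the coin data again
emp = np.array([(x == 0).mean(), (x == 1).mean()])   # empirical p*: [P(0), P(1)]

thetas = np.linspace(0.01, 0.99, 99)

# Average log-likelihood of each candidate theta ...
avg_ll = emp[1] * np.log(thetas) + emp[0] * np.log(1 - thetas)

# ... and KL(p* || p_theta); they differ only by a constant in theta.
kl = emp[1] * np.log(emp[1] / thetas) + emp[0] * np.log(emp[0] / (1 - thetas))

print(thetas[np.argmax(avg_ll)], thetas[np.argmin(kl)])   # same theta both ways
```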
Maximum Likelihood Estimation (MLE) is a powerful and foundational statistical method for parameter estimation. By leveraging the principle of maximizing the likelihood function, MLE provides a systematic approach to infer the parameters of a model based on observed data. Its versatility is evident as it is applicable to a wide range of parametric families, from simple cases like a biased coin to more complex scenarios such as regression tasks or Gaussian distributions.
Throughout this blog post, we explored key aspects of MLE: deriving closed-form estimates for the Bernoulli and Gaussian distributions, the log-odds reparameterization that removes constraints, MLE's role in regression (MSE) and classification (logistic regression), the statistical behavior of the estimator (bias, variance, and MSE), and its connection to KL divergence.
Whether you're building predictive models, designing experiments, or analyzing data, MLE provides a strong statistical foundation. Its principles extend beyond traditional parametric models and are central to many machine learning algorithms.
By mastering MLE, you unlock a tool that combines theoretical rigor with practical applicability, empowering you to tackle a wide array of problems with confidence.