
Maximum Likelihood Estimation (MLE)


How to Use the Visualization

  1. Adjust $\theta^*$ (the true parameter): Use the slider to change the parameter.
  2. Generate Samples: Click the button to simulate coin flips.
  3. Observe Likelihood and Log-Likelihood: Watch how the plots change as you adjust the parameter and generate more samples.
  4. Observe the estimated parameter at the top ($\hat{\theta}$): This is the value of $\theta$ at which both the likelihood and log-likelihood reach their maximum.

Things to Note

  • The scale on the y-axis of the likelihood plot becomes extremely small as the number of samples increases.
  • The likelihood plot moves around much more than the log-likelihood plot.
  • The maximum occurs at the same place in both plots, leading to the same parameter estimate.

Theory

Main Idea

Maximum Likelihood Estimation (MLE) is a statistical method that estimates the parameters of a model or distribution by maximizing the likelihood function. The likelihood function is a function of the parameters that quantifies the probability of observing some fixed data given those parameters.

Quick Review

Parametric Family: A collection of distributions that are characterized by a common mathematical form and set of parameters, but differ based on values of those parameters.

Examples of parametric families include:

  • Normal (Gaussian): $f(x \mid \mu, \sigma^2)$
  • Exponential: $f(x \mid \lambda)$
  • Gamma: $f(x \mid k, \theta)$
  • Bernoulli: $f(x \mid \theta)$

Closed-Form Solution: Can be written explicitly in terms of known functions and operations.

Numerical Method: Computational technique used to approximate solutions.

Illustrative Example - Biased Coin

Assume we have a biased coin whose probability of heads is unknown. We can define a Bernoulli distribution with parameter $\theta$ that models the probability of heads.

$$P(X = x) = \begin{cases} \theta & \text{if } x = H, \\ 1 - \theta & \text{if } x = T. \end{cases}$$

Problem:

  • Given a set of independent observations $HHHHT$, estimate $\theta$,
  • where $X$ is a random variable and $\theta$ is the unknown parameter.

We want to find the probability of observing the data given a parameter $\theta$. It may seem obvious that $\theta$ should be $4/5$. However, we would like a more principled and general mathematical approach that extends to more complex models (other parametric families).

For example, which is more likely: $\theta = 0.9$ or $\theta = 0.1$?

If $\theta = 0.1$, then $P(HHHHT \mid \theta) = \theta^4(1-\theta) = (0.1)^4(0.9) \approx 0.00009$.

If $\theta = 0.9$, then $P(HHHHT \mid \theta) = (0.9)^4(0.1) \approx 0.0656 > 0.00009$.

We can do this for every value $\theta \in [0, 1]$ and find the $\theta$ that maximizes $P(HHHHT \mid \theta)$. This is the maximum likelihood estimate.

$$\hat{\theta} = \arg\max_{\theta} \mathcal{L}(\theta) = \arg\max_{\theta} P(HHHHT \mid \theta) = \arg\max_{\theta} \theta^4(1-\theta)$$
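To make this concrete, here is a minimal grid-search sketch (my own illustration, not the post's visualization code) that evaluates the likelihood $\theta^4(1-\theta)$ on a grid and picks the maximizer:

```python
import numpy as np

# Evaluate L(theta) = theta^4 * (1 - theta) on a dense grid of candidate
# parameters and take the one with the largest likelihood.
thetas = np.linspace(0, 1, 1001)
likelihood = thetas**4 * (1 - thetas)
theta_hat = thetas[np.argmax(likelihood)]
print(theta_hat)  # close to 4/5
```

A grid search like this is crude, but it mirrors exactly what the interactive likelihood plot does visually: scan $\theta$ and read off where the curve peaks.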

Formal Definition

Parameter estimation by MLE solves problems where

  • Given a set of observations $\{x_i\}_{i=1}^n$,
  • where each observation is independent and identically distributed (i.i.d.) following an unknown distribution $\mathbb{P}^*$ from a parametric family of distributions $\{\mathbb{P}(\cdot \mid \theta) : \theta \in \Theta\}$,
  • and you want to estimate the parameter $\theta$.

The likelihood function is defined as:

$$\mathcal{L}(\theta) = \mathbb{P}(x_1, \dots, x_n \mid \theta) = \prod_{i=1}^n \mathbb{P}(x_i \mid \theta)$$

Recall that the probabilities are multiplied together because the samples are i.i.d.

Now that we have a likelihood function, we can reduce our parameter estimation problem to an optimization problem:

$$\hat{\theta} = \arg\max_{\theta} \mathcal{L}(\theta)$$

As our dataset grows, the product of probabilities becomes vanishingly small, which leads to numerical underflow. To avoid this, we can work with the log-likelihood instead. This works because the maximum of both functions occurs at the same location.

Log-likelihood is defined as:

$$\ell(\theta) = \log \mathcal{L}(\theta) = \log\left(\prod_{i=1}^n \mathbb{P}(x_i \mid \theta)\right) = \sum_{i=1}^n \log \mathbb{P}(x_i \mid \theta)$$

Thus, our new optimization function becomes:

$$\hat{\theta} = \arg\max_{\theta} \ell(\theta) = \arg\max_{\theta} \sum_{i=1}^n \log \mathbb{P}(x_i \mid \theta)$$

Note: The likelihood and log-likelihood have the same maximizer because the log is a strictly increasing function, so applying it does not change where the maximum occurs. Because the log also grows far more slowly, maximizing the log-likelihood is not only easier to compute (products become sums) but far more numerically stable as the number of samples increases (see for yourself in the interactive visualization below).
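A quick way to see the stability issue (a hypothetical simulation, not part of the visualization): with a few thousand coin flips, the raw likelihood product underflows double precision to exactly 0, while the log-likelihood stays a well-behaved finite number.

```python
import numpy as np

# Simulate 5000 Bernoulli(0.7) coin flips and evaluate both the raw
# likelihood product and the log-likelihood at theta = 0.7.
rng = np.random.default_rng(0)
x = rng.random(5000) < 0.7            # True = heads
theta = 0.7
per_sample = np.where(x, theta, 1 - theta)
likelihood = np.prod(per_sample)            # underflows to 0.0 in float64
log_likelihood = np.sum(np.log(per_sample)) # finite, on the order of -3000
print(likelihood, log_likelihood)
```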

There are cases where we can find a closed-form solution to the optimization problem, but in most cases, we must use a numerical method to find the maximum.

Biased Coin Continued...

So, going back to our biased coin example, we can find the maximum of the log-likelihood function:

$$\hat{\theta} = \arg\max_{\theta} \ell(\theta) = \arg\max_{\theta} \log \mathcal{L}(\theta) = \arg\max_{\theta} \log P(HHHHT \mid \theta) = \arg\max_{\theta} \log\left(\theta^4(1-\theta)\right) = \arg\max_{\theta} \left[4\log(\theta) + \log(1-\theta)\right]$$

We then find where the gradient is 0:

$$\nabla \ell(\theta) = \frac{4}{\theta} - \frac{1}{1-\theta} = 0 \implies \hat{\theta} = \frac{4}{5}$$

Log-Odds

Using MLE to estimate the parameters for problems with closed-form solutions like this works fine, but we can loosen the constraints of our optimization problem by estimating something called the log-odds instead.

First, why is estimating the log-odds useful?

  • Unconstrained Optimization: Estimating the log-odds is advantageous because we aren't confined to the range $[0, 1]$, which $\theta$ is because it's a probability.
  • Improved Numerical Stability: This is particularly useful when $\theta$ approaches 0 or 1, because the log of probabilities parameterized by $\theta$ can be undefined depending on the distribution. For example, if $\theta$ is 0 or 1 in a Bernoulli distribution (like in the biased coin example), the log-likelihood is undefined.

I will show you how to do this for the biased coin problem (or any Bernoulli distribution, for that matter), but keep in mind that this can be applied to more complicated distributions.

Given $\{x_i\}_{i=1}^n$ where $x_i \in \{0, 1\}$, i.i.d. samples drawn from a Bernoulli distribution:

$$P(X = x) = \begin{cases} \dfrac{\exp(w)}{1 + \exp(w)} \triangleq \theta & \text{if } x = 1, \\[6pt] \dfrac{1}{1 + \exp(w)} \triangleq 1 - \theta & \text{if } x = 0. \end{cases}$$

We can actually re-write this as a single equation, which makes optimization even easier.

$$P(X = x) = \frac{\exp(wx)}{1 + \exp(w)}$$

We are mapping a weight $w$ (our log-odds) to the same range as $\theta$ and assuming they are equal. As mentioned earlier, this is a nice trick (used in a lot of proofs) because we don't need to worry about the constraint $\theta \in [0, 1]$.

Notice that if $x = 1$, the equation is $P(X = 1) = \frac{\exp(w)}{1 + \exp(w)}$, and if $x = 0$, it is $P(X = 0) = \frac{1}{1 + \exp(w)}$. These are the same equations used in our distribution definition earlier.


Also notice that this is a valid probability distribution because $P(X=1)$ and $P(X=0)$ sum to 1 and are always positive.

Now, let's try to estimate www.

$$\ell(w) = \sum_{i=1}^n \log P(x_i \mid w) = \sum_{i=1}^n \left[ x_i w - \log\left(1 + \exp(w)\right) \right] = \left( \sum_{i=1}^n x_i \right) w - n \log\left(1 + \exp(w)\right) = n \left( \bar{x}w - \log\left(1 + \exp(w)\right) \right)$$

where $\bar{x} = \frac{1}{n} \sum_{i=1}^n x_i$.

Now that we have the log-likelihood function, we calculate the gradient, set it to 0, and then solve for $\hat{w}$.

$$\nabla \ell(w) = n \left( \bar{x} - \frac{\exp(w)}{1 + \exp(w)} \right) = 0 \implies \hat{w} = \log\left(\frac{\bar{x}}{1 - \bar{x}}\right)$$

We can then use $\hat{w}$ to solve for $\hat{\theta}$ using the distribution definition from earlier.

$$\hat{\theta} = \frac{\exp(\hat{w})}{1 + \exp(\hat{w})}$$
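As a sanity check, here is a small sketch of the log-odds route for the $HHHHT$ data ($\bar{x} = 4/5$), under the parameterization above:

```python
import math

# Estimate the log-odds w, then map back to theta through the sigmoid.
x_bar = 4 / 5                                  # fraction of heads in HHHHT
w_hat = math.log(x_bar / (1 - x_bar))          # w_hat = log(4) here
theta_hat = math.exp(w_hat) / (1 + math.exp(w_hat))
print(w_hat, theta_hat)  # theta_hat recovers 4/5
```

Optimizing over the unconstrained $w$ and mapping back gives the same answer as the constrained problem, which is the whole point of the trick.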

Example - MLE for a Gaussian Distribution

Given $\{x_i\}_{i=1}^n$, i.i.d. draws from a Gaussian distribution:

$$\mathcal{N}(\mu, \sigma^2)$$

The probability density function (PDF) is:

$$P(x \mid \theta) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

where $\theta = (\mu, \sigma)$:

  • $\mu$: Mean
  • $\sigma^2$: Variance

PDF Properties

  • The exponent is scaled by $-\frac{1}{2\sigma^2}$, so the density decays as $x$ moves away from the mean $\mu$.
  • $\frac{1}{\sqrt{2\pi}\,\sigma}$ is a normalization factor ensuring the PDF integrates to 1: $\int P(x \mid \theta) \, dx = 1$.

We estimate $\mu$ and $\sigma$ by finding $\hat{\mu}$ first and then using $\hat{\mu}$ to calculate $\hat{\sigma}$.

$$\ell(\theta) = \sum_{i=1}^n \log p(x_i \mid \theta) = \sum_{i=1}^n \left( \log\left(\frac{1}{(2\pi)^{1/2} \sigma}\right) - \frac{1}{2\sigma^2}(x_i - \mu)^2 \right) = \boxed{-\frac{n}{2} \log(2\pi) - n \log \sigma - \frac{1}{2\sigma^2} \sum_{i=1}^n (x_i - \mu)^2}$$

The only part of this expression that depends on $\mu$ is the last term. Because of this, the $\hat{\mu}$ that minimizes this term is the $\hat{\mu}$ that maximizes the log-likelihood.

$$\hat{\mu} = \arg\min_{\mu} \sum_{i=1}^n (x_i - \mu)^2$$

To find the optimal $\hat{\mu}$, we calculate the gradient, set it to 0, and then solve: $-2\sum_{i=1}^n (x_i - \mu) = 0 \implies \hat{\mu} = \frac{1}{n}\sum_{i=1}^n x_i$, the sample mean. Plugging $\hat{\mu}$ back in and maximizing over $\sigma$ gives $\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^n (x_i - \hat{\mu})^2$.
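The closed-form estimates are easy to check numerically; the sketch below (with an arbitrary choice of true mean 2.0 and standard deviation 1.5) computes $\hat{\mu}$ and $\hat{\sigma}^2$ from simulated draws:

```python
import numpy as np

# The Gaussian MLE is the sample mean and the (biased, 1/n) sample
# variance; np.var uses 1/n by default, so it matches the MLE exactly.
rng = np.random.default_rng(42)
x = rng.normal(loc=2.0, scale=1.5, size=10_000)
mu_hat = x.mean()
sigma2_hat = np.mean((x - mu_hat) ** 2)
print(mu_hat, sigma2_hat)  # near 2.0 and 2.25
```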

Example - MLE in Regression

We are given $\{x_i, y_i\}_{i=1}^n$, i.i.d. samples drawn from a Gaussian

$$P(y \mid x; \theta, \sigma) \sim \mathcal{N}(f(x; \theta), \sigma^2)$$

where $f$ is some function that estimates the mean and is parameterized by $\theta$.

Now, we must estimate $\theta$ and $\sigma$.

Why does it make sense to model the regression task this way?


We solve this by maximizing the conditional likelihood $P(y \mid x; \theta, \sigma^2)$ instead of the marginal likelihood.

$$\ell(\theta, \sigma) = \sum_{i=1}^n \log P(y_i \mid x_i; \theta, \sigma) = \sum_{i=1}^n \left[ -\frac{1}{2}\log(2\pi) - \log(\sigma) - \frac{1}{2\sigma^2}\left(y_i - f(x_i, \theta)\right)^2 \right] = -\frac{n}{2}\log(2\pi) - n\log(\sigma) - \frac{1}{2\sigma^2}\sum_{i=1}^n \left(y_i - f(x_i, \theta)\right)^2$$

If we want to maximize the log-likelihood above with respect to $\theta$, notice that the only term that depends on $\theta$ is the last one. Also, notice that this term is negated, so to maximize the log-likelihood, we should minimize it. This means our optimization for $\theta$ reduces to:

$$\arg\max_{\theta} \ell(\theta, \sigma) \rightarrow \hat{\theta} = \arg\min_{\theta} \sum_{i=1}^n \left(y_i - f(x_i, \theta)\right)^2$$

This is (up to a constant factor) the Mean Squared Error (MSE). This is another explanation for why MSE is used for coefficient estimation in regression tasks.

To estimate $\sigma$, we keep only the terms of the log-likelihood that depend on $\sigma$ and use those in a simplified optimization problem. The second and third terms depend on $\sigma$ and are negated, so our optimization becomes (we solve for the optimal $\sigma^2$ for simplicity's sake):

$$\arg\max_{\sigma^2} \ell(\hat{\theta}, \sigma) \rightarrow \hat{\sigma}^2 = \arg\min_{\sigma^2} \left[\frac{n}{2}\log(\sigma^2) + \frac{1}{2\sigma^2}\sum_{i=1}^n \left(y_i - f(x_i, \hat{\theta})\right)^2\right] = \frac{1}{n}\sum_{i=1}^n \left(y_i - f(x_i, \hat{\theta})\right)^2$$

This is the average square loss.
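As an illustration, the sketch below assumes a hypothetical linear model $f(x; \theta) = ax + b$; maximizing the Gaussian conditional likelihood then reduces to ordinary least squares, and the MLE of $\sigma^2$ is the average squared residual:

```python
import numpy as np

# Simulate y = 3x + 1 + Gaussian noise (sigma = 0.5), then fit by least
# squares, which is the MLE for theta under the Gaussian noise model.
rng = np.random.default_rng(1)
x = rng.uniform(-2, 2, 500)
y = 3.0 * x + 1.0 + rng.normal(0, 0.5, 500)

A = np.column_stack([x, np.ones_like(x)])    # design matrix for a*x + b
(a_hat, b_hat), *_ = np.linalg.lstsq(A, y, rcond=None)
residuals = y - (a_hat * x + b_hat)
sigma2_hat = np.mean(residuals ** 2)         # MLE of the noise variance
print(a_hat, b_hat, sigma2_hat)
```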

Just like we can use MLE in regression to predict continuous labels, we can also use MLE for classification via logistic regression.

Example - MLE For Logistic Regression

We are given $\{x_i, y_i\}_{i=1}^n$ where $y_i \in \{0, 1\}$, s.t.

$$P(y \mid x; \theta) = \frac{\exp\left(y f(x,\theta)\right)}{1 + \exp\left(f(x,\theta)\right)}$$

Estimate $\theta$.

$$P(y=0 \mid x; \theta) = \frac{1}{1 + \exp\left(f(x,\theta)\right)}, \qquad P(y=1 \mid x; \theta) = \frac{\exp\left(f(x,\theta)\right)}{1 + \exp\left(f(x,\theta)\right)}$$

$$\ell(\theta) = \sum_{i=1}^n \log P(y_i \mid x_i; \theta) = \sum_{i=1}^n \left[y_i f(x_i, \theta) - \log\left(1 + \exp\left(f(x_i, \theta)\right)\right)\right]$$

In general there is no closed-form solution here, so we need to resort to numerical methods like gradient descent.
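Here is a minimal gradient-ascent sketch, assuming a hypothetical linear score $f(x, \theta) = \theta_0 + \theta_1 x$; the update uses the gradient of the average log-likelihood, $\frac{1}{n}X^\top(y - p)$:

```python
import numpy as np

# Simulate labels from a logistic model with true weights [0.5, 2.0],
# then recover them by gradient ascent on the log-likelihood.
rng = np.random.default_rng(7)
x = rng.uniform(-3, 3, 2000)
true_w = np.array([0.5, 2.0])               # [intercept, slope]
p = 1 / (1 + np.exp(-(true_w[0] + true_w[1] * x)))
y = (rng.random(2000) < p).astype(float)

X = np.column_stack([np.ones_like(x), x])
w = np.zeros(2)
lr = 0.1
for _ in range(2000):
    probs = 1 / (1 + np.exp(-(X @ w)))
    grad = X.T @ (y - probs) / len(y)       # gradient of avg log-likelihood
    w += lr * grad                          # ascent step
print(w)  # approaches [0.5, 2.0] up to sampling noise
```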

Evaluation Metrics and MLE

The Maximum Likelihood Estimator is a random variable because it is a function of the random data, as noted below.

$$\hat{\theta} = \hat{\theta}(x_1, x_2, \ldots, x_n)$$

where the samples are independently and identically distributed according to the true distribution

$$\{x_i\} \sim \mathbb{P}(\cdot \mid \theta^*)$$

Essentially, from an unknown parameter, observations are generated. Then, from those observations, the estimator is generated.

$$\theta^* \rightarrow \{x_i\}_{i=1}^n \rightarrow \hat{\theta}$$

Because $\hat{\theta}$ is a random variable, we can try to understand its statistical behavior, which relates to the analysis of evaluation metrics like bias, variance, and mean squared error (MSE).

Bias: the difference between the expectation of the MLE and the true parameter.

$$\text{Bias}(\hat{\theta}) = \mathbb{E}\left[\hat{\theta}(x_1, x_2, \ldots, x_n)\right] - \theta^* = \int_{-\infty}^{\infty} \hat{\theta}(x_1, x_2, \ldots, x_n) \prod_{i=1}^n P(x_i \mid \theta^*) \, dx_1 \cdots dx_n - \theta^*$$

If you don't understand the above equation, recall the definition of expectation:

$$\mathbb{E}[g(X)] = \int_{-\infty}^{\infty} g(x) f_X(x) \, dx$$

Where

  • $g(X)$ is a function of the R.V. $X$. In our case this is $\hat{\theta}(x_1, x_2, \ldots, x_n)$.
  • $f_X(x)$ is the PDF of the R.V. $X$.

In simple cases, we have closed-form expressions for the bias, but we don't for more complicated ones.
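When a closed-form bias is unavailable, the expectation can be approximated by simulation. Here is a sketch for the Bernoulli MLE $\hat{\theta} = \bar{x}$, whose bias is exactly 0, so the Monte Carlo estimate should land near 0:

```python
import numpy as np

# Draw many datasets from the true distribution, compute the MLE on each,
# and compare the average estimate to the true parameter.
rng = np.random.default_rng(3)
theta_star, n, trials = 0.3, 50, 20_000
estimates = rng.binomial(n, theta_star, size=trials) / n   # x_bar per trial
bias_mc = estimates.mean() - theta_star
print(bias_mc)  # close to 0
```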

Variance: the expected squared distance of the estimator from its expected value, i.e., it measures the spread of the estimator $\hat{\theta}$ around its expected value.

  • It helps to quantify how much $\hat{\theta}$ varies across different samples of data.

$$\text{Var}(\hat{\theta}) = \mathbb{E}_{\theta^*}\left[\left(\hat{\theta}(x_1, x_2, \ldots, x_n) - \mathbb{E}_{\theta^*}\left[\hat{\theta}(x_1, x_2, \ldots, x_n)\right]\right)^2\right]$$

which means, informally,

$$\text{Var}(\text{R.V.}) = \text{Expectation}\left[\left(\text{R.V.} - \text{Expectation of R.V.}\right)^2\right]$$

Bias vs. Variance:

  • Bias measures the difference between the mean and the true parameter.
  • Variance measures the fluctuation around the mean.
  • We want both to be small.


Mean Squared Error (MSE): the expected squared distance of the estimator from the true value.

  • This is almost the same as variance. The only difference is that the expectation of the estimator $\hat{\theta}$ is replaced with the true parameter $\theta^*$.
  • The MSE captures both the bias and the variance. Because we want both the bias and the variance to be small, we also want the MSE to be small.

MSE Equation:

$$\text{MSE}(\hat{\theta}) = \mathbb{E}_{\theta^*}\left[\left(\hat{\theta}(X_1, \ldots, X_n) - \theta^*\right)^2\right]$$

Bias-Variance Decomposition of MSE:

$$\text{MSE}(\hat{\theta}) = \left(\text{Bias}(\hat{\theta})\right)^2 + \text{Var}(\hat{\theta})$$

Proof of Bias-Variance Decomposition:

  1. Start with the simplified definition of MSE:

$$\text{MSE}(\hat{\theta}) = \mathbb{E}\left[\left(\hat{\theta} - \theta^*\right)^2\right]$$

  2. Subtract and add the expectation of the estimator (the same as adding 0):

$$\mathbb{E}\left[\left(\left(\hat{\theta} - \mathbb{E}(\hat{\theta})\right) + \left(\mathbb{E}(\hat{\theta}) - \theta^*\right)\right)^2\right]$$

  3. Expand the square and distribute the expectation:

$$\mathbb{E}\left[\left(\hat{\theta} - \mathbb{E}(\hat{\theta})\right)^2 + \left(\mathbb{E}(\hat{\theta}) - \theta^*\right)^2 + 2\left(\hat{\theta} - \mathbb{E}(\hat{\theta})\right)\left(\mathbb{E}(\hat{\theta}) - \theta^*\right)\right]$$

$$= \mathbb{E}\left[\left(\hat{\theta} - \mathbb{E}(\hat{\theta})\right)^2\right] + \mathbb{E}\left[\left(\mathbb{E}(\hat{\theta}) - \theta^*\right)^2\right] + \mathbb{E}\left[2\left(\hat{\theta} - \mathbb{E}(\hat{\theta})\right)\left(\mathbb{E}(\hat{\theta}) - \theta^*\right)\right]$$

  • Notice that the first term is $\text{Var}(\hat{\theta})$ and the second term inside the expectation is $\text{Bias}(\hat{\theta})^2$. We can move $\text{Bias}(\hat{\theta})^2$ outside of the expectation because it is a constant (it doesn't depend on the data).

Thus, we can rewrite the equation as

$$= \text{Var}(\hat{\theta}) + \text{Bias}(\hat{\theta})^2 + \mathbb{E}\left[2\left(\hat{\theta} - \mathbb{E}(\hat{\theta})\right)\left(\mathbb{E}(\hat{\theta}) - \theta^*\right)\right]$$

  4. The last thing we need to do is simplify the final term:

$$\mathbb{E}\left[2\left(\hat{\theta} - \mathbb{E}(\hat{\theta})\right)\left(\mathbb{E}(\hat{\theta}) - \theta^*\right)\right]$$

  • Note that the second factor inside the expectation is the same constant discussed earlier, so we move it out of the expectation along with the factor of 2:

$$2\,\mathbb{E}\left[\hat{\theta} - \mathbb{E}(\hat{\theta})\right]\left(\mathbb{E}(\hat{\theta}) - \theta^*\right)$$

  • The expectation distributes, and the expectation of an expectation is just the expectation itself:

$$2\left(\mathbb{E}(\hat{\theta}) - \mathbb{E}(\hat{\theta})\right)\left(\mathbb{E}(\hat{\theta}) - \theta^*\right) = 2 \cdot 0 \cdot \left(\mathbb{E}(\hat{\theta}) - \theta^*\right) = \boxed{0}$$

Thus,

$$\text{MSE}(\hat{\theta}) = \text{Bias}(\hat{\theta})^2 + \text{Var}(\hat{\theta})$$

And this is where the bias-variance trade-off comes from: for a fixed MSE, if the bias goes up, the variance must come down, and vice versa. We can find the best bias/variance trade-off point by minimizing the MSE.

If $\text{Bias}(\hat{\theta}) = 0$, we call our estimator $\hat{\theta}$ "unbiased".
If $\text{MSE}(\hat{\theta}) \rightarrow 0$ as $n \rightarrow \infty$, we call our estimator $\hat{\theta}$ "consistent". If $\text{Bias}(\hat{\theta}) \rightarrow 0$ as $n \rightarrow \infty$, we call our estimator $\hat{\theta}$ "asymptotically unbiased".

Relations:

  • unbiased $\not\leftrightarrow$ consistent
  • consistent $\rightarrow$ asymptotically unbiased
  • unbiased $\rightarrow$ asymptotically unbiased

Example - Calculating MLE Evaluation Metrics For a Gaussian

For $\{x_i\}_{i=1}^n \sim \mathcal{N}(\mu, \sigma^2)$, the MLE is

$$\hat{\mu} = \frac{1}{n}\sum_{i=1}^n x_i, \quad \hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^n (x_i - \hat{\mu})^2$$

$$\text{Bias}(\hat{\mu}) = \mathbb{E}[\hat{\mu}] - \mu = \mathbb{E}\left[\frac{1}{n}\sum_{i=1}^n x_i\right] - \mu = \frac{1}{n}\sum_{i=1}^n \mathbb{E}[x_i] - \mu = \mu - \mu = 0$$

$$\text{Var}(\hat{\mu}) = \mathbb{E}\left[(\hat{\mu} - \mu)^2\right] = \mathbb{E}\left[\left(\frac{1}{n}\sum_{i=1}^n x_i - \mu\right)^2\right] = \mathbb{E}\left[\frac{1}{n^2}\left(\sum_{i=1}^n (x_i - \mu)\right)^2\right]$$

$$= \mathbb{E}\left[\frac{1}{n^2}\left(\sum_{i=1}^n (x_i - \mu)^2 + \sum_{\substack{i,j=1 \\ i \neq j}}^n (x_i - \mu)(x_j - \mu)\right)\right] = \frac{1}{n^2}\sum_{i=1}^n \mathbb{E}\left[(x_i - \mu)^2\right] + \frac{1}{n^2}\sum_{\substack{i,j=1 \\ i \neq j}}^n \mathbb{E}\left[(x_i - \mu)(x_j - \mu)\right]$$

Notice that the first term contains the variance inside the summation. The second summation is the expected product of two independent random variables ($x_i$ and $x_j$, whose covariance is 0), so it will be 0.

$$\mathbb{E}\left[(x_i - \mu)(x_j - \mu)\right] = 0$$

We can simplify the equation above further:

$$= \frac{1}{n^2}\sum_{i=1}^n \mathbb{E}\left[(x_i - \mu)^2\right] + 0 = \frac{n\sigma^2}{n^2} = \frac{\sigma^2}{n}$$

As $n \rightarrow \infty$, $\text{MSE}(\hat{\mu}) \rightarrow 0$, so $\hat{\mu}$ is consistent; and $\text{Bias}(\hat{\mu}) = 0$, so $\hat{\mu}$ is unbiased.
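The $\text{Var}(\hat{\mu}) = \sigma^2/n$ result is easy to check by simulation; this sketch (with arbitrary $\sigma^2 = 4$ and $n = 100$) compares the empirical variance of many sample means against $\sigma^2/n$:

```python
import numpy as np

# Draw many datasets of size n, compute the sample mean of each, and
# compare the spread of those means to the theoretical sigma^2 / n.
rng = np.random.default_rng(5)
sigma2, n, trials = 4.0, 100, 50_000
samples = rng.normal(0.0, np.sqrt(sigma2), size=(trials, n))
mu_hats = samples.mean(axis=1)
print(mu_hats.var(), sigma2 / n)  # both near 0.04
```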

Now let's quickly analyze $\hat{\sigma}^2$. Here $\mathbb{E}[\hat{\sigma}^2] \neq \sigma^2$, so our estimator is biased. However, it is asymptotically unbiased because $\text{Bias}(\hat{\sigma}^2) \rightarrow 0$ as $n \rightarrow \infty$. Also $\text{Var}(\hat{\sigma}^2) \rightarrow 0$ and $\text{MSE}(\hat{\sigma}^2) \rightarrow 0$ as $n \rightarrow \infty$, so it is consistent.

Side Note: It is possible to find an unbiased estimator for $\sigma^2$. Consider the following alternative estimator $\hat{S}^2$. It is a sample statistic that incorporates Bessel's correction ($n-1$ instead of $n$) to compensate for the fact that $\hat{\mu}$ is an estimate from the sample and not the true mean. Without it, the sample variance would systematically underestimate the true variance:

$$\hat{S}^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \hat{\mu})^2 \implies \mathbb{E}[\hat{S}^2] = \sigma^2$$

Which is unbiased.
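A small simulation makes Bessel's correction visible; with $n = 5$ the $1/n$ MLE averages about $\frac{n-1}{n}\sigma^2 = 0.8\sigma^2$, while the corrected estimator averages $\sigma^2$ (a sketch, not from the post):

```python
import numpy as np

# Compare the 1/n MLE variance to the 1/(n-1) Bessel-corrected estimator
# across many small samples from a standard normal (sigma^2 = 1).
rng = np.random.default_rng(9)
n, trials = 5, 100_000
samples = rng.normal(0.0, 1.0, size=(trials, n))
dev = samples - samples.mean(axis=1, keepdims=True)
mle_var = (dev ** 2).mean(axis=1)        # 1/n estimator, biased low
bessel_var = mle_var * n / (n - 1)       # 1/(n-1) estimator, unbiased
print(mle_var.mean(), bessel_var.mean())  # roughly 0.8 vs 1.0
```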

Consistency in MLE

One thing to note is that Maximum Likelihood Estimation is almost always consistent (under mild regularity conditions). Why?

The short answer is that the empirical distribution of the observations becomes more and more similar to the population distribution. However, we can give a more rigorous argument by relating MLE to the Kullback-Leibler (KL) divergence. If you don't know what this is, it is a measure used to compare two different distributions.

$$\mathrm{KL}(q \parallel p) = \mathbb{E}_{q}\left[\log q(x) - \log p(x)\right] = \mathbb{E}_{q}\left[\log\left(\frac{q(x)}{p(x)}\right)\right]$$

where $q$ is the true population distribution and $p$ is our model's distribution.

KL Divergence has the following properties:

  • $\mathrm{KL}(q \parallel p) \geq 0$ for any $q$ and $p$.
  • $\mathrm{KL}(q \parallel p) = 0$ if and only if $q = p$.

$$\mathrm{KL}(q \parallel p) = \begin{cases} \sum_x q(x) \log \dfrac{q(x)}{p(x)} & \text{if } x \text{ is discrete} \\[10pt] \displaystyle\int q(x) \log \frac{q(x)}{p(x)} \, dx & \text{if } x \text{ is continuous} \end{cases}$$

  • $\mathrm{KL}(q \parallel p) \neq \mathrm{KL}(p \parallel q)$ in general.
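These properties are easy to check numerically in the discrete case; the sketch below uses two arbitrary three-point distributions:

```python
import numpy as np

def kl(q, p):
    """Discrete KL divergence: sum_x q(x) * log(q(x)/p(x))."""
    q, p = np.asarray(q, float), np.asarray(p, float)
    return float(np.sum(q * np.log(q / p)))

q = [0.5, 0.3, 0.2]
p = [0.4, 0.4, 0.2]
print(kl(q, p))            # non-negative
print(kl(q, q))            # exactly 0 when the distributions match
print(kl(q, p), kl(p, q))  # the two directions differ
```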

Let's prove that $\mathrm{KL}(q \parallel p) \geq 0$ for any $q$ and $p$.

First, recall Jensen's Inequality.

As a quick refresher on Jensen's Inequality, consider a case where we are given the PDF of $q$, where $X$ is a discrete R.V. with only 2 possible points:

$$q(x) = \frac{\delta_{x_1} + \delta_{x_2}}{2}$$

If some function $f$ over this distribution $q$ is convex, then $\mathbb{E}_q[f(x)] \geq f\left(\mathbb{E}_q[x]\right)$. This is known as Jensen's Inequality.

[Figure: Jensen's Inequality]

This can be generalized to a distribution qqq with more than 2 points. We will use this in our proof.

Now, consider the KL divergence formula.

$$\mathrm{KL}(q \parallel p) = \mathbb{E}_{q}\left[\log\left(\frac{q(x)}{p(x)}\right)\right]$$

We know that the log function is concave, but we want a convex function so that we can apply Jensen's Inequality, which will allow us to put a lower bound on the expression.

We can make the expression convex by negating the log and swapping the numerator and denominator.

$$\mathrm{KL}(q \parallel p) = \mathbb{E}_{q}\left[\log\left(\frac{q(x)}{p(x)}\right)\right] = \mathbb{E}_{q}\left[-\log\left(\frac{p(x)}{q(x)}\right)\right]$$

Now we can apply Jensen's Inequality.

$$\mathbb{E}_q\left[-\log\left(\frac{p(x)}{q(x)}\right)\right] \geq -\log\left(\mathbb{E}_q\left[\frac{p(x)}{q(x)}\right]\right) = -\log\left(\sum_x p(x)\right) = -\log(1) = \boxed{0}$$

Therefore,

$$\mathrm{KL}(q \parallel p) \geq 0$$

Now we will prove $\mathrm{KL}(q \parallel p) = 0 \iff q = p$.

For this to happen, the inequality in Jensen's Inequality must hold with equality.

This would mean that $\mathbb{E}_q[f(x)] = f\left(\mathbb{E}_q[x]\right)$, where $f$ is the convex function. The only way for this to happen is if all the points in the distribution ($x_1$ and $x_2$ in the graph above, where the distribution only has 2 points) coincide, so that there is no gap for the chord to sit above the function.

If $x_1$ and $x_2$ are equal in our earlier example (or any corresponding points between the distributions), then $\frac{p(x)}{q(x)}$ is a constant. This implies $p(x) = \text{constant} \cdot q(x) \implies \sum_x p(x) = \text{constant} \cdot \sum_x q(x) \implies \text{constant} = 1 \implies p = q$.

Relating KL-Divergence to MLE

$$\text{MLE} \iff \text{KL}$$

Assume $q$ is the given data distribution and $p$ is the "model". Consider if $p$ were part of some parametric family:

$$P_\theta, \quad \theta \in \Theta$$

Our learning problem becomes

$$\min_\theta \mathrm{KL}(q \parallel P_\theta) \iff \min_\theta \mathbb{E}_q\left[\log q(x) - \log P_\theta(x)\right]$$

Notice that $\log q(x)$ does not depend on $\theta$, so our learning problem is actually:

$$\max_\theta \mathbb{E}_q\left[\log P_\theta(x)\right]$$

This is approximately the average log-likelihood, because the expectation is taken over $q$ (the data distribution) and $\{x_i\} \sim q$:

$$\approx \max_\theta \frac{1}{n} \sum_{i=1}^n \log P_\theta(x_i) = \max_\theta \sum_{i=1}^n \log P_\theta(x_i)$$

Maximizing the average log-likelihood is the same as maximizing the log-likelihood (the constant factor $\frac{1}{n}$ doesn't change the maximizer), so minimizing the KL divergence to the data distribution recovers the MLE.

Conclusion

Maximum Likelihood Estimation (MLE) is a powerful and foundational statistical method for parameter estimation. By leveraging the principle of maximizing the likelihood function, MLE provides a systematic approach to infer the parameters of a model based on observed data. Its versatility is evident as it is applicable to a wide range of parametric families, from simple cases like a biased coin to more complex scenarios such as regression tasks or Gaussian distributions.

Throughout this blog post, we explored key aspects of MLE:

  1. Formal Definition: Understanding how MLE transforms parameter estimation into an optimization problem using likelihood or log-likelihood functions.
  2. Examples: From a biased coin to Gaussian and regression problems, these examples illustrated how MLE operates across different scenarios.
  3. Evaluation Metrics: Metrics like bias, variance, and MSE provide insights into the behavior and reliability of MLE estimators.
  4. Log-Odds and Numerical Methods: The flexibility of working with log-odds and the necessity of numerical solutions for complex models.
  5. Consistency of MLE: Demonstrating the nearly universal consistency of MLE, supported by a rigorous connection to Kullback-Leibler (KL) divergence.

Takeaways

  • MLE is intuitive and adaptable, making it a go-to method for parameter estimation across various domains.
  • Despite its power, MLE is not without limitations. It assumes that the chosen model accurately represents the underlying data distribution, and it can be sensitive to outliers or model misspecification.
  • The bias-variance tradeoff and consistency properties highlight the importance of understanding the behavior of MLE estimators, especially as sample sizes grow.

Practical Implications

Whether you're building predictive models, designing experiments, or analyzing data, MLE provides a strong statistical foundation. Its principles extend beyond traditional parametric models and are central to many machine learning algorithms.

By mastering MLE, you unlock a tool that combines theoretical rigor with practical applicability, empowering you to tackle a wide array of problems with confidence.
