
Bayesian Inference


Theory

Main Idea

Bayesian Inference is a statistical method that updates the probability of a hypothesis as more evidence becomes available.

In this context, a hypothesis can be:

  • A binary label in classification tasks
    • e.g. deciding whether or not to trust an alarm that predicts sun explosions with high accuracy.
  • A parameter to estimate
    • e.g. the commute time to work
    • This is useful because it quantifies the uncertainty of the estimate.
  • A set of parameters in regression tasks
    • e.g. the feature weights of a linear regressor that predicts house prices. Note that the weights will actually follow a PDF, and so will the predicted value.
    • This is useful for the same reason: it quantifies the uncertainty of the parameter estimates.

We will go through an example or two for each of these applications later.

The basic idea is that we assume the data is generated by a model whose parameter is described by our prior belief. We then calculate the probability distribution of the parameter conditioned on our data using Bayes' theorem, which acts as an update to our prior belief.

Maximum Likelihood Estimator vs Bayesian Inference

Recall Maximum Likelihood Estimation:

$$\hat{\theta} = \arg\max_{\theta} P(x \mid \theta)$$

In MLE, $\theta$ is unknown but deterministic (the frequentist view).

  • By deterministic I mean that there is only one $\theta$.
  • Trying to find $\hat{\theta}$ using MLE is synonymous with asking "What parameters best explain the observed data?"
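As a quick concrete illustration (a hypothetical coin-flip setup, not from the text): for i.i.d. Bernoulli observations, the likelihood $P(x \mid \theta)$ is maximized at the sample mean, so the MLE is a single number.

```python
import numpy as np

# Hypothetical i.i.d. Bernoulli data: 1 = heads, 0 = tails
flips = np.array([1, 1, 0, 1, 1, 0, 1, 1, 0, 1])

# For Bernoulli data, argmax_theta P(x | theta) has the closed form
# theta_hat = (number of heads) / (number of flips)
theta_hat = flips.mean()
print(theta_hat)  # 0.7
```

Note that $\hat{\theta}$ is a single point estimate with no notion of uncertainty attached; that is precisely what the Bayesian view adds.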

Bayesian Inference, on the other hand, views $\theta$ as a random variable (the Bayesian view) even though it is actually deterministic.

  • We can reconcile this with the fact that our random sample is, as the name implies, random, which causes $\theta$ to act as a random variable.
  • Instead of estimating a single parameter, we estimate a distribution, which gives a sort of confidence to our estimate.
  • Estimating the distribution of $\theta$ given our data is synonymous with asking "What should we believe about the parameter(s) given the data and our prior knowledge?"

Relationship to Bayes' Theorem

This estimated distribution of $\theta$ is actually just the posterior distribution acquired through Bayes' Theorem.

$$P(\theta \mid D) = \frac{P(D \mid \theta)\,P(\theta)}{P(D)}$$

Where:

$P(\theta \mid D)$ - Posterior Distribution

  • i.e. the conditional probability density function (PDF) of $\theta$ given the data $D$.

$P(D \mid \theta)$ - Likelihood

  • i.e. the conditional PDF of the data $D$ given our belief of what $\theta$ is.

$P(\theta)$ - Prior

  • i.e. the PDF of our prior belief of what $\theta$ is.

$P(D)$ - Marginal Distribution

  • i.e. the PDF of the data
  • It is equivalent to $\int_{-\infty}^{\infty} P(D \mid \theta)P(\theta)\,d\theta$ (law of total probability)

The marginal distribution actually isn't required for Bayesian inference.

  • This is because our data is fixed and $P(D)$ doesn't depend on $\theta$. It is therefore a constant, and we can write:

$$\frac{P(D \mid \theta)P(\theta)}{\int_{-\infty}^{\infty} P(D \mid \theta)P(\theta)\,d\theta} \propto P(D \mid \theta)P(\theta) \implies \boxed{P(\theta \mid D) \propto P(D \mid \theta)P(\theta)}$$

You can think of the posterior distribution as a weighted compromise between the likelihood and the prior. As $n \rightarrow \infty$ (i.e. we get infinite data) the importance of the prior decreases. If $n$ is close to 0, then the prior has more influence.
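This effect is easy to see numerically. Below is a small sketch (a hypothetical coin-flip setting with a made-up prior, not from the text) that evaluates $P(\theta \mid D) \propto P(D \mid \theta)P(\theta)$ on a grid of candidate $\theta$ values:

```python
import numpy as np

# Grid approximation of P(theta | D) ∝ P(D | theta) P(theta)
# for a coin with unknown heads-probability theta (hypothetical setup).
theta = np.linspace(0.001, 0.999, 999)           # grid of candidate parameters
prior = np.exp(-0.5 * ((theta - 0.5) / 0.1)**2)  # prior belief: theta near 0.5
prior /= prior.sum()

def posterior(n_heads, n_flips):
    # Bernoulli likelihood times prior, normalized over the grid
    likelihood = theta**n_heads * (1 - theta)**(n_flips - n_heads)
    unnorm = likelihood * prior
    return unnorm / unnorm.sum()

# With little data the posterior stays near the prior mean (0.5);
# with lots of data it moves toward the empirical frequency (0.8).
small = posterior(4, 5)
large = posterior(800, 1000)
print(theta[small.argmax()], theta[large.argmax()])
```

With 5 flips the prior dominates; with 1000 flips the data does, which is exactly the behavior described above.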

Why is Bayesian Inference helpful?

In the context of deciding whether to trust a classifier that has low error but detects an event with low probability (itself a binary classification task), Bayesian Inference can be extremely useful because it helps us balance two conflicting views:

  1. Trust the classifier because it has low error.
  2. Don't trust it, because the prior probability of the event is so low.

Consider the following example:

Example - Binary Classification
We have a device that detects if the sun explodes with high accuracy.

We are given:


$\alpha$ - the error rate of the device; known and fixed (e.g. $\alpha = 0.0001$)
$\theta \in \{0, 1\}$ - indicates whether the sun explodes ($\theta = 1$ if it explodes)
$X \in \{0, 1\}$ - indicates whether the alarm on the device fires ($X = 1$ if it fires, predicting that the sun is exploding)
$P(x = \theta \mid \theta) = 1 - \alpha$ - the probability that the prediction is correct
$P(x = 1 - \theta \mid \theta) = \alpha$ - the probability that the prediction is incorrect


If the alarm fires ($x = 1$), should we believe it?

One might initially think yes due to the extremely low error, but this can be misleading. Because the probability of the sun actually exploding is so small, a small error rate may not be enough.

  • The classifier might be good at classifying when the sun hasn't exploded but not so great at detecting when it has exploded.

Let's try to decide using MLE:

$$\hat{\theta} = \arg\max_{\theta \in \{0, 1\}} P(x = 1 \mid \theta), \qquad P(x = 1 \mid \theta) = \begin{cases} \alpha & \text{if } \theta = 0 \\ 1 - \alpha & \text{if } \theta = 1 \end{cases}$$

We have two choices in this case. Because we've already established that $\alpha$ is small, we know that MLE will output $\hat{\theta} = 1$, because $1 - \alpha > \alpha$.

Now let's try with Bayesian Inference:

Step 1: Decide on prior probability

$$P(\theta) = \begin{cases} \beta \triangleq 10^{-100000} & \text{if } \theta = 1 \\ 1 - \beta \approx 1 & \text{if } \theta = 0 \end{cases}$$

Step 2: Calculate posterior probability

$$P(\theta \mid x = 1) = \frac{P(x = 1 \mid \theta)P(\theta)}{P(x = 1)} \propto P(x = 1 \mid \theta)P(\theta) = \begin{cases} \alpha(1 - \beta) & \text{if } \theta = 0 \\ (1 - \alpha)\beta & \text{if } \theta = 1 \end{cases}$$

Notice that the posterior is the same type of distribution as the likelihood and the prior; in this case they are all Bernoulli distributions.

In this case we could decide to trust the output by calculating

$$\hat{\theta} = \arg\max_{\theta} P(\theta \mid x = 1)$$

where we trust it if $\hat{\theta} = 1$ and we don't otherwise. We trust the device $\iff$

$$(1 - \alpha)\beta > \alpha(1 - \beta) \implies \frac{\beta}{1 - \beta} > \frac{\alpha}{1 - \alpha} \implies \beta > \alpha$$

In our case $\beta \approx 0$ and $\alpha = 0.0001$, so we should not trust the alarm when it outputs $x = 1$.
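The decision rule above can be sketched in code. The tiny prior $\beta = 10^{-100000}$ underflows ordinary floating point, so a sketch like this works in log space (the values of $\alpha$ and $\beta$ are the ones from the text):

```python
import math

alpha = 1e-4                        # device error rate (known and fixed)
log_beta = -100000 * math.log(10)   # log P(theta = 1), since beta = 10^{-100000}
log_one_minus_beta = 0.0            # log P(theta = 0); log(1 - beta) ~ 0 for tiny beta

# Unnormalized log-posteriors: log P(x=1 | theta) + log P(theta)
log_post_exploded = math.log(1 - alpha) + log_beta    # theta = 1
log_post_fine = math.log(alpha) + log_one_minus_beta  # theta = 0

# argmax of the posterior: trust the alarm only if theta = 1 wins
trust_alarm = log_post_exploded > log_post_fine
print(trust_alarm)  # False: beta < alpha, so we do not trust the alarm
```

Comparing log-posteriors is equivalent to comparing $(1 - \alpha)\beta$ against $\alpha(1 - \beta)$, i.e. the $\beta > \alpha$ condition.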


Now let's look at a more complicated example.

Example - Parameter Estimation

You moved to a new apartment, and your friend told you that the commute time is $30 \pm 10$ minutes (prior).

You also drove yourself a few times and found the times $D = \{25, 45, 30, 50\}$.

Your task is to predict the commute time.

Let's start with some definitions:


$\theta$ - the commute time we want to estimate
$\theta \sim \mathcal{N}(\mu_0, \sigma_0^2) \implies P(\theta) = \mathcal{N}(\mu_0, \sigma_0^2)$ - the prior distribution
$\mu_0 = 30$ - the mean of the prior
$\sigma_0 = 10$ - the standard deviation of the prior
$\xi_i \sim \mathcal{N}(0, 1)$ - used to simulate the noise in our datapoints
$\sigma_1 = 5$ (or something else) - the standard deviation of the datapoints
$D = \{x_1, \dots, x_n\}$ - our datapoints
$x_i = \theta + \sigma_1 \xi_i \implies P(x_i \mid \theta) = \mathcal{N}(\theta, \sigma_1^2)$ - the likelihood of each datapoint given our belief
$\mathcal{N}(\theta, \sigma_1^2) = \frac{1}{\sqrt{2\pi}\,\sigma_1}\exp\left(-\frac{(x_i - \theta)^2}{2\sigma_1^2}\right) \propto \exp\left(-\frac{(x_i - \theta)^2}{2\sigma_1^2}\right)$ - proportional because the normalization constant doesn't depend on $\theta$, so it doesn't change during inference
$P(\theta \mid D) = \mathcal{N}(\mu_p, \sigma_p^2)$ - our posterior distribution


We've already decided on a prior so now we just calculate the posterior distribution.

$$P(\theta \mid D) = \frac{P(D \mid \theta)P(\theta)}{P(D)} \propto P(D \mid \theta)P(\theta) = \left[\prod_{i=1}^{n} P(x_i \mid \theta)\right]P(\theta)$$

Because each Gaussian PDF is proportional to its exponential factor, we can say

$$\left[\prod_{i=1}^{n} P(x_i \mid \theta)\right]P(\theta) \propto \left[\prod_{i=1}^{n}\exp\left(-\frac{(x_i - \theta)^2}{2\sigma_1^2}\right)\right]\exp\left(-\frac{(\mu_0 - \theta)^2}{2\sigma_0^2}\right)$$

We then move the product inside the exponent, where it becomes a sum:

$$= \exp\left(-\left[\sum_{i=1}^{n}\frac{(x_i - \theta)^2}{2\sigma_1^2}\right] - \frac{(\mu_0 - \theta)^2}{2\sigma_0^2}\right)$$

The term inside the exponent is a quadratic function of $\theta$, so we can write it as

$$= \exp\left(-\tfrac{1}{2}\left(A\theta^2 - 2B\theta + C\right)\right)$$

Where

$$A = \sum_{i=1}^{n}\frac{1}{\sigma_1^2} + \frac{1}{\sigma_0^2} = \boxed{\frac{n}{\sigma_1^2} + \frac{1}{\sigma_0^2}}$$

$$B = \sum_{i=1}^{n}\frac{x_i}{\sigma_1^2} + \frac{\mu_0}{\sigma_0^2}$$

$$C = \text{some constant we don't care about}$$

Completing the square, we find

$$= \exp\left(-\tfrac{1}{2}A\left(\theta - \tfrac{B}{A}\right)^2 + C\right)$$

which has the form of a Gaussian exponent. Therefore we can deduce that

$$P(\theta \mid D) = \mathcal{N}\left(\frac{B}{A}, \frac{1}{A}\right) = \mathcal{N}(\mu_p, \sigma_p^2)$$

We just found:

  • The posterior is proportional to the exponential of a quadratic function.
  • By rearranging the quadratic, we see that this is a Gaussian with mean $\frac{B}{A}$ and variance $\frac{1}{A}$.
  • This is our posterior distribution.

Recall earlier when I said that the posterior distribution is like a weighted compromise between the likelihood and the prior, and that as $n \rightarrow \infty$ the prior distribution would become less and less relevant. We can illustrate this mathematically.

We know

$$\mu_p = \frac{B}{A} = \frac{\left[\sum_{i=1}^{n}\frac{x_i}{\sigma_1^2}\right] + \frac{\mu_0}{\sigma_0^2}}{\frac{n}{\sigma_1^2} + \frac{1}{\sigma_0^2}}, \qquad \sigma_p^2 = \frac{1}{A} = \left(\frac{n}{\sigma_1^2} + \frac{1}{\sigma_0^2}\right)^{-1}$$

First, consider when we have no data ($n = 0$).

$$\mu_p = \frac{\left[\sum_{i=1}^{n}\frac{x_i}{\sigma_1^2}\right] + \frac{\mu_0}{\sigma_0^2}}{\frac{n}{\sigma_1^2} + \frac{1}{\sigma_0^2}} = \mu_0, \qquad \sigma_p^2 = \left(\frac{n}{\sigma_1^2} + \frac{1}{\sigma_0^2}\right)^{-1} = \sigma_0^2$$

Next, consider when $n$ is really large.

$$\mu_p = \frac{\left[\sum_{i=1}^{n}\frac{x_i}{\sigma_1^2}\right] + \frac{\mu_0}{\sigma_0^2}}{\frac{n}{\sigma_1^2} + \frac{1}{\sigma_0^2}} \approx \frac{\sum_{i=1}^{n} x_i}{n} = \text{empirical mean}$$

$$\sigma_p^2 = \left(\frac{n}{\sigma_1^2} + \frac{1}{\sigma_0^2}\right)^{-1} \approx \left(\frac{n}{\sigma_1^2}\right)^{-1} = \frac{\sigma_1^2}{n} \rightarrow 0$$

In this example, we estimated the posterior probability distribution of the commute time, parameterized by $\mu_p$ and $\sigma_p^2$.
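Plugging the numbers from this example into the closed-form posterior is straightforward. A short sketch (assuming $\sigma_1 = 5$, as suggested above):

```python
import numpy as np

mu0, sigma0 = 30.0, 10.0                # prior: friend says 30 +/- 10 minutes
sigma1 = 5.0                            # assumed noise std deviation of each drive
D = np.array([25.0, 45.0, 30.0, 50.0])  # measured commute times
n = len(D)

# Posterior N(B/A, 1/A) from the derivation above
A = n / sigma1**2 + 1 / sigma0**2
B = D.sum() / sigma1**2 + mu0 / sigma0**2
mu_p = B / A     # posterior mean
var_p = 1 / A    # posterior variance

# The posterior mean lands between the prior mean (30) and the
# empirical mean (37.5), pulled mostly toward the data.
print(mu_p, var_p**0.5)
```

With only four datapoints the prior still tugs the estimate noticeably; adding more drives would push $\mu_p$ toward the empirical mean and shrink $\sigma_p^2$ toward zero.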

This idea of parameter estimation for the posterior distribution of some value can be extended to linear regression.

Example - Bayesian Linear Regression

Given data points $\{x_i, y_i\}_{i=1}^n \triangleq D$, our task is to find $\theta$ such that $y \approx x^T\theta + b$, or $y \approx \tilde{x}^T\tilde{\theta}$ when $b$ is incorporated into $\tilde{\theta}$.

So how do we estimate $\tilde{\theta}$?
Recall that in regular linear regression we use the Least Squares objective function.

$$\hat{\theta} = \arg\min_{\theta}\sum_{i=1}^{n}\left(y_i - x_i^T\theta\right)^2$$

This approach gives a deterministic point estimate. We may also want to know the uncertainty of that estimate, and we can use Bayesian Inference for this!

To do this, we follow the same steps we've been following in the previous examples with a few caveats:

  1. Treat $\theta$ as a random variable.
  2. Assume a prior: you can decide this yourself if you have prior knowledge, or you can set a default prior.
    $P(\theta) = \mathcal{N}(\mu_0, \sigma_0^2 I)$
    Typical default parameters are $\mu_0 = 0$ and $\sigma_0^2 = \text{some large number}$.
  3. Calculate the likelihood by assuming a Gaussian model.
  3. Calculate the likelihood by assuming a Gaussian model

$$y_i = x_i^T\theta + \sigma_1\xi_i$$

where
$\sigma_1$ - the standard deviation of the noise
$\xi_i \sim \mathcal{N}(0, 1)$ - Gaussian noise


By the chain rule we know
$$P(\{y_i, x_i\} \mid \theta) = P(y_i \mid x_i, \theta)\,P(x_i)$$

Notice that $P(x_i)$ isn't dependent on $\theta$: $\theta$ only governs how $y_i$ is generated, not $x_i$. Therefore, instead of using the joint probability, we can use the conditional probability (up to a constant) when calculating the likelihood.

$$P(\{y_i, x_i\} \mid \theta) \propto P(y_i \mid x_i, \theta)$$

Using this knowledge, let's work out the math to calculate the posterior.

$$\begin{aligned}
P(\theta \mid D) &= \frac{P(D \mid \theta)P(\theta)}{P(D)} \propto P(D \mid \theta)P(\theta) = \left[\prod_{i=1}^{n} P(\{x_i, y_i\} \mid \theta)\right]P(\theta) \\
&= \left[\prod_{i=1}^{n} P(y_i \mid x_i, \theta)\,P(x_i)\right]P(\theta) \\
&\propto \left[\prod_{i=1}^{n} P(y_i \mid x_i, \theta)\right]P(\theta) \\
&\propto \left[\prod_{i=1}^{n} \exp\left(-\frac{(y_i - x_i^T\theta)^2}{2\sigma_1^2}\right)\right]\exp\left(-\frac{\|\theta - \mu_0\|^2}{2\sigma_0^2}\right) \\
&= \exp\left(-\left[\sum_{i=1}^{n}\frac{(y_i - x_i^T\theta)^2}{2\sigma_1^2}\right] - \frac{\|\theta - \mu_0\|^2}{2\sigma_0^2}\right)
\end{aligned}$$

Just like in the previous example, the expression inside the exponent is a quadratic function of $\theta$.

$$= \exp\left(-\tfrac{1}{2}\left(\theta^T A\theta - 2B^T\theta + C\right)\right)$$

Because of this, we can conclude that the posterior distribution will be a Gaussian parameterized as follows:

$$P(\theta \mid D) = \mathcal{N}(A^{-1}B, A^{-1})$$

Where

$$A = \left[\sum_{i=1}^{n}\frac{x_i x_i^T}{\sigma_1^2}\right] + \frac{I}{\sigma_0^2}, \qquad B = \left[\sum_{i=1}^{n}\frac{x_i y_i}{\sigma_1^2}\right] + \frac{\mu_0}{\sigma_0^2}, \qquad \mu_p = A^{-1}B, \qquad \Sigma_p = A^{-1}$$

If you work the math out, you will see the same properties as in the previous example: if we have no data, the posterior is the prior, and as $n \rightarrow \infty$ the prior becomes negligible and we move toward the empirical estimates.
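As a sanity check, the closed-form posterior can be computed directly. The sketch below uses synthetic data with made-up true weights $[2, 1]$ (slope and bias) and the default prior $\mu_0 = 0$ with a large $\sigma_0^2$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y = 2*x + 1 + noise, with the bias folded in
# via x_tilde = [x, 1] as described above
n = 200
x = rng.uniform(-3, 3, size=n)
X = np.column_stack([x, np.ones(n)])  # design matrix of x_tilde rows
sigma1 = 0.5                          # noise standard deviation
y = X @ np.array([2.0, 1.0]) + sigma1 * rng.standard_normal(n)

# Broad default prior: theta ~ N(0, sigma0^2 I)
mu0 = np.zeros(2)
sigma0 = 100.0

# Posterior N(A^{-1} B, A^{-1}) with A and B from the derivation above
A = X.T @ X / sigma1**2 + np.eye(2) / sigma0**2
B = X.T @ y / sigma1**2 + mu0 / sigma0**2
mu_p = np.linalg.solve(A, B)  # posterior mean of the weights
Sigma_p = np.linalg.inv(A)    # posterior covariance (uncertainty)

print(mu_p)  # close to the true weights [2, 1]
```

The diagonal of `Sigma_p` gives a variance for each weight, which is exactly the uncertainty quantification that plain least squares doesn't provide.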

Conclusion

In summary, Bayesian Inference provides a powerful framework for making probabilistic predictions and quantifying uncertainty. Unlike traditional Maximum Likelihood Estimation (MLE), which seeks a single best-fit parameter, Bayesian Inference treats parameters as random variables, allowing us to capture the uncertainty in our estimates. By combining prior beliefs with observed data using Bayes' theorem, Bayesian Inference produces a posterior distribution that reflects both our prior knowledge and new evidence.

Through examples in binary classification, parameter estimation, and Bayesian linear regression, we explored how Bayesian methods incorporate prior knowledge and dynamically update beliefs as more data becomes available. This approach is especially useful in cases with limited data, where prior information significantly informs predictions. As the dataset grows, Bayesian Inference gradually relies more on the observed data, diminishing the influence of prior beliefs.

Bayesian Inference is a versatile tool that enhances predictive modeling by providing a confidence interval around predictions, helping to assess the reliability of predictions even in complex scenarios with low-probability events. This capacity for uncertainty quantification makes Bayesian approaches particularly valuable in fields where decision-making under uncertainty is crucial, from scientific research to real-world applications like risk assessment and financial forecasting.
