Naive Bayes is an example of a probabilistic classifier. It is commonly used in spam filters (classify as spam if the probability of spam is higher than the probability of not spam).

To model this, it uses Bayes' rule:

$P(y_{i}=\text{spam}\mid x_{i})=\dfrac{P(x_{i}\mid y_{i}=\text{spam})\,P(y_{i}=\text{spam})}{P(x_{i})}$

Where

- $P(y_{i}=spam)$ is the marginal probability that an e-mail is spam
- $P(x_{i})$ is the marginal probability that an e-mail has the set of words $x_{i}$
    - Hard to approximate (lots of ways to combine words)

- $P(x_{i}∣y_{i}=spam)$ is the conditional probability that a spam e-mail has the words $x_{i}$
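The three quantities above are all we need to apply Bayes' rule. A minimal numeric sketch (all probabilities here are made-up illustration values, not estimates from any dataset):

```python
# Hypothetical numbers, just to make Bayes' rule concrete.
p_spam = 0.4                    # P(y_i = spam): marginal probability of spam
p_words_given_spam = 0.05       # P(x_i | y_i = spam)
p_words_given_not_spam = 0.001  # P(x_i | y_i = not spam)

# P(x_i) via the law of total probability: sum over both labels.
p_words = p_words_given_spam * p_spam + p_words_given_not_spam * (1 - p_spam)

# Bayes' rule: posterior = likelihood * prior / evidence.
p_spam_given_words = p_words_given_spam * p_spam / p_words
print(round(p_spam_given_words, 3))  # → 0.971
```

Note that the awkward term $P(x_{i})$ only shows up as a normalizing denominator, which motivates the optimization below.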

## Optimizations

### Denominator doesn’t matter

We can actually reframe this to avoid calculating $P(x_{i})$, as Naive Bayes just returns spam if $P(y_{i}=\text{spam}\mid x_{i})>P(y_{i}=\text{not spam}\mid x_{i})$

Roughly, the denominator doesn't matter: both posteriors share the same $P(x_{i})$, so dividing by it can't change which side of the comparison is bigger.

$P(y_{i}=\text{spam}\mid x_{i})\propto P(x_{i}\mid y_{i}=\text{spam})\,P(y_{i}=\text{spam})$
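A quick sketch of why dropping the denominator is safe (the likelihoods and priors are made-up numbers):

```python
# Hypothetical likelihoods and priors, chosen only for illustration.
p_x_given_spam, p_spam = 0.05, 0.4
p_x_given_ham, p_ham = 0.001, 0.6

score_spam = p_x_given_spam * p_spam  # numerator of Bayes' rule only
score_ham = p_x_given_ham * p_ham
p_x = score_spam + score_ham          # the shared denominator P(x_i)

# Dividing both scores by the same P(x_i) cannot change which is bigger,
# so the classification decision is identical with or without it.
assert (score_spam > score_ham) == (score_spam / p_x > score_ham / p_x)
print("spam" if score_spam > score_ham else "not spam")  # → spam
```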

### Conditional Independence Assumption

Additionally, we assume that *all* features $x_{ij}$ are conditionally independent given the label $y_{i}$, so we can decompose the likelihood into a product over the $d$ features:

$P(x_{i}\mid y_{i})\,P(y_{i})\approx\left(\prod_{j=1}^{d}P(x_{ij}\mid y_{i})\right)P(y_{i})$
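Putting the two optimizations together gives the Naive Bayes decision rule: score each label by its prior times the product of per-word conditional probabilities, and predict the label with the higher score. A minimal sketch (the vocabulary and all probabilities are made up):

```python
# Made-up per-word conditional probabilities P(word | label) and priors P(label).
p_word_given = {
    "spam":     {"free": 0.8, "lactase": 0.1},
    "not spam": {"free": 0.2, "lactase": 0.3},
}
prior = {"spam": 0.4, "not spam": 0.6}

def score(label, words):
    # P(y_i) * prod_j P(x_ij | y_i): the unnormalized posterior.
    s = prior[label]
    for w in words:
        s *= p_word_given[label][w]
    return s

email = ["free", "lactase"]
prediction = max(prior, key=lambda y: score(y, email))
print(prediction)  # → not spam
```

In practice the product of many small probabilities underflows, so implementations sum log-probabilities instead; the comparison is unchanged because log is monotonic.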

## Laplace Smoothing

If we have no spam messages with lactase, then $P(\text{lactase}\mid\text{spam})=0$, which zeroes out the whole product, so spam messages with lactase automatically get through!

Our estimate of $P(\text{lactase}\mid\text{spam})$ is $\dfrac{\#\text{ spam messages with lactase}}{\#\text{ spam messages}}=\dfrac{0}{\#\text{ spam messages}}$

We can add $β$ to the numerator and $βk$ to the denominator, which effectively adds $βk$ fake examples: $β$ for each of the $k$ possible classes ($k=2$ for a binary classifier).

So for our binary spam classifier (with $β=1$):

$\dfrac{\#\text{ spam messages with lactase}+1}{\#\text{ spam messages}+2}$
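A minimal sketch of the smoothed estimate, assuming made-up counts (100 training spam messages, none containing "lactase"):

```python
# Hypothetical training counts, for illustration only.
n_spam = 100              # number of spam messages
n_spam_with_lactase = 0   # "lactase" never appears in a training spam message
beta, k = 1, 2            # beta fake examples per class; k = 2 classes (binary)

p_plain = n_spam_with_lactase / n_spam                          # = 0: the problem
p_smoothed = (n_spam_with_lactase + beta) / (n_spam + beta * k)  # = 1/102

print(p_plain, round(p_smoothed, 4))  # the smoothed estimate is no longer 0
```

With $β=1$ the estimate is never exactly 0 or 1, so a single unseen word can no longer veto the whole product.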