Find observations that are unusually different from the others (aka anomaly detection).

Why? We may want to remove outliers before further analysis, or we may be interested in the outliers themselves (e.g. security).

**Generally does not work**. It can be hard to decide when to report an outlier. There are always new ways to make outliers!

## 5 types of outlier detection

- Model-based methods
    - Check whether the z-score of a point is past a certain threshold
    - Unfortunately, the z-score assumes uni-modal data
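
As a rough illustration of the z-score test and its uni-modal assumption, here is a minimal sketch. The function name `zscore_outliers`, the threshold of 2.5, and the toy data are assumptions made for this example, not part of the notes.

```python
from statistics import mean, pstdev

def zscore_outliers(values, threshold=2.5):
    """Return indices of values whose absolute z-score exceeds the threshold."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return []  # all values identical, nothing stands out
    return [i for i, v in enumerate(values) if abs((v - mu) / sigma) > threshold]

# Uni-modal toy data with one extreme value at index 9.
unimodal = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3, 10.1, 25.0]
print(zscore_outliers(unimodal))

# Bi-modal toy data: 5.0 sits between the two modes, far from every other
# point, but it equals the mean, so its z-score is about 0 and it is missed.
bimodal = [0.0, 0.1, -0.1, 10.0, 10.1, 9.9, 5.0]
print(zscore_outliers(bimodal))
```

Note that with the population standard deviation, the largest possible absolute z-score among $n$ points is $\sqrt{n-1}$, so a threshold of 3 can never fire on a sample of only 10 points; that is why this sketch uses a lower threshold.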

- Graphical approaches
    - Look at a plot; a human decides whether a point is an outlier
    - Unfortunately, this only works in at most 2-3 dimensions

- Cluster-based methods
    - Cluster the data
    - Find points that do not belong to any cluster
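
The notes do not say which clustering algorithm to use, so the sketch below swaps in a very simple one: points are linked whenever they lie within `link_radius` of each other, and any resulting group smaller than `min_cluster_size` is treated as "not belonging to a cluster". The function name and both parameters are assumptions made for illustration.

```python
import math

def cluster_outliers(points, link_radius, min_cluster_size):
    """Group points by distance-based linking; flag members of tiny groups."""
    n = len(points)
    labels = [-1] * n  # -1 means "not yet assigned to a group"
    cluster_id = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        # Grow a new group from this point by following links.
        labels[start] = cluster_id
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if labels[j] == -1 and math.dist(points[i], points[j]) <= link_radius:
                    labels[j] = cluster_id
                    stack.append(j)
        cluster_id += 1
    # Count group sizes and report points in groups that are too small.
    sizes = {}
    for lab in labels:
        sizes[lab] = sizes.get(lab, 0) + 1
    return [i for i, lab in enumerate(labels) if sizes[lab] < min_cluster_size]

points = [(0, 0), (0, 1), (1, 0), (1, 1),   # first cluster
          (10, 10), (10, 11), (11, 10),     # second cluster
          (5, 5)]                           # isolated point
print(cluster_outliers(points, link_radius=1.5, min_cluster_size=3))
```

The result depends heavily on the two parameters, which reflects the general difficulty of deciding when to report an outlier.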

- Distance-based methods
    - How many points lie within a radius $\epsilon$ of each point?
    - Global outliers
        - For each point, compute the average distance to its $k$ nearest neighbours (KNN)
        - Outliers are points that are far from their KNNs
    - Local outliers
        - Outlierness ratio of example $i$: the average distance of $i$ to its KNNs, divided by the average distance of the neighbours of $i$ to their own KNNs
        - A ratio well above 1 suggests that $i$ is an outlier relative to its neighbourhood
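
The three distance-based scores above can be sketched as follows. The helper names, the brute-force neighbour search, and the toy data are assumptions made for this example; the ratio is undefined (division by zero) if a point's neighbours are all duplicates of one another.

```python
import math

def count_within_radius(points, i, eps):
    """Number of other points within distance eps of point i."""
    return sum(1 for j in range(len(points))
               if j != i and math.dist(points[i], points[j]) <= eps)

def knn_indices(points, i, k):
    """Indices of the k nearest neighbours of point i (brute force)."""
    dists = sorted((math.dist(points[i], points[j]), j)
                   for j in range(len(points)) if j != i)
    return [j for _, j in dists[:k]]

def avg_knn_distance(points, i, k):
    """Global score: average distance from point i to its k nearest neighbours."""
    return sum(math.dist(points[i], points[j]) for j in knn_indices(points, i, k)) / k

def outlierness_ratio(points, i, k):
    """Local score: own average KNN distance over the neighbours' average KNN distance."""
    neighbours = knn_indices(points, i, k)
    own = avg_knn_distance(points, i, k)
    neighbour_avg = sum(avg_knn_distance(points, j, k) for j in neighbours) / k
    return own / neighbour_avg

pts = [(0, 0), (0.1, 0), (0, 0.1), (0.1, 0.1),   # dense cluster (indices 0-3)
       (10, 10), (12, 10), (10, 12), (12, 12),   # sparse cluster (indices 4-7)
       (1, 1)]                                   # near the dense cluster (index 8)

for i in (0, 4, 8):
    print(i, round(avg_knn_distance(pts, i, 2), 3), round(outlierness_ratio(pts, i, 2), 3))
```

In this toy data, point 8 has a smaller average KNN distance than the sparse-cluster points, so the global score does not single it out, while its outlierness ratio is far above 1 because its neighbours sit in a much denser region.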

- Supervised-learning methods
    - Use supervised learning: $y_{i}=1$ if $x_{i}$ is an outlier, $y_{i}=0$ if $x_{i}$ is a regular point
    - Needs supervision: we need to know what outliers look like
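
To make the supervised setup concrete, the sketch below uses a 1-nearest-neighbour classifier as a stand-in for "supervised learning"; the notes do not name a specific model, and the function name and toy data are assumptions. It also illustrates the limitation: a new kind of outlier that resembles nothing in the labelled outliers is classified as regular.

```python
import math

def predict_1nn(train_x, train_y, x):
    """Predict the label of x as the label of its nearest training example."""
    nearest = min(range(len(train_x)), key=lambda i: math.dist(train_x[i], x))
    return train_y[nearest]

# Labelled training data: y = 0 for regular points, y = 1 for known outliers.
train_x = [(0.0, 0.0), (0.2, -0.1), (-0.1, 0.3), (5.0, 5.0), (5.2, 4.9)]
train_y = [0, 0, 0, 1, 1]

# Resembles the known outliers, so it is detected.
print(predict_1nn(train_x, train_y, (4.8, 5.1)))

# A new kind of outlier: far from everything, but closest to the regular
# points, so this classifier labels it as regular.
print(predict_1nn(train_x, train_y, (0.0, -6.0)))
```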

## Local vs global outliers

It’s hard to precisely define “outliers”.

- In the first case it was a “global” outlier.
- In this second case it’s a “local” outlier:
    - Within the normal data range, but far from other points.