# Exploratory data analysis (EDA)

How do you “look” at features and high-dimensional examples?

- Summary statistics
- Categorical Features
- Frequencies
- Mode
- Quantiles
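
As a minimal sketch of the categorical summaries above, the snippet below counts frequencies and picks the mode with the standard library. The helper name `categorical_summary` and the `colours` data are made up for illustration; they are not part of the notes.

```python
from collections import Counter


def categorical_summary(values):
    """Return raw counts, relative frequencies, and the mode of a categorical feature."""
    counts = Counter(values)
    n = len(values)
    # Relative frequency: the proportion of examples taking each value.
    frequencies = {value: count / n for value, count in counts.items()}
    # Mode: the most frequent value (ties are broken by first occurrence).
    mode = counts.most_common(1)[0][0]
    return counts, frequencies, mode


# Hypothetical toy feature.
colours = ["red", "blue", "red", "green", "red", "blue"]
counts, frequencies, mode = categorical_summary(colours)
print(counts)
print(frequencies)
print(mode)
```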

- Numerical Features
- Location
- Mean
- Median
- Quantiles

- Spread
- Range
- Variance
- Interquartile ranges
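
The location and spread statistics listed above can be sketched with Python's `statistics` module. The function name `numerical_summary`, the choice of population variance, the `"inclusive"` quantile method, and the `ages` data are all assumptions made for this example; other conventions for variance and quantiles are equally valid.

```python
import statistics


def numerical_summary(xs):
    """Compute simple location and spread statistics for a numerical feature."""
    # Quartiles: the three cut points that split the sorted data into four parts.
    q1, q2, q3 = statistics.quantiles(xs, n=4, method="inclusive")
    return {
        "mean": statistics.fmean(xs),
        "median": statistics.median(xs),
        "quartiles": (q1, q2, q3),
        "range": max(xs) - min(xs),
        # Population variance (divides by n); statistics.variance divides by n - 1.
        "variance": statistics.pvariance(xs),
        "iqr": q3 - q1,
    }


# Hypothetical toy feature with one large value.
ages = [21, 22, 22, 23, 25, 27, 30, 64]
summary = numerical_summary(ages)
for name, value in summary.items():
    print(name, value)
```

In this toy data the single large value pulls the mean well above the median and inflates the range, while the interquartile range is far less affected.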

- Location
- Entropy: measures the “randomness” of a set of values. Entropy is $-\sum_{c=1}^k p_c \log p_c$, where $p_c$ is the proportion of times the value $c$ occurs and $k$ is the number of distinct values; it ranges over $[0, \log k]$
- Low entropy means the values are very predictable, whereas high entropy means they are very unpredictable (roughly, a measure of spread)
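- Among continuous distributions with a given variance, the normal distribution has the *highest* entropy; for a categorical feature with $k$ values, the uniform distribution does (entropy $\log k$)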
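
The entropy formula above can be checked on small examples. This is a sketch that uses the natural logarithm and estimates each $p_c$ from observed counts; the function name `entropy` and the sample lists are illustrative assumptions.

```python
import math
from collections import Counter


def entropy(values):
    """Empirical entropy -sum_c p_c * log(p_c) of a list of categorical values."""
    n = len(values)
    # p_c is the proportion of times value c appears in the data.
    proportions = [count / n for count in Counter(values).values()]
    return sum(-p * math.log(p) for p in proportions)


# Hypothetical samples over at most k = 4 distinct values.
constant = ["a", "a", "a", "a"]          # perfectly predictable
skewed = ["a", "a", "a", "b"]            # mostly predictable
uniform = ["a", "b", "c", "d"]           # as unpredictable as possible for k = 4

print(entropy(constant))
print(entropy(skewed))
print(entropy(uniform), math.log(4))
```

The constant sample sits at the lower bound of $0$, and the uniform sample reaches the upper bound $\log k$ with $k = 4$.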

- Summary statistics are not always representative! Don’t mistake the map for the territory

- Categorical Features
- Distance or similarities
- Hamming distance: number of positions at which the elements of two vectors aren’t equal
- Euclidean distance: how far apart the vectors are (square root of the sum of squared differences)
- Correlation
- Jaccard coefficient: set similarity, size of the intersection over size of the union
- Edit distance: for strings, the minimum number of character edits (typically insertions, deletions, and substitutions) needed to go from one to the other
- Distance in latent space
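
Four of the measures above are short enough to sketch in plain Python. The function names are illustrative, and the edit distance shown is the Levenshtein variant (insertions, deletions, and substitutions each cost 1), which is one common choice rather than the only definition.

```python
import math


def hamming_distance(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def euclidean_distance(a, b):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def jaccard_coefficient(a, b):
    """Size of the intersection divided by size of the union of two sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # assumption: two empty sets are treated as identical
    return len(a & b) / len(a | b)


def edit_distance(s, t):
    """Levenshtein distance via dynamic programming, keeping one row at a time."""
    prev = list(range(len(t) + 1))  # distances from the empty prefix of s
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # delete cs
                            curr[j - 1] + 1,      # insert ct
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


print(hamming_distance([1, 0, 1, 1], [1, 1, 1, 0]))
print(euclidean_distance((0, 0), (3, 4)))
print(jaccard_coefficient({1, 2, 3}, {2, 3, 4}))
print(edit_distance("kitten", "sitting"))
```

Note that the Jaccard coefficient is a similarity (1 means identical sets); `1 - jaccard_coefficient(a, b)` gives the corresponding distance.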

- Visualizations
- Basic line plots
- Matrix plot: visualize two features as an image
- Correlation plot
- Can add colour to show a third feature (usually categorical)

- Scatterplot