Exploratory data analysis (EDA)

Last updated Sep 9, 2022

How do you “look” at features and high-dimensional examples?

1. Summary statistics
• Categorical Features: frequencies, mode, and quantiles (quantiles only make sense when the categories are ordered)
• Numerical Features: location (mean, median, quantiles) and spread (range, variance, interquartile range)
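• Entropy: measures the “randomness” of a set of values. It is $-\sum_{c=1}^k p_c \log p_c$, where $p_c$ is the proportion of times value $c$ occurs, and it ranges over $[0, \log k]$
• Low entropy means the values are very predictable, whereas high entropy means they are very unpredictable (roughly, spread)
• For $k$ categorical values, the uniform distribution has the highest entropy ($\log k$); among continuous distributions with a fixed mean and variance, the normal distribution has the highest (differential) entropy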
• Not always representative! Very different datasets can share the same mean and variance. Don’t mistake the map for the territory
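The summary statistics above can be sketched with Python's standard library alone. This is a minimal illustration, not a reference implementation: the `entropy` and `numeric_summary` helpers and the `colours` and `heights` data are hypothetical names and values made up for the example, and the entropy uses the natural logarithm.

```python
import math
import statistics
from collections import Counter


def entropy(values):
    """Entropy -sum_c p_c * log(p_c) of the observed values (natural log)."""
    n = len(values)
    proportions = [count / n for count in Counter(values).values()]
    return sum(-p * math.log(p) for p in proportions)


def numeric_summary(xs):
    """Location and spread statistics for a numerical feature."""
    q1, _, q3 = statistics.quantiles(xs, n=4)  # quartile cut points
    return {
        "mean": statistics.mean(xs),
        "median": statistics.median(xs),
        "range": max(xs) - min(xs),
        "variance": statistics.pvariance(xs),  # population variance
        "iqr": q3 - q1,                        # interquartile range
    }


# Hypothetical categorical feature: frequencies, mode, entropy
colours = ["red", "blue", "red", "green", "red", "blue"]
print(Counter(colours).most_common())
print(statistics.mode(colours))
print(entropy(colours))

# Hypothetical numerical feature with one large value
heights = [150.0, 160.0, 165.0, 170.0, 172.0, 180.0, 210.0]
summary = numeric_summary(heights)
print(summary)
```

With $k$ equally frequent values the entropy reaches its maximum $\log k$, and with a single repeated value it is 0, which matches the range stated above.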
2. Distance or similarities
• Hamming distance: number of positions at which two equal-length vectors differ
• Euclidean distance: how far apart are the vectors (square root of the sum of squared differences)
• Correlation
• Jaccard coefficient: set similarity, size of the intersection over size of the union (1 minus the coefficient gives a distance)
• Edit distance: for strings, how many characters do I need to insert, delete, or substitute to go from one to the other
• Distance in latent space
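Several of the measures above are short enough to write out directly. The following is a rough sketch in plain Python; the function names are chosen for this example, `edit_distance` is assumed to mean the Levenshtein distance (unit cost for insertions, deletions, and substitutions), and `pearson` is assumed to be the Pearson correlation coefficient.

```python
import math


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def euclidean(a, b):
    """Square root of the sum of squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pearson(a, b):
    """Pearson correlation between two equal-length numeric sequences."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def jaccard(a, b):
    """Jaccard coefficient: |A intersect B| / |A union B|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention chosen here: two empty sets are identical
    return len(a & b) / len(a | b)


def edit_distance(s, t):
    """Levenshtein distance via dynamic programming over one rolling row."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # delete from s
                            curr[j - 1] + 1,      # insert into s
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


print(hamming("karolin", "kathrin"))
print(euclidean([0, 0], [3, 4]))
print(jaccard({1, 2, 3}, {2, 3, 4}))
print(edit_distance("kitten", "sitting"))
```

Note that correlation and the Jaccard coefficient are similarities (larger means more alike), while the others are distances (larger means less alike).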
3. Visualizations
• Basic line plots
• Matrix plot: visualize a matrix of values (such as examples × features) as an image
• Correlation plot
• Can add colour to show a third feature (usually categorical)
• Scatterplot