How do we “look” at features and high-dimensional examples?
- Summary statistics
- Categorical Features
- Frequencies
- Mode
- Quantiles
- Numerical Features
- Location
- Mean
- Median
- Quantiles
- Spread
- Range
- Variance
- Interquartile ranges
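- Entropy: measures the “randomness” of a set of categorical values, where entropy is $-\sum_{c=1}^{k} p_c \log p_c$ and $p_c$ is the proportion of times you have value $c$ (taking $0 \log 0 = 0$); it ranges from $0$ to $\log k$
- Low entropy means the values are very predictable, whereas high entropy means they are very unpredictable (roughly, a measure of spread)
- For categorical values, the uniform distribution has the highest entropy; for continuous densities with a fixed mean and variance, the normal distribution has the highest entropy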
- Not always representative! Don’t mistake the map for the territory
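
The summary statistics above can be sketched with Python's standard library. This is a minimal illustration, not a reference implementation: the toy data, the variable names, the `entropy` helper, the choice of natural log, and the choice of population variance (`pvariance`) are all assumptions made for the example.

```python
import math
import statistics
from collections import Counter


def entropy(values):
    """Entropy (natural log) of the empirical distribution of `values`."""
    n = len(values)
    probs = [count / n for count in Counter(values).values()]
    # Only observed values appear in the Counter, so p > 0 and log(p) is defined.
    return sum(-p * math.log(p) for p in probs)


# Hypothetical toy data, made up for illustration.
colours = ["red", "blue", "red", "green", "red", "blue"]
heights = [150.0, 160.0, 165.0, 170.0, 200.0]

# Categorical feature: frequencies and mode.
frequencies = Counter(colours)
mode = statistics.mode(colours)

# Numerical feature, location: mean, median, quartiles.
mean = statistics.mean(heights)
median = statistics.median(heights)
quartiles = statistics.quantiles(heights, n=4)  # three cut points

# Numerical feature, spread: range, variance, interquartile range.
value_range = max(heights) - min(heights)
variance = statistics.pvariance(heights)  # population variance (assumed choice)
iqr = quartiles[2] - quartiles[0]

# Entropy: a constant column is perfectly predictable, a uniform one is not.
constant_entropy = entropy(["a", "a", "a", "a"])
uniform_entropy = entropy(["a", "b", "c", "d"])
```

Note that the mean (169) is pulled above the median (165) by the single large value 200, which is one way a summary statistic can fail to be representative.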
- Distance or similarities
- Hamming distance: number of positions at which the elements of two equal-length vectors aren’t equal
- Euclidean distance: how far apart are the vectors (square root of the sum of squared differences)
- Correlation
- Jaccard coefficient: set similarity, size of the intersection over size of the union (one minus this gives a distance)
- Edit distance: for strings, the minimum number of character changes needed to go from one to the other
- Distance in latent space
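
Several of the distances above are short enough to sketch directly. The function names below are hypothetical, and the edit distance is assumed to be the Levenshtein variant (insertions, deletions, and substitutions each cost 1); the notes do not say which variant is meant.

```python
import math


def hamming_distance(a, b):
    """Number of positions where two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def euclidean_distance(a, b):
    """Square root of the sum of squared element-wise differences."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def jaccard_coefficient(a, b):
    """Size of the intersection over size of the union of two sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention assumed here: two empty sets are identical
    return len(a & b) / len(a | b)


def edit_distance(s, t):
    """Levenshtein distance via dynamic programming, one row at a time."""
    prev = list(range(len(t) + 1))  # distance from "" to each prefix of t
    for i, cs in enumerate(s, start=1):
        curr = [i]  # distance from s[:i] to ""
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # delete cs
                            curr[j - 1] + 1,      # insert ct
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]
```

For example, `hamming_distance([1, 0, 1, 1], [1, 1, 1, 0])` counts two mismatched positions, and `1 - jaccard_coefficient(a, b)` turns the similarity into a distance.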
- Visualizations
- Basic line plots
- Matrix plot: visualize two features as an image
- Correlation plot
- Can add colour to show a third feature (usually categorical)
- Scatterplot