Machine learning and AI systems exclude the tail ends of distributions. Synthetic, generative, and federated models perform poorly on these cases, and for many industries the most interesting cases are the outliers (especially in medical AI).
Does “not trying to overfit” mean we perform badly on some groups?
- If your dataset is 99% "Group A," a model can do well on average by focusing only on Group A
- It may treat the other 1% as outliers
- Doing well at test time can therefore mean ignoring outliers and minority groups
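The 99%/1% point above can be made concrete with a minimal sketch (the group sizes and labels are illustrative assumptions, not from a real dataset): a "model" that ignores the minority group entirely still scores high overall accuracy.

```python
# Hypothetical dataset: 990 records from Group A (label 1), 10 from Group B (label 0).
data = [("A", 1)] * 990 + [("B", 0)] * 10  # (group, true_label)

def predict(group):
    # A lazy classifier that ignores the minority: it always predicts
    # the majority group's most common label.
    return 1

# Overall accuracy looks excellent on average...
correct = sum(predict(g) == y for g, y in data)
overall_acc = correct / len(data)

# ...but accuracy on Group B alone is zero.
group_b = [(g, y) for g, y in data if g == "B"]
b_acc = sum(predict(g) == y for g, y in group_b) / len(group_b)

print(overall_acc)  # 0.99
print(b_acc)        # 0.0
```

This is why aggregate test-set accuracy alone can hide a model that has effectively written off a minority group; per-group evaluation exposes it immediately.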
Should data and information be contextualized all the time?
- Context matters especially for historical data: knowing why certain decisions were made is essential
- We want data to be anonymized to a certain extent. Exposing patient data, for example, is a huge risk.
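One common way to anonymize "to a certain extent" is pseudonymization: replace direct identifiers with opaque tokens while keeping records linkable. A minimal sketch, assuming hypothetical patient records with illustrative field names:

```python
import hashlib

# Hypothetical records; the field names here are illustrative assumptions.
records = [
    {"patient_id": "P-1001", "name": "Jane Doe", "diagnosis": "asthma"},
    {"patient_id": "P-1002", "name": "John Roe", "diagnosis": "diabetes"},
]

SALT = b"replace-with-a-secret-salt"  # must be kept out of the released dataset

def pseudonymize(record):
    # Replace direct identifiers with a salted hash so the same patient
    # maps to the same token across the dataset, without revealing identity.
    token = hashlib.sha256(SALT + record["patient_id"].encode()).hexdigest()[:12]
    return {"patient_token": token, "diagnosis": record["diagnosis"]}

released = [pseudonymize(r) for r in records]
for row in released:
    print(row)
```

Note this is not full anonymization: if the salt leaks, or if quasi-identifiers (age, zip code, rare diagnoses) remain, re-identification is still possible, which is exactly why the question of what context to keep is hard.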
How do we choose what context to include and what not to include?