Machine learning and AI systems exclude the tail ends of distributions. Synthetic, generative, and federated models perform poorly on these cases, and for many industries the most interesting cases are the outliers (especially in medical AI).
Does “not trying to overfit” mean we perform badly on some groups?
- If your dataset is 99% "Group A," a model can do well on average by focusing only on Group A
- It may treat the other 1% as outliers
- Doing well at test time can therefore mean ignoring outliers and minority groups
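The 99%/1% point above can be made concrete with a minimal sketch (the group sizes and labels are illustrative assumptions, not from a real dataset): a "model" that ignores the minority group entirely still scores high overall accuracy.

```python
# Hypothetical dataset: 990 records from Group A (label 1), 10 from Group B (label 0).
data = [("A", 1)] * 990 + [("B", 0)] * 10  # (group, true_label)

def predict(group):
    # A lazy classifier that ignores the minority: it always predicts
    # the majority group's most common label.
    return 1

# Overall accuracy looks excellent on average...
correct = sum(predict(g) == y for g, y in data)
overall_acc = correct / len(data)

# ...but accuracy on Group B alone is zero.
group_b = [(g, y) for g, y in data if g == "B"]
b_acc = sum(predict(g) == y for g, y in group_b) / len(group_b)

print(overall_acc)  # 0.99
print(b_acc)        # 0.0
```

This is why aggregate test-set accuracy alone can hide a model that has effectively written off a minority group; per-group evaluation exposes it immediately.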
Should data and information be contextualized all the time?
- Context matters especially for historical data: knowing why certain decisions were made is essential
- We want data to be anonymized to a certain extent. Exposing patient data, for example, is a huge risk.
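One common way to anonymize "to a certain extent" is pseudonymization: replace direct identifiers with opaque tokens while keeping records linkable. A minimal sketch, assuming hypothetical patient records with illustrative field names:

```python
import hashlib

# Hypothetical records; the field names here are illustrative assumptions.
records = [
    {"patient_id": "P-1001", "name": "Jane Doe", "diagnosis": "asthma"},
    {"patient_id": "P-1002", "name": "John Roe", "diagnosis": "diabetes"},
]

SALT = b"replace-with-a-secret-salt"  # must be kept out of the released dataset

def pseudonymize(record):
    # Replace direct identifiers with a salted hash so the same patient
    # maps to the same token across the dataset, without revealing identity.
    token = hashlib.sha256(SALT + record["patient_id"].encode()).hexdigest()[:12]
    return {"patient_token": token, "diagnosis": record["diagnosis"]}

released = [pseudonymize(r) for r in records]
for row in released:
    print(row)
```

Note this is not full anonymization: if the salt leaks, or if quasi-identifiers (age, zip code, rare diagnoses) remain, re-identification is still possible, which is exactly why the question of what context to keep is hard.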
How do we choose what context to include and what not to include?