Data mining

Data mining generates new information by combining facts found across multiple transactions. It can also be used to predict future events.

Typical steps of data mining

Learn about the application
Identify data mining task
Collect data
Clean and preprocess the data
Transform data or select useful subsets
Choose data mining algorithm
Data mining: run the chosen algorithm to search for patterns
Evaluate, visualize, and interpret results
Use results for profit or other goals

In a table:

a row is an example or sample
a column is a feature

Feature types:

Categorical
- binary
- nominal: name-like
Numerical (counts, ordinal, continuous)
- Allows us to interpret examples as points in feature space
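
To make the "points in feature space" idea concrete, here is a minimal sketch. The table contents and the helper name `distance_between` are made up for illustration. Each row is an example, each column is a numerical feature, and two examples can be compared by the distance between their points.

```python
import math

# Hypothetical table: each row is an example, each column a numerical feature.
table = [
    (1.0, 2.0),  # example 0
    (4.0, 6.0),  # example 1
    (1.0, 3.0),  # example 2
]

def distance_between(table, i, j):
    """Euclidean distance between examples i and j, treated as points."""
    return math.dist(table[i], table[j])

print(distance_between(table, 0, 1))  # 5.0
print(distance_between(table, 0, 2))  # 1.0
```

Under this view, example 2 is "closer" to example 0 than example 1 is. Many data mining methods build on this kind of comparison.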

Ways to approximate other data with numerical features

Text:
- Bag of words: word counts
Images: gray-scale intensity
Graphs: adjacency matrix
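
A rough sketch of two of these encodings, in plain Python. The function names, the whitespace tokenizer, and the choice of an undirected graph are assumptions for illustration, not a standard API.

```python
from collections import Counter

def bag_of_words(text, vocabulary):
    """Count how often each vocabulary word occurs in `text`.

    Assumes a naive tokenizer: lowercase, split on whitespace.
    """
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]

def adjacency_matrix(num_nodes, edges):
    """Build a 0/1 matrix for an undirected graph given as (u, v) pairs."""
    matrix = [[0] * num_nodes for _ in range(num_nodes)]
    for u, v in edges:
        matrix[u][v] = 1
        matrix[v][u] = 1  # undirected: mirror the edge
    return matrix

vocab = ["data", "mining", "is", "fun"]
print(bag_of_words("Data mining is data", vocab))  # [2, 1, 1, 0]
print(adjacency_matrix(3, [(0, 1), (1, 2)]))       # [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
```

Both results are lists of numbers, so a document or a graph can be fed to methods that expect numerical features. Word order is lost in the bag-of-words encoding, which is why it is only an approximation.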

Data is not clean when it is

duplicated
missing
full of outliers
noisy
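
A minimal cleaning sketch covering three of these problems. The function `clean`, the sample rows, and the outlier rule are illustrative assumptions: a row is dropped when its last column is more than `max_mads` median absolute deviations from the median. Real pipelines often impute missing values instead of dropping rows, and noise is not handled here at all.

```python
import statistics

def clean(rows, max_mads=5.0):
    """Drop rows with missing values, exact duplicates, and outliers.

    Assumes each row is a tuple whose last column is numerical.
    """
    # 1. Missing: drop any row containing None.
    complete = [r for r in rows if all(v is not None for v in r)]

    # 2. Duplicated: keep only the first copy of each row.
    seen, unique = set(), []
    for r in complete:
        if r not in seen:
            seen.add(r)
            unique.append(r)

    # 3. Outliers: compare the last column to the median, scaled by the MAD.
    values = [r[-1] for r in unique]
    if not values:
        return unique
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        return unique  # no spread to measure against; keep everything
    return [r for r in unique if abs(r[-1] - med) <= max_mads * mad]

rows = [("a", 1.0), ("a", 1.0), ("b", None), ("c", 1.2), ("d", 0.9), ("e", 50.0)]
print(clean(rows))  # [('a', 1.0), ('c', 1.2), ('d', 0.9)]
```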

Coupon collector problem: you generally need $O(n \log n)$ samples to see all $n$ possible values when each value is equally likely
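
The bound can be checked with a small simulation. For equally likely values the expected number of samples is $n H_n$, where $H_n = 1 + 1/2 + \dots + 1/n$ is the $n$-th harmonic number, and this grows like $n \log n$. The sketch below is only an illustration. The function names, the seed, and the trial count are arbitrary choices.

```python
import random

def expected_draws(n):
    """Exact expectation n * H_n for n equally likely values."""
    return n * sum(1 / k for k in range(1, n + 1))

def simulate_draws(n, rng):
    """Draw uniformly from n values until every value has been seen."""
    seen = set()
    draws = 0
    while len(seen) < n:
        seen.add(rng.randrange(n))
        draws += 1
    return draws

rng = random.Random(0)  # fixed seed so the run is repeatable
trials = 10_000
average = sum(simulate_draws(6, rng) for _ in range(trials)) / trials

# For a six-sided die the exact expectation is about 14.7 rolls;
# the simulated average should land close to that.
print(expected_draws(6), average)
```

The practical takeaway for data collection is that seeing every category at least once takes noticeably more than $n$ samples, and rarer categories make it worse.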
