Data mining is a way to generate new information by combining facts found in multiple transactions, and it can also be a way to predict future events.
Typical steps of data mining
- Learn about the application
- Identify data mining task
- Collect data
- Clean and preprocess the data
- Transform data or select useful subsets
- Choose data mining algorithm
- Run the data mining algorithm
- Evaluate, visualize, and interpret results
- Use results for profit or other goals
In a table
- a row is an example or sample
- a column is a feature
Feature types:
- Categorical
  - binary
  - nominal: name-like
- Numerical (counts, ordinal, continuous)
  - Allows us to interpret examples as points in feature space
Ways to approximate other data with numerical features
- Text:
  - Bag of words: word counts
- Images: gray-scale intensity
- Graphs: adjacency matrix
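A minimal sketch of two of these representations, using only the Python standard library. The function names, the fixed vocabulary, and the whitespace tokenization are assumptions made for illustration; real text pipelines usually handle punctuation and build the vocabulary from the data.

```python
from collections import Counter


def bag_of_words(text, vocabulary):
    """Return one word count per vocabulary entry (lowercased, split on whitespace)."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]


def adjacency_matrix(num_nodes, edges):
    """Return the adjacency matrix of an undirected graph as a list of lists."""
    matrix = [[0] * num_nodes for _ in range(num_nodes)]
    for i, j in edges:
        matrix[i][j] = 1
        matrix[j][i] = 1  # undirected: mirror each edge
    return matrix


# Hypothetical example inputs.
vocabulary = ["data", "mining", "is", "fun"]
bow = bag_of_words("Data mining is mining data", vocabulary)
adj = adjacency_matrix(3, [(0, 1), (1, 2)])

print(bow)
print(adj)
```

Each text becomes a row of counts and each graph becomes a grid of 0/1 entries, so both can be treated as numerical features.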
Data is not clean when it is
- duplicated
- missing
- full of outliers
- noisy
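A rough sketch of how the first three problems might be handled on a small table of (id, value) rows. The row layout, the function name, and the outlier rule (more than 5 median absolute deviations from the median) are assumptions chosen for illustration, not a standard recipe; noise is not addressed here.

```python
import statistics


def clean_rows(rows, mad_factor=5.0):
    """Drop duplicate rows, rows with a missing value, and outlier rows."""
    # 1. Remove exact duplicates while keeping the original order.
    unique = list(dict.fromkeys(rows))

    # 2. Remove rows whose value is missing (None).
    present = [row for row in unique if row[1] is not None]

    # 3. Remove outliers: values far from the median, measured in units
    #    of the median absolute deviation (MAD). This rule is an assumption.
    values = [value for _, value in present]
    median = statistics.median(values)
    mad = statistics.median(abs(v - median) for v in values)
    if mad == 0:
        return present  # no spread to measure against; keep everything
    return [row for row in present if abs(row[1] - median) <= mad_factor * mad]


# Hypothetical example: one duplicate, one missing value, one outlier.
rows = [(1, 5.0), (2, 5.2), (2, 5.2), (3, None), (4, 4.9), (5, 5.1), (6, 50.0)]
cleaned = clean_rows(rows)
print(cleaned)
```

Whether to drop, repair, or keep such rows depends on the application, so the thresholds above would need to be chosen per dataset.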
Coupon collector problem: when n possible values are equally likely, you generally need to see about n log n samples before you have seen all n values.
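For n equally likely values, the expected number of samples needed to see every value is n times the n-th harmonic number, n * (1 + 1/2 + ... + 1/n), which grows like n ln n. The sketch below computes that value and checks it against a simulation; the function names, the seed, and the trial count are arbitrary choices for illustration.

```python
import random


def expected_draws(n):
    """Exact expected number of draws to see all n equally likely values."""
    return n * sum(1.0 / k for k in range(1, n + 1))


def simulate_draws(n, rng):
    """Draw uniformly at random until all n values have appeared; return the draw count."""
    seen = set()
    draws = 0
    while len(seen) < n:
        seen.add(rng.randrange(n))
        draws += 1
    return draws


rng = random.Random(0)  # fixed seed so the run is repeatable
n = 6                   # e.g. the faces of a fair die
trials = 20000
average = sum(simulate_draws(n, rng) for _ in range(trials)) / trials

print(expected_draws(n))  # exact expectation for n = 6
print(average)            # simulated average, should be close to it
```

For a fair six-sided die, this says that seeing every face takes about 14.7 rolls on average, noticeably more than 6.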