# Decision Tree
A decision tree is a simple program consisting of if-else decisions (decision stumps) based on the features.
- We can create a set of decision stumps and define a "score" for each possible rule.
- An intuitive score is classification accuracy: "if we use this rule, how many examples do we label correctly?"
- We can build a tree using greedy recursive splitting: fitting a sequence of stumps, one per internal node.
- However, accuracy isn't a perfect score. Sometimes a split doesn't immediately improve accuracy, even though it enables better splits deeper in the tree. We can get around this by instead selecting the feature test that maximizes information gain.
- The entropy of a set $S$ of data samples is defined as $H(S) = -\sum_{c \in C} p(c)\log p(c)$, where $C$ is the set of classes represented in $S$ and $p(c)$ is the empirical proportion of class $c$ in $S$.
- Generally, select the feature test that maximizes the information gain: $IG = H(S) - \sum_{i \in \text{children}} \frac{|S^i|}{|S|} H(S^i)$
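The entropy and information-gain formulas above can be sketched in a few lines of NumPy. This is a minimal illustration (the function names and the binary-split interface are my own choices, not from the notes):

```python
import numpy as np

def entropy(labels):
    # H(S) = -sum_c p(c) log2 p(c), using the empirical class distribution
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(labels, left_mask):
    # IG = H(S) - sum_i (|S^i|/|S|) H(S^i), for a binary split
    # given by a boolean mask selecting the left child.
    n = len(labels)
    left, right = labels[left_mask], labels[~left_mask]
    return (entropy(labels)
            - (len(left) / n) * entropy(left)
            - (len(right) / n) * entropy(right))

# A perfect split of a balanced binary problem gains one full bit:
y = np.array([1, 1, 1, 0, 0, 0])
mask = np.array([True, True, True, False, False, False])
print(information_gain(y, mask))  # 1.0
```

Using base-2 logarithms measures entropy in bits; any base works as long as it is used consistently, since only the ranking of candidate splits matters.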


# Tradeoffs
Advantages:
- Easy to implement
- Interpretable
- Learning is fast; prediction is very fast
- Can elegantly handle a small number of missing values during training

Disadvantages:
- Hard to find the optimal set of rules
- Greedy splitting is often not accurate and can require very deep trees
- Each root-to-leaf path can only express 'and' relations between tests, not 'or'
# Pseudocode for choosing decision stumps
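A sketch of the stump-selection step in Python, following the information-gain rule above. This is one possible implementation, not a prescribed one; `best_stump` exhaustively scores every (feature, threshold) rule of the form $x_j \le t$ and returns the highest-gain split:

```python
import numpy as np

def entropy(labels):
    # H(S) = -sum_c p(c) log2 p(c)
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def best_stump(X, y):
    # Score every candidate rule "x_j <= t" by information gain
    # and return (feature index, threshold, gain) of the best one.
    n, d = X.shape
    best = (None, None, -np.inf)
    for j in range(d):
        for t in np.unique(X[:, j]):
            mask = X[:, j] <= t
            if mask.all() or not mask.any():
                continue  # a split must produce two non-empty children
            gain = (entropy(y)
                    - (mask.sum() / n) * entropy(y[mask])
                    - ((~mask).sum() / n) * entropy(y[~mask]))
            if gain > best[2]:
                best = (j, t, gain)
    return best

X = np.array([[1, 5], [2, 6], [3, 1], [4, 2]])
y = np.array([0, 0, 1, 1])
print(best_stump(X, y))  # the rule "x_0 <= 2" separates the classes perfectly
```

Greedy recursive splitting then calls `best_stump` on each child's subset of the data, stopping when a node is pure or no split improves the gain.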

