Uses regularized regression trees. These are like decision trees where each split is

- based on a single feature
- each leaf gives a real-valued prediction

## Fitting a regression tree

- Train: set each weight $w_{L}$ at leaf $L$ by minimizing squared error $∑_{i=1}(w_{L_{i}}−y_{i})_{2}$
- We use using greedy recursive splitting for growing the tree

- Prediction: At each leaf, the prediction $y^ _{i}$ is the mean of all $y_{i}$ that fall under that leaf node

## Ensemble and Boosting

We create a series of trees that are trained on the residual of the previous tree. That is, the first tree is trained on the actual dataset. The second tree is trained on the residuals of the prediction of the first tree, and so on.

`tree[0] = fit(X,y)`

`y_hat = tree[0].predict(X)`

`tree[1] = fit(X,y - y_hat)`

- `y_hat = y_hat + tree[1].predict(X)
`tree[2] = fit(X,y - yhat)`

- `y_hat = y_hat + tree[2].predict(X)
- …

## Regularization

As long as not all $w_{L}=0$, each tree decreases training error. However, it may overfit if trees are too deep or you have too many trees

We can apply L0-regularization (stop splitting if $w_{L}=0$) and L2-regularization to discourage this.