Decision Trees
Regression and Classification Trees
Assume we are trying to predict the salary of a baseball player based on two characteristics: the number of years they have played in the major leagues and the number of hits they had last season. A possible tree that we might end up with can be seen in the figure below. The tree is split in two places: Years<4.5 and Hits<117.5. These are the tree's internal nodes, and the split conditions Years<4.5 and Hits<117.5 are called splitting rules. The regions at the bottom of the tree, where predictions are made, are its terminal nodes (or leaves).
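To make this concrete, here is a minimal sketch of fitting such a regression tree with scikit-learn. The data is synthetic (the real Hitters dataset is not reproduced here), and the feature names Years and Hits simply mirror the example above; the split thresholds the tree learns will depend on the data.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

# Synthetic stand-in for the baseball data: Years in the majors,
# Hits last season, and a Salary that loosely depends on both.
rng = np.random.default_rng(0)
n = 200
years = rng.integers(1, 20, size=n)
hits = rng.integers(40, 200, size=n)
salary = 50 + 20 * years + 0.5 * hits + rng.normal(0, 10, size=n)

X = np.column_stack([years, hits])
tree = DecisionTreeRegressor(max_depth=2).fit(X, salary)

# Print the learned splitting rules, e.g. "Years <= 4.5"
print(export_text(tree, feature_names=["Years", "Hits"]))
```

Restricting the depth to 2 keeps the tree small enough to read off its internal nodes and terminal nodes directly, which is exactly the interpretability that makes trees appealing.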

Pruning
An important question to ask when building a tree is when to stop. In principle, a tree can grow a terminal node for every single data point in the training set and achieve 100% training accuracy; such a tree will almost certainly overfit the training set. Pruning mitigates this problem by removing branches, thereby decreasing the number of terminal nodes. Naturally, we should remove branches such that the error rate increases as little as possible. One method that accomplishes this is called cost complexity pruning, which minimizes an augmented loss function, $$\sum_{m=1}^{|T|} \sum_{x_i \in R_m} (y_i - \hat{y}_{R_m})^2 + \alpha |T|$$ where \(|T|\) is the number of terminal nodes in the tree, \(R_m\) is the region indexed by \(m\), \(y_i\) are the training observations, \(\hat{y}_{R_m}\) is the estimate for region \(R_m\), and \(\alpha\) is a non-negative tuning parameter. The additional \(\alpha|T|\) term penalizes trees with many terminal nodes, so a very deep tree incurs a higher loss even though it may have better prediction accuracy on the training set. Since \(\alpha\) is not fixed, it is typically chosen by cross-validation: increasing \(\alpha\) over a grid of values yields a nested sequence of successively shallower subtrees, and cross-validation on this sequence determines which subtree (and hence which value of \(\alpha\)) is optimal. See Figure 3 for an example of pruning.
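The sketch below shows one way to carry this out, assuming scikit-learn's implementation of cost complexity pruning (the ccp_alpha parameter and cost_complexity_pruning_path method); again the data is synthetic and only stands in for a real training set.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data, same shape as the Years/Hits example.
rng = np.random.default_rng(0)
X = rng.random((200, 2))
y = 5 * X[:, 0] + 2 * X[:, 1] + rng.normal(0, 0.5, 200)

# Weakest-link pruning yields one candidate alpha per subtree in the
# nested sequence of successively shallower trees.
path = DecisionTreeRegressor().cost_complexity_pruning_path(X, y)

# Cross-validate each subtree (i.e., each alpha) and keep the best one.
scores = [
    cross_val_score(DecisionTreeRegressor(ccp_alpha=a), X, y, cv=5).mean()
    for a in path.ccp_alphas
]
best_alpha = path.ccp_alphas[int(np.argmax(scores))]
pruned_tree = DecisionTreeRegressor(ccp_alpha=best_alpha).fit(X, y)
print(f"best alpha: {best_alpha:.4f}, leaves: {pruned_tree.get_n_leaves()}")
```

Note that cost_complexity_pruning_path returns the full candidate sequence at once, so there is no need to hand-pick a grid of \(\alpha\) values: each returned alpha corresponds to exactly one subtree in the pruning sequence.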
Pros vs. Cons of Trees
As mentioned in the beginning, trees are easy to interpret. Furthermore, the way trees make predictions is thought to mirror human decision-making. Trees can also be displayed graphically, a desirable trait for presentations in real-world applications. Despite these advantages, trees suffer from lower prediction accuracy, high variance, and a lack of robustness: a small change in the data can lead to a big change in the resulting tree. See this article if you are interested in learning about extensions to tree-based methods which address these problems.