Statistical Learning, Section 2 of 6

Tree Methods

Tree-based methods are powerful, non-linear machine learning techniques widely used for their ability to capture complex interactions and non-linear relationships in data, which are often missed by traditional linear models.

I. Decision Trees (Single Trees)

A Decision Tree partitions the feature space into a set of non-overlapping regions. For any given observation, the prediction is the mean of the response values (for regression) or the most frequent class (for classification) of the training observations that fall into that region.
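The region-based prediction rule can be sketched in a few lines of plain Python. This is a hand-built, single-split regression "tree" on made-up data (the split point x < 5 is chosen by hand for illustration, not learned):

```python
# Hypothetical 1-D regression data; the split partitions the feature
# space into two non-overlapping regions, {x < 5} and {x >= 5}.
X = [1, 2, 3, 6, 7, 8]
y = [1.0, 1.2, 0.8, 4.0, 4.4, 3.6]

split = 5

left = [yi for xi, yi in zip(X, y) if xi < split]
right = [yi for xi, yi in zip(X, y) if xi >= split]

def predict(x):
    """Predict the mean response of the region that x falls into."""
    region = left if x < split else right
    return sum(region) / len(region)

print(predict(2))  # mean response of the left region
print(predict(7))  # mean response of the right region
```

A fitted tree does the same thing, except the split points are chosen by the splitting criteria described below rather than by hand.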

Splitting Criteria

The process of building a tree involves recursively splitting the data based on the feature and split point that maximizes the "purity" of the resulting nodes.

| Task | Splitting Criterion (Impurity Measure) | Goal |
| --- | --- | --- |
| Classification | Gini Index or Entropy (Information Gain) | Maximize the reduction in impurity (heterogeneity) of the classes within the resulting nodes. |
| Regression | Residual Sum of Squares (RSS) or Mean Squared Error (MSE) | Minimize the variance of the response variable within the resulting nodes. |
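As a quick sanity check on the classification criterion, the Gini index of a node is the sum over classes of p_k(1 − p_k), where p_k is the class proportion in that node. A minimal pure-Python sketch:

```python
def gini(labels):
    """Gini index of a node: sum over classes of p_k * (1 - p_k)."""
    n = len(labels)
    return sum((labels.count(c) / n) * (1 - labels.count(c) / n)
               for c in set(labels))

# A pure node has impurity 0; an evenly mixed two-class node
# attains the two-class maximum of 0.5.
print(gini(["A", "A", "A", "A"]))  # 0.0
print(gini(["A", "A", "B", "B"]))  # 0.5
```

A split is scored by the weighted average impurity of its child nodes; the tree greedily picks the feature and cut point that reduce this the most.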

Advantages and Disadvantages

  • Pros: Easy to interpret (white-box model), can handle non-linear relationships, and naturally handles categorical predictors.
  • Cons: High variance (small changes in data can lead to a very different tree), prone to overfitting, and generally lower predictive accuracy than ensemble methods.

II. Ensemble Methods (Reducing Variance and Bias)

Ensemble methods combine multiple individual decision trees to improve overall predictive performance and robustness.

1. Bagging (Bootstrap Aggregating)

Bagging is a general-purpose procedure for reducing the variance of a statistical learning method.

  • Mechanism:
    1. Generate B bootstrap samples (sampling with replacement) from the original training data.
    2. Train a full, unpruned decision tree on each bootstrap sample.
    3. Aggregate the predictions: average the predictions (regression) or take a majority vote (classification).
  • Out-of-Bag (OOB) Error: Since each tree is trained on only about 2/3 of the data, the remaining 1/3 (the OOB observations) can be used as a validation set to estimate the test error without the need for cross-validation.
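The bagging procedure and its OOB error estimate can be sketched with scikit-learn (class and parameter names assume a recent sklearn version; the data here is synthetic):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier

# Synthetic classification data for illustration.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# The default base learner is a decision tree; oob_score=True scores each
# observation using only the trees whose bootstrap sample excluded it
# (the ~1/3 out-of-bag observations).
bag = BaggingClassifier(n_estimators=100, oob_score=True, random_state=0)
bag.fit(X, y)

print(bag.oob_score_)  # OOB estimate of test accuracy, no CV needed
```

The OOB score is essentially free: it reuses the training pass rather than requiring a separate cross-validation loop.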

2. Random Forests

Random Forests are an improvement over bagging that aims to decorrelate the trees, further reducing variance.

  • Mechanism:
    1. Use the bagging procedure (bootstrap samples).
    2. At each split in the tree-building process, only a random subset of m predictors is considered as split candidates, where m ≪ p (the total number of predictors).
  • Hyperparameter m: Typically set to √p for classification and p/3 for regression. By forcing the algorithm to ignore the strongest predictor at many splits, the resulting trees are less correlated, leading to a greater reduction in variance when averaged.
  • Feature Importance: Random Forests provide a robust measure of Variable Importance by calculating the total decrease in node impurity (e.g., Gini index) averaged over all trees.
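A minimal Random Forest sketch in scikit-learn, again on synthetic data (parameter names assume a recent sklearn version):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: 12 predictors, of which 4 are informative.
X, y = make_classification(n_samples=500, n_features=12, n_informative=4,
                           random_state=0)

# max_features="sqrt" considers ~sqrt(p) candidate predictors at each
# split, which is what decorrelates the trees relative to plain bagging.
rf = RandomForestClassifier(n_estimators=200, max_features="sqrt",
                            random_state=0)
rf.fit(X, y)

# Impurity-based variable importance, averaged over all trees;
# sklearn normalizes these so they sum to 1.
print(rf.feature_importances_)
```

Setting max_features=None would recover plain bagging, since every split could then use the strongest predictor.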

3. Boosting

Boosting is an ensemble technique that focuses on sequentially building trees to reduce bias.

  • Mechanism:
    1. Start with a simple model (e.g., a single tree).
    2. Sequentially fit new trees to the residuals (or, in gradient boosting, the pseudo-residuals) of the previous step. Each new tree attempts to correct the errors of the current ensemble.
    3. Each new tree's contribution is scaled by a small learning rate λ to slow down the learning process, which improves generalization.
  • Key Algorithms: AdaBoost (Adaptive Boosting) and Gradient Boosting Machines (GBM), including modern implementations like XGBoost and LightGBM.
  • Tradeoff: Boosting generally achieves higher predictive accuracy than bagging/Random Forests but is more prone to overfitting if the learning rate is too high or the number of trees is too large.
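The sequential mechanism above can be sketched with scikit-learn's GradientBoostingRegressor on synthetic data (parameter names assume a recent sklearn version):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic regression data for illustration.
X, y = make_regression(n_samples=400, n_features=8, noise=5.0,
                       random_state=0)

# learning_rate is the shrinkage lambda: each tree's contribution is
# scaled by it, so small values need more trees but generalize better.
gbm = GradientBoostingRegressor(n_estimators=300, learning_rate=0.05,
                                max_depth=3, random_state=0)
gbm.fit(X, y)

# In-sample R^2; a high value here does not guarantee test performance,
# which is exactly the overfitting tradeoff noted above.
print(gbm.score(X, y))
```

In practice the learning rate and number of trees are tuned jointly (e.g. via a held-out set or early stopping), since lowering one typically requires raising the other.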
