1. Introduction
Adaptive Boosting (AdaBoost) is a boosting algorithm proposed by Yoav Freund and Robert Schapire.
2. Algorithm
Input: a training dataset $\{(x_i, y_i)\}_{i=1}^n$ with labels $y_i \in \{-1, +1\}$.
Step 1: Initialize the weights $w_{1,1}, \dots, w_{n,1}$ to the same value $\frac{1}{n}$.
Step 2: For $m=1$ to $M$ (the maximum number of stumps):
(A) Fit a stump $h_m$ on the feature that yields the least total error $\epsilon_m$, where:
\begin{equation} \epsilon_m = \sum_{i \,:\, h_m(x_i) \neq y_i} w_{i,m} \end{equation}
(B) Calculate the influence of this stump $h_m$ on the final model:
\begin{equation} \alpha_m = \frac{1}{2} \log\left(\frac{1-\epsilon_m}{\epsilon_m}\right) \end{equation}
(C) Update the weight of each sample depending on whether it was correctly or incorrectly classified by $h_m$:
\begin{equation} w_{i,m+1} = w_{i,m} \cdot e^{\alpha_m} \quad \text{if } h_m(x_i) \neq y_i \text{ (incorrectly classified)} \end{equation}
\begin{equation} w_{i,m+1} = w_{i,m} \cdot e^{-\alpha_m} \quad \text{if } h_m(x_i) = y_i \text{ (correctly classified)} \end{equation}
(D) Normalize the sample weights such that $\sum_{i=1}^n w_{i,m+1} = 1$.
Output: The final model:
\begin{equation} F(x) = \operatorname{sign}\left(\sum_{m=1}^M \alpha_m h_m(x)\right) \end{equation}
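As a concrete illustration of the steps above, here is a minimal NumPy sketch, assuming binary labels $y_i \in \{-1, +1\}$ and single-feature threshold stumps; the names `fit_stump`, `adaboost_fit`, and `adaboost_predict` are illustrative, not taken from any library.

```python
import numpy as np

def fit_stump(X, y, w):
    """Step 2(A): pick the feature, threshold, and polarity with the least weighted error."""
    n, d = X.shape
    best = None  # (error, feature, threshold, polarity)
    for j in range(d):
        for thr in np.unique(X[:, j]):
            for polarity in (1, -1):
                pred = np.where(polarity * (X[:, j] - thr) >= 0, 1, -1)
                err = w[pred != y].sum()              # epsilon_m: sum of misclassified weights
                if best is None or err < best[0]:
                    best = (err, j, thr, polarity)
    return best

def stump_predict(X, j, thr, polarity):
    return np.where(polarity * (X[:, j] - thr) >= 0, 1, -1)

def adaboost_fit(X, y, M=50):
    n = X.shape[0]
    w = np.full(n, 1.0 / n)                           # Step 1: equal weights 1/n
    stumps, alphas = [], []
    for m in range(M):
        err, j, thr, polarity = fit_stump(X, y, w)    # Step 2(A)
        err = min(max(err, 1e-10), 1 - 1e-10)         # guard against division by zero
        alpha = 0.5 * np.log((1 - err) / err)         # Step 2(B): amount of say
        pred = stump_predict(X, j, thr, polarity)
        w = w * np.exp(-alpha * y * pred)             # Step 2(C): e^{+alpha} if wrong, e^{-alpha} if right
        w = w / w.sum()                               # Step 2(D): normalize so the weights sum to 1
        stumps.append((j, thr, polarity))
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(X, stumps, alphas):
    """Output: sign of the alpha-weighted vote of all stumps."""
    scores = np.zeros(X.shape[0])
    for (j, thr, polarity), alpha in zip(stumps, alphas):
        scores += alpha * stump_predict(X, j, thr, polarity)
    return np.sign(scores)
```

For instance, `stumps, alphas = adaboost_fit(X, y, M=10)` followed by `adaboost_predict(X_new, stumps, alphas)` returns the $\pm 1$ predictions of the final model.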
3. Algorithm Exposition
3.1. Step 1
At the beginning of the algorithm, we initialize an equal weight of $\frac{1}{n}$ for every training sample.
Later, we will adjust these weights such that incorrectly classified samples receive higher weights while correctly classified samples receive lower weights.
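For example, with $n = 5$ training samples, every weight starts at $w_{i,1} = \frac{1}{5} = 0.2$.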
3.2. Step 2
(A)
In this step, we build a candidate stump for each feature in the training set and choose the stump that yields the least total error. The calculation of the total error $\epsilon_m$ is straightforward: it is simply the sum of the weights of all incorrectly classified training samples.
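For example, suppose the chosen stump misclassifies $2$ of $10$ training samples, each currently weighted $0.1$: then $\epsilon_m = 0.1 + 0.1 = 0.2$.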
(B)
The total error $\epsilon_m$ is used to compute the “amount of say”, i.e. how much the current stump $h_m$ influences the final model.
The calculation of $\alpha_m$ is also straightforward: the smaller the total error, the larger the amount of say, and a stump with $\epsilon_m = 0.5$ (no better than random guessing) gets $\alpha_m = 0$.
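Continuing the example from (A), $\epsilon_m = 0.2$ gives $\alpha_m = \frac{1}{2}\log\left(\frac{1-0.2}{0.2}\right) = \frac{1}{2}\log 4 \approx 0.69$ (using the natural logarithm).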
(C)
We adjust the weight of each training sample depending on whether $h_m$ classified it correctly.
As said above, we update the weights such that we put a greater emphasis on incorrectly classified samples and a lesser emphasis on correctly classified samples.
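In the running example, $e^{\alpha_m} \approx e^{0.69} \approx 2$ and $e^{-\alpha_m} \approx 0.5$, so each of the $2$ misclassified weights grows from $0.1$ to about $0.2$, while each of the $8$ correctly classified weights shrinks from $0.1$ to about $0.05$.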
(D)
After updating the weights, we have to normalize them such that the sum of all weights equals $1$.
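In the running example, the updated weights sum to $2 \cdot 0.2 + 8 \cdot 0.05 = 0.8$, so dividing every weight by $0.8$ yields $0.25$ for each misclassified sample and $0.0625$ for each correctly classified one, and the weights again sum to $1$.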
At the end of step (D), we repeat Step 2 again for $h_{m+1}$.
Output
So how does the final model $F(x)$ produce its output? Specifically, the final model takes into account the prediction of each stump as well as its amount of say. For example, suppose there are $12$ stumps that predict YES with a total amount of say of $34$, while there are $15$ stumps that predict NO with a total amount of say of $21$. The output of the final model will be YES.
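Below is a minimal sketch of this weighted vote, assuming each stump's prediction is encoded as $+1$ (YES) or $-1$ (NO); `final_vote`, `alphas`, and `preds` are illustrative names, not library functions.

```python
import numpy as np

def final_vote(alphas, preds):
    # F(x) = sign(sum_m alpha_m * h_m(x)): the sign tells us whether the total
    # amount of say behind +1 (YES) or behind -1 (NO) is larger.
    return np.sign(np.dot(alphas, preds))

# The YES side carries a total amount of say of 34 and the NO side 21,
# so the final model outputs +1 (YES).
alphas = np.array([34.0, 21.0])
preds = np.array([+1, -1])
print(final_vote(alphas, preds))  # 1.0
```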