1. Introduction
Adaptive Boosting (AdaBoost) is a boosting algorithm proposed by Yoav Freund and Robert Schapire.
2. Algorithm
Input: a training dataset $\{(x_i, y_i)\}_{i=1}^n$ with labels $y_i \in \{-1, +1\}$.
Step 1: Initialize the weights $w_{1,1}, \dots, w_{n,1}$ to the same value $\frac{1}{n}$.
Step 2: For $m=1$ to $M$ (the maximum number of stumps):
(A) Fit a stump $h_m$ on the feature that yields the least total error $\epsilon_m$, where:
\begin{equation} \epsilon_m = \sum_{i \,:\, h_m(x_i) \neq y_i} w_{i,m} \end{equation}
(B) Calculate the influence of this stump $h_m$ on the final model:
\begin{equation} \alpha_m = \frac{1}{2} \log\left(\frac{1-\epsilon_m}{\epsilon_m}\right) \end{equation}
(C) Update the weight of each sample depending on whether it was correctly or incorrectly classified by $h_m$:
\begin{equation} w_{i,m+1} = w_{i,m} \cdot e^{\alpha_m} \quad \text{if } h_m(x_i) \neq y_i \text{ (incorrectly classified)} \end{equation}
\begin{equation} w_{i,m+1} = w_{i,m} \cdot e^{-\alpha_m} \quad \text{if } h_m(x_i) = y_i \text{ (correctly classified)} \end{equation}
(D) Normalize the sample weights such that $\sum_{i=1}^n w_{i,m+1} = 1$.
Output: The final model:
\begin{equation} F(x) = \operatorname{sign}\left(\sum_{m=1}^M \alpha_m h_m(x)\right) \end{equation}
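As a concrete illustration of the steps above, here is a minimal NumPy sketch, assuming binary labels $y_i \in \{-1, +1\}$ and single-feature threshold stumps; the names `fit_stump`, `adaboost_fit`, and `adaboost_predict` are illustrative, not taken from any library.

```python
import numpy as np

def fit_stump(X, y, w):
    """Step 2(A): pick the feature, threshold, and polarity with the least weighted error."""
    n, d = X.shape
    best = None  # (error, feature, threshold, polarity)
    for j in range(d):
        for thr in np.unique(X[:, j]):
            for polarity in (1, -1):
                pred = np.where(polarity * (X[:, j] - thr) >= 0, 1, -1)
                err = w[pred != y].sum()              # epsilon_m: sum of misclassified weights
                if best is None or err < best[0]:
                    best = (err, j, thr, polarity)
    return best

def stump_predict(X, j, thr, polarity):
    return np.where(polarity * (X[:, j] - thr) >= 0, 1, -1)

def adaboost_fit(X, y, M=50):
    n = X.shape[0]
    w = np.full(n, 1.0 / n)                           # Step 1: equal weights 1/n
    stumps, alphas = [], []
    for m in range(M):
        err, j, thr, polarity = fit_stump(X, y, w)    # Step 2(A)
        err = min(max(err, 1e-10), 1 - 1e-10)         # guard against division by zero
        alpha = 0.5 * np.log((1 - err) / err)         # Step 2(B): amount of say
        pred = stump_predict(X, j, thr, polarity)
        w = w * np.exp(-alpha * y * pred)             # Step 2(C): e^{+alpha} if wrong, e^{-alpha} if right
        w = w / w.sum()                               # Step 2(D): normalize so the weights sum to 1
        stumps.append((j, thr, polarity))
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(X, stumps, alphas):
    """Output: sign of the alpha-weighted vote of all stumps."""
    scores = np.zeros(X.shape[0])
    for (j, thr, polarity), alpha in zip(stumps, alphas):
        scores += alpha * stump_predict(X, j, thr, polarity)
    return np.sign(scores)
```

For instance, `stumps, alphas = adaboost_fit(X, y, M=10)` followed by `adaboost_predict(X_new, stumps, alphas)` returns the $\pm 1$ predictions of the final model.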
3. Algorithm Exposition
3.1. Step 1
At the beginning of the algorithm, we initialize an equal weight of $\frac{1}{n}$ for every training sample.
Later, we will adjust these weights such that incorrectly classified samples receive higher weights while correctly classified samples receive lower weights.
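For example, with $n = 5$ training samples, every weight starts at $w_{i,1} = \frac{1}{5} = 0.2$.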
3.2. Step 2
(A)
In this step, we build a candidate stump for each feature in the training set and choose the stump that yields the least total error. The calculation of the total error $\epsilon_m$ is straightforward: it is simply the sum of the weights of all incorrectly classified training samples.
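For example, suppose the chosen stump misclassifies $2$ of $10$ training samples, each currently weighted $0.1$: then $\epsilon_m = 0.1 + 0.1 = 0.2$.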
(B)
The total error $\epsilon_m$ is used to compute the “amount of say”, i.e. how much the current stump $h_m$ influences the final model.
The calculation of $\alpha_m$ is also straightforward: the smaller the total error, the larger the amount of say, and a stump with $\epsilon_m = 0.5$ (no better than random guessing) gets $\alpha_m = 0$.
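Continuing the example from (A), $\epsilon_m = 0.2$ gives $\alpha_m = \frac{1}{2}\log\left(\frac{1-0.2}{0.2}\right) = \frac{1}{2}\log 4 \approx 0.69$ (using the natural logarithm).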
(C)
We adjust the weight of each training sample depending on whether $h_m$ classified it correctly.
As said above, we update the weights such that we put a greater emphasis on incorrectly classified samples and a lesser emphasis on correctly classified samples.
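In the running example, $e^{\alpha_m} \approx e^{0.69} \approx 2$ and $e^{-\alpha_m} \approx 0.5$, so each of the $2$ misclassified weights grows from $0.1$ to about $0.2$, while each of the $8$ correctly classified weights shrinks from $0.1$ to about $0.05$.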
(D)
After updating the weights, we have to normalize them such that the sum of all weights equals $1$.
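In the running example, the updated weights sum to $2 \cdot 0.2 + 8 \cdot 0.05 = 0.8$, so dividing every weight by $0.8$ yields $0.25$ for each misclassified sample and $0.0625$ for each correctly classified one, and the weights again sum to $1$.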
At the end of step (D), we repeat Step 2 again for $h_{m+1}$.
Output
So how does the final model $F(x)$ produce its output? Specifically, the final model takes into account the prediction of each stump as well as its amount of say. For example, suppose there are $12$ stumps that predict YES with a total amount of say of $34$, while there are $15$ stumps that predict NO with a total amount of say of $21$. The output of the final model will be YES.
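Below is a minimal sketch of this weighted vote, assuming each stump's prediction is encoded as $+1$ (YES) or $-1$ (NO); `final_vote`, `alphas`, and `preds` are illustrative names, not library functions.

```python
import numpy as np

def final_vote(alphas, preds):
    # F(x) = sign(sum_m alpha_m * h_m(x)): the sign tells us whether the total
    # amount of say behind +1 (YES) or behind -1 (NO) is larger.
    return np.sign(np.dot(alphas, preds))

# The YES side carries a total amount of say of 34 and the NO side 21,
# so the final model outputs +1 (YES).
alphas = np.array([34.0, 21.0])
preds = np.array([+1, -1])
print(final_vote(alphas, preds))  # 1.0
```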