1. Introduction
In this article, we take a deep dive into the mathematics of the XGBoost (eXtreme Gradient Boosting) algorithm proposed by Tianqi Chen.
2. Algorithm
Input: Training dataset $\{ (x_i, y_i) \}$ and a loss function $\mathcal{L}$.
Step 1: Create a leaf with a constant value.
Step 2: For $m = 1$ to $M$ (maximum number of trees):
(A) Calculate the first-order gradients
\[g_i = \frac{\partial \mathcal{L}}{\partial \hat{y}_i^{(m-1)}}\]
(B) Calculate the second-order gradients
\[h_i = \frac{\partial^2 \mathcal{L}}{\partial \big( \hat{y}_i^{(m-1)} \big)^2}\]
(C) Determine the tree structure by choosing the splits that maximize the gain
\[Gain = \frac{1}{2} \bigg[ \frac{G^2_L}{H_L + \lambda} + \frac{G^2_R}{H_R + \lambda} - \frac{G^2_{L+R}}{H_{L+R} + \lambda} \bigg]\]
(D) Build the tree $F_m$ and add it to the model
Output: $F(x) = \sum_{m=0}^M F_m(x)$
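To make the algorithm concrete before the derivations that follow, here is a minimal, self-contained Python sketch (not the real `xgboost` library, and not its API). It boosts depth-1 trees for the squared loss; the leaf-weight and gain formulas it relies on are derived in Sections 3.3 and 3.4, and the depth-1 restriction, the shrinkage factor `lr`, and the synthetic data are simplifying assumptions made only for illustration.

```python
import numpy as np

def leaf_weight(g, h, lam):
    # Optimal leaf weight w_j = -sum(g_i) / (sum(h_i) + lambda), derived in Section 3.3
    return -g.sum() / (h.sum() + lam)

def build_stump(X, g, h, lam):
    """Grow a depth-1 tree: pick the (feature, threshold) pair with maximal gain."""
    n, d = X.shape
    def score(mask):                         # leaf quality G^2 / (H + lambda)
        return g[mask].sum() ** 2 / (h[mask].sum() + lam)
    root = np.ones(n, dtype=bool)
    best = None
    for j in range(d):
        for t in np.unique(X[:, j])[:-1]:    # candidate thresholds (skip the max)
            left = X[:, j] <= t
            gain = 0.5 * (score(left) + score(~left) - score(root))
            if best is None or gain > best[0]:
                best = (gain, j, t,
                        leaf_weight(g[left], h[left], lam),
                        leaf_weight(g[~left], h[~left], lam))
    _, j, t, w_left, w_right = best
    return lambda X_new: np.where(X_new[:, j] <= t, w_left, w_right)

def fit(X, y, M=50, lam=1.0, lr=0.3):
    base = y.mean()                          # Step 1: a single leaf with a constant value
    y_hat = np.full(len(y), base)
    trees = []
    for m in range(M):                       # Step 2: add M trees
        g = -(y - y_hat)                     # (A) first-order gradients (squared loss)
        h = np.ones(len(y))                  # (B) second-order gradients (squared loss)
        tree = build_stump(X, g, h, lam)     # (C) choose the split with maximal gain
        trees.append(tree)                   # (D) add the tree to the ensemble
        y_hat += lr * tree(X)                # additive update of the predictions
    return lambda X_new: base + lr * sum(tree(X_new) for tree in trees)

# Usage on a tiny synthetic regression problem
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 2))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=200)
predict = fit(X, y)
print("training MSE:", np.mean((y - predict(X)) ** 2))
```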
3. Algorithm Exposition
3.1. Regularized Loss Function
The loss function of XGBoost is similar to that of Gradient Boost, except that XGBoost includes an additional regularization term:
\begin{equation} \bbox[15px, border:1px solid black]{ \mathcal{L} = \mathcal{O} + \Omega } \end{equation}
where $\mathcal{O}$ is the training loss and $\Omega$ is the new regularization term. The form of $\mathcal{O}$ depends on whether the task is classification or regression.
3.2. Gradient Tree Boosting
Similar to Gradient Boost, in XGBoost we iteratively add trees to the boosted model with the goal of reducing $\mathcal{L}$.
Let $\hat{y}_i^{(m)}$ be the prediction for the $i$-th instance at the $m$-th iteration of the algorithm. We can write our loss function as follows:
\begin{equation} \label{eq:loss-function-1} \mathcal{L^{(m)}} = \sum_{i=1}^n l(y_i, \hat{y}_i^{(m)}) + \Omega^{(m)} \end{equation}
Due to the nature of boosting algorithms, we also have $\hat{y}_i^{(m)} = \hat{y}_i^{(m-1)} + F_m(x_i)$. Therefore, we can rewrite Equation \eqref{eq:loss-function-1} as follows:
\begin{equation} \label{eq:loss-func} \bbox[15px, border:1px solid black]{ \mathcal{L^{(m)}} = \sum_{i=1}^n l \bigg[ y_i, \hat{y}_i^{(m-1)} + F_m(x_i) \bigg] + \Omega \big[ F_m \big] } \end{equation}
3.3. Loss Function Optimisation
Because it is hard to take the derivative of $l \bigg[ y_i, \hat{y}_i^{(m-1)} + F_m(x_i) \bigg]$ directly, we use a second-order Taylor series to approximate $l$ in Equation \eqref{eq:loss-func}:
\begin{equation} \label{eq:loss-func-appr-1} l \simeq l(y_i, \hat{y}_i^{(m-1)}) + \bigg[ \frac{\partial}{\partial \hat{y}_i^{(m-1)}} l(y_i, \hat{y}_i^{(m-1)}) \bigg] F_m(x_i) + \frac{1}{2} \bigg[ \frac{\partial^2}{\partial \big( \hat{y}_i^{(m-1)} \big)^2} l(y_i, \hat{y}_i^{(m-1)}) \bigg] F_m^2(x_i) \end{equation}
Letting $g_i = \frac{\partial}{\partial \hat{y}_i^{(m-1)}} l(y_i, \hat{y}_i^{(m-1)})$ and $h_i = \frac{\partial^2}{\partial \big( \hat{y}_i^{(m-1)} \big)^2} l(y_i, \hat{y}_i^{(m-1)})$, we can rewrite Equation \eqref{eq:loss-func-appr-1} as follows:
\begin{equation} \label{eq:loss-func-appr-2} l \simeq l(y_i, \hat{y_i}^{(m-1)}) + g_i F_m(x_i) + \frac{1}{2} h_i F_m^2(x_i) \end{equation}
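As a quick numerical sanity check of Equation \eqref{eq:loss-func-appr-2}, the sketch below compares the exact loss with its second-order Taylor approximation for the binary log loss written as a function of the raw score (the form used in Section 3.6); the label, the previous score, and the candidate steps $F_m(x_i)$ are arbitrary values chosen for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(y, z):
    # Negative log likelihood as a function of the raw score z (probability = sigmoid(z))
    p = sigmoid(z)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

y, z_prev = 1.0, -0.4                         # arbitrary label and previous raw prediction
g = sigmoid(z_prev) - y                       # first derivative of the loss at z_prev
h = sigmoid(z_prev) * (1 - sigmoid(z_prev))   # second derivative of the loss at z_prev

for step in (0.05, 0.2, 0.8):                 # candidate tree outputs F_m(x_i)
    exact = log_loss(y, z_prev + step)
    approx = log_loss(y, z_prev) + g * step + 0.5 * h * step ** 2
    print(f"F_m = {step:>4}: exact = {exact:.5f}, 2nd-order approx = {approx:.5f}")
```

The approximation is close for small steps and degrades as the step grows, which is exactly the regime in which the boosted trees operate.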
Plugging Equation \eqref{eq:loss-func-appr-2} into Equation \eqref{eq:loss-func}, we have:
\begin{equation} \mathcal{L^{(m)}} \simeq \sum_{i=1}^n \bigg[ l(y_i, \hat{y}_i^{(m-1)}) + g_i F_m(x_i) + \frac{1}{2} h_i F_m^2(x_i) \bigg] + \Omega \big[ F_m \big] \end{equation}
Recall that our goal is to minimize $\mathcal{L^{(m)}}$ w.r.t. $F_m(x_i)$. Since $l(y_i, \hat{y}_i^{(m-1)})$ does not depend on $F_m(x_i)$, it is a constant in this minimization. Therefore, we can simplify $\mathcal{L^{(m)}}$ by omitting that term:
\begin{equation} \mathcal{L^{(m)}} \simeq \sum_{i=1}^n \bigg[ g_i F_m(x_i) + \frac{1}{2} h_i F_m^2(x_i) \bigg] + \Omega \big[ F_m \big] \end{equation}
Let $I_j$ be the set of instances falling into leaf $j$ of the $m$-th tree, let $w_j$ be the output value (weight) of leaf $j$, so that $F_m(x_i) = w_j$ for all $i \in I_j$, and let $T$ be the total number of leaves of the $m$-th tree. Using the regularization term $\Omega \big[ F_m \big] = \frac{1}{2} \lambda \sum_{j=1}^T w_j^2$, our loss function becomes:
\begin{equation} \Rightarrow \mathcal{L^{(m)}} \simeq \sum_{j=1}^T \bigg[ \big( \sum_{i \in I_j} g_i \big) w_j + \frac{1}{2} \big( \sum_{i \in I_j} h_i \big) w_j^2 \bigg] + \frac{1}{2} \lambda \sum_{j=1}^T w_j^2 \end{equation}
\begin{equation} \label{eq:loss-func-appr-3} \Rightarrow \mathcal{L^{(m)}} \simeq \sum_{j=1}^T \bigg[ \big( \sum_{i \in I_j} g_i \big) w_j + \frac{1}{2} \big( \sum_{i \in I_j} h_i + \lambda \big) w_j^2 \bigg] \end{equation}
Our loss function $\mathcal{L^{(m)}}$ looks much neater now. We are ready to take its derivative with respect to $w_j$ and set it to $0$ to find the leaf weight $w_j$ at which $\mathcal{L^{(m)}}$ is minimal:
\[\begin{align} \frac{\partial \mathcal{L^{(m)}}}{\partial w_j} &= 0 \\ \Leftrightarrow \frac{\partial}{\partial w_j} \bigg[ \big( \sum_{i \in I_j} g_i \big) w_j + \frac{1}{2} \big( \sum_{i \in I_j} h_i + \lambda \big) w_j^2 \bigg] &= 0 \\ \Leftrightarrow \big( \sum_{i \in I_j} g_i \big) + \big( \sum_{i \in I_j} h_i + \lambda \big) w_j &= 0 \\ \end{align}\]
\begin{equation} \label{eq:w-j}
\Rightarrow
\bbox[15px, border:1px solid black]{
w_j = - \frac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} h_i + \lambda}
}
\end{equation}
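As a small numerical illustration of Equation \eqref{eq:w-j} (with arbitrary gradient and Hessian values), the optimal leaf weight is simply the negated sum of gradients over the sum of Hessians plus $\lambda$:

```python
import numpy as np

def leaf_weight(g, h, lam):
    # w_j = - sum(g_i) / (sum(h_i) + lambda) over the instances in leaf j
    return -np.sum(g) / (np.sum(h) + lam)

# Example: a leaf containing three instances (arbitrary gradients and Hessians)
g = np.array([-0.7, -0.3, 0.2])
h = np.array([1.0, 1.0, 1.0])
print(leaf_weight(g, h, lam=1.0))   # -(-0.8) / (3 + 1) = 0.2
```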
Plugging Equation \eqref{eq:w-j} into Equation \eqref{eq:loss-func-appr-3}, we obtain the minimal value of $\mathcal{L}^{(m)}$ for a given tree structure:
\[\begin{align} \mathcal{L^{(m)}} &\simeq \sum_{j=1}^T \big( \sum_{i \in I_j} g_i \big) \bigg[ - \frac{\big( \sum_{i \in I_j} g_i \big)}{\big( \sum_{i \in I_j} h_i + \lambda \big)} \bigg] + \frac{1}{2} \big( \sum_{i \in I_j} h_i + \lambda \big) \bigg[ - \frac{\big( \sum_{i \in I_j} g_i \big)}{\big( \sum_{i \in I_j} h_i + \lambda \big)} \bigg]^2 \\ &= \sum_{j=1}^T - \bigg[ \frac{\big( \sum_{i \in I_j} g_i \big)^2}{\big( \sum_{i \in I_j} h_i + \lambda \big)} \bigg] + \frac{1}{2} \bigg[ \frac{\big( \sum_{i \in I_j} g_i \big)^2}{\big( \sum_{i \in I_j} h_i + \lambda \big)} \bigg] \\ &= \sum_{j=1}^T - \frac{1}{2} \bigg[ \frac{\big( \sum_{i \in I_j} g_i \big)^2}{\big( \sum_{i \in I_j} h_i + \lambda \big)} \bigg] \\ \end{align}\]
\begin{equation} \label{eq:loss-min} \Rightarrow \bbox[15px, border:1px solid black]{ \mathcal{L^{(m)}} \simeq - \frac{1}{2} \sum_{j=1}^T \bigg[ \frac{\big( \sum_{i \in I_j} g_i \big)^2}{\big( \sum_{i \in I_j} h_i + \lambda \big)} \bigg] } \end{equation}
3.4. A Metric to Split Trees
Equation \eqref{eq:loss-min} measures how good a given tree structure is: the smaller (more negative) the value, the better the tree. To decide whether to split a leaf into a left child $L$ and a right child $R$, XGBoost computes the reduction in loss brought by the split:
\[\begin{align} Gain &= \text{Loss before split} - \text{Loss after split} \\ &= - \frac{1}{2} \frac{(\sum_{i \in I_{L+R}} g_i)^2}{\sum_{i \in I_{L+R}} h_i + \lambda} - \bigg[ - \frac{1}{2} \frac{(\sum_{i \in I_L} g_i)^2}{\sum_{i \in I_L} h_i + \lambda} - \frac{1}{2} \frac{(\sum_{i \in I_R} g_i)^2}{\sum_{i \in I_R} h_i + \lambda} \bigg] \\ \end{align}\]
\begin{equation} \Rightarrow \bbox[15px, border:1px solid black]{ Gain = \frac{1}{2} \bigg[ \frac{(\sum_{i \in I_L} g_i)^2}{\sum_{i \in I_L} h_i + \lambda} + \frac{(\sum_{i \in I_R} g_i)^2}{\sum_{i \in I_R} h_i + \lambda} - \frac{(\sum_{i \in I_{L+R}} g_i)^2}{\sum_{i \in I_{L+R}} h_i + \lambda} \bigg] } \end{equation}
\begin{equation} \Rightarrow \bbox[15px, border:1px solid black]{ Gain = \frac{1}{2} \bigg[ \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{G_{L+R}^2}{H_{L+R} + \lambda} \bigg] } \end{equation}
where $G_L = \sum_{i \in I_L} g_i$, $H_L = \sum_{i \in I_L} h_i$, and similarly for the right child $R$ and the parent node $L+R$.
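The snippet below sketches how this gain can drive exact greedy split finding on a single feature; the data values are arbitrary, and no minimum-gain threshold is applied.

```python
import numpy as np

def best_split(x, g, h, lam=1.0):
    """Scan one feature and return (best_gain, best_threshold)."""
    def score(G, H):                         # leaf quality G^2 / (H + lambda), from Eq. (loss-min)
        return G ** 2 / (H + lam)
    G_all, H_all = g.sum(), h.sum()
    best_gain, best_t = -np.inf, None
    for t in np.unique(x)[:-1]:              # candidate thresholds (skip the maximum value)
        left = x <= t
        G_L, H_L = g[left].sum(), h[left].sum()
        G_R, H_R = G_all - G_L, H_all - H_L
        gain = 0.5 * (score(G_L, H_L) + score(G_R, H_R) - score(G_all, H_all))
        if gain > best_gain:
            best_gain, best_t = gain, t
    return best_gain, best_t

# Example with arbitrary data: squared-loss gradients g = -(y - y_hat), Hessians h = 1
x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
residuals = np.array([-1.0, -1.2, -0.8, 2.1, 1.9, 2.0])      # y - y_hat
gain, t = best_split(x, g=-residuals, h=np.ones_like(x))
print(gain, t)   # the best threshold separates the two clusters of residuals (t = 3.0)
```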
3.5. Regression
When XGBoost is used for regression, we can use the squared loss $l = \frac{1}{2} (y_i - \hat{y}_i)^2$ as our loss function. In that case, we have:
\begin{equation} \label{eq:g-1} \bbox[15px, border:1px solid black]{ g_i = \frac{\partial l}{\partial \hat{y}_i} = - (y_i - \hat{y}_i) } \end{equation}
\begin{equation} \label{eq:h-1} \bbox[15px, border:1px solid black]{ h_i = \frac{\partial^2 l}{\partial \hat{y}_i^2} = 1 } \end{equation}
Plugging Equations \eqref{eq:g-1} and \eqref{eq:h-1} into Equation \eqref{eq:w-j}, we have:
\begin{equation} \label{eq:w-j-regression} \Rightarrow \bbox[15px, border:1px solid black]{ w_j = \frac{\sum_{i \in I_j} (y_i - \hat{y}_i)}{\big( \sum_{i \in I_j} 1 \big) + \lambda} } \end{equation}
Plugging Equations \eqref{eq:g-1} and \eqref{eq:h-1} into Equation \eqref{eq:loss-min}, we have:
\begin{equation} \label{eq:optimal-loss-regression} \Rightarrow \bbox[15px, border:1px solid black]{ \mathcal{L^{(m)}} \simeq - \frac{1}{2} \sum_{j=1}^T \frac{\big[ \sum_{i \in I_j} (y_i - \hat{y}_i) \big]^2}{\big( \sum_{i \in I_j} 1 \big) + \lambda} } \end{equation}
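In other words, for the squared loss each leaf weight is the sum of the residuals in the leaf divided by the number of residuals plus $\lambda$ (note that $\sum_{i \in I_j} 1$ is simply the number of instances in leaf $j$). A tiny numerical illustration with arbitrary residuals:

```python
import numpy as np

# Residuals (y_i - y_hat_i) of the instances falling into one leaf (arbitrary values)
residuals = np.array([0.5, 0.8, -0.1, 0.6])
lam = 1.0

# Eq. (w-j-regression): sum of residuals / (number of residuals + lambda)
w_j = residuals.sum() / (len(residuals) + lam)
print(w_j)          # 1.8 / 5 = 0.36

# Contribution of this leaf to the optimal loss in Eq. (optimal-loss-regression)
leaf_loss = -0.5 * residuals.sum() ** 2 / (len(residuals) + lam)
print(leaf_loss)    # -0.5 * 1.8^2 / 5 = -0.324
```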
3.6. Classification
When XGBoost is used for binary classification, we can use the negative log loss $l = - [y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i)]$ as our loss function, where $\hat{y}_i = \sigma(z_i)$ is the predicted probability obtained by applying the sigmoid function to the raw score $z_i$. Taking the derivatives with respect to the raw score $z_i$, we have:
\begin{equation} \label{eq:g-2} \bbox[15px, border:1px solid black]{ g_i = \frac{\partial l}{\partial z_i} = - (y_i - \hat{y}_i) } \end{equation}
\begin{equation} \label{eq:h-2} \bbox[15px, border:1px solid black]{ h_i = \frac{\partial^2 l}{\partial z_i^2} = \hat{y}_i (1 - \hat{y}_i) } \end{equation}
Plugging Equations \eqref{eq:g-2} and \eqref{eq:h-2} into Equation \eqref{eq:w-j}, we have:
\begin{equation} \label{eq:f-m-classification} \Rightarrow \bbox[15px, border:1px solid black]{ w_j = \frac{\sum_{i \in I_j} (y_i - \hat{y}_i)}{\big[ \sum_{i \in I_j} \hat{y}_i (1 - \hat{y}_i) \big] + \lambda} } \end{equation}
Plugging Equations \eqref{eq:g-2} and \eqref{eq:h-2} into Equation \eqref{eq:loss-min}, we have:
\begin{equation} \label{eq:optimal-loss-classification} \Rightarrow \bbox[15px, border:1px solid black]{ \mathcal{L^{(m)}} \simeq - \frac{1}{2} \sum_{j=1}^T \frac{\big[ \sum_{i \in I_j} (y_i - \hat{y}_i) \big]^2}{\big[ \sum_{i \in I_j} \hat{y}_i (1 - \hat{y}_i) \big] + \lambda} } \end{equation}
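A small numerical illustration of these formulas (with arbitrary labels and predicted probabilities), computing $g_i$, $h_i$, and the resulting leaf weight, which lives on the raw-score (log-odds) scale:

```python
import numpy as np

# Labels and current predicted probabilities of the instances in one leaf (arbitrary values)
y = np.array([1.0, 1.0, 0.0, 1.0])
p = np.array([0.6, 0.4, 0.3, 0.7])   # y_hat_i = sigmoid(raw score)
lam = 1.0

g = -(y - p)                          # Eq. (g-2)
h = p * (1 - p)                       # Eq. (h-2)

# Eq. (f-m-classification): optimal leaf weight on the raw-score (log-odds) scale
w_j = (y - p).sum() / (h.sum() + lam)
print(w_j)          # 1.0 / (0.9 + 1.0) ≈ 0.526

# Contribution of this leaf to the optimal loss in Eq. (optimal-loss-classification)
leaf_loss = -0.5 * (y - p).sum() ** 2 / (h.sum() + lam)
print(leaf_loss)    # ≈ -0.263
```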