# GrowNet: Gradient Boosting Neural Networks

## Introduction

GrowNet is a gradient-boosting framework that builds a complex neural network incrementally from shallow neural networks, which serve as the weak learners. It can be applied to several kinds of learning tasks, including regression and classification.

## A Brief Refresher of Gradient Boosting Algorithms

Gradient boosting builds models sequentially, with each new model trying to reduce the error left by the models before it. This is done by fitting the new model to the residuals, or errors, of the current ensemble. More generally, gradient boosting estimates a function by numerical optimization: each new model is fitted to the negative gradient of the loss with respect to the current predictions. The most common weak learners are decision trees, where each new tree is fitted to the negative gradient left by the previous trees.
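The residual-fitting loop described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not a reference implementation: it uses a squared-error loss, one-dimensional inputs, and one-split regression stumps as weak learners. The names `fit_stump`, `gradient_boost`, and `mse` are made up for this example.

```python
def fit_stump(xs, residuals):
    """Fit a one-split regression stump to the residuals by least squares."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # skip the max so both sides are non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def gradient_boost(xs, ys, n_rounds=50, learning_rate=0.1):
    """Squared-error boosting: each new stump is fitted to the current residuals."""
    base = sum(ys) / len(ys)  # initial constant model
    learners = []

    def predict(x):
        return base + learning_rate * sum(f(x) for f in learners)

    for _ in range(n_rounds):
        # For the loss 0.5 * (y - y_hat)^2 the negative gradient is the residual.
        residuals = [y - predict(x) for x, y in zip(xs, ys)]
        learners.append(fit_stump(xs, residuals))
    return predict


def mse(model, xs, ys):
    return sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


xs = [i / 10 for i in range(20)]
ys = [x * x for x in xs]
baseline = gradient_boost(xs, ys, n_rounds=0)   # constant model only
boosted = gradient_boost(xs, ys, n_rounds=50)
print(mse(baseline, xs, ys), mse(boosted, xs, ys))
```

Each round shrinks the training error a little, because the new stump is a least-squares fit to whatever the ensemble still gets wrong.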

## GrowNet - A Novel Boosting idea applied to Neural Networks

The main idea behind gradient boosting is to use simple, lower-level models as building blocks for a stronger, higher-order model. The building blocks are added one at a time by sequential gradient boosting, using the first- and second-order gradients of the loss. In such a model, each weak learner improves the performance of the combined higher-order model.

At each boosting step, the original input features are augmented with the output of the penultimate layer of the previously trained weak learner. This merged feature set is used as the input for training the next weak learner, which is fitted to the current residuals through the boosting mechanism. The outputs of all sequentially trained models are then weighted and summed to give the final output.
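The feature-merging idea can be illustrated with a small forward pass. The sketch below assumes each weak learner is a one-hidden-layer network with a tanh activation and a linear output. The class `ShallowNet`, the function `ensemble_forward`, and the layer sizes are hypothetical choices for illustration. The weights are random and untrained, so only the data flow is meaningful.

```python
import math
import random


class ShallowNet:
    """Hypothetical weak learner: one tanh hidden layer plus a linear output."""

    def __init__(self, in_dim, hidden_dim, rng):
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(in_dim)]
                   for _ in range(hidden_dim)]
        self.w2 = [rng.uniform(-0.5, 0.5) for _ in range(hidden_dim)]

    def forward(self, features):
        # Penultimate-layer activations, reused as extra features downstream.
        hidden = [math.tanh(sum(w * v for w, v in zip(row, features)))
                  for row in self.w1]
        output = sum(w * h for w, h in zip(self.w2, hidden))
        return hidden, output


def ensemble_forward(learners, alphas, x):
    """Run the learners in order, feeding each one x plus the previous hidden output."""
    prediction, prev_hidden = 0.0, []
    for net, alpha in zip(learners, alphas):
        hidden, output = net.forward(list(x) + prev_hidden)  # augmented input
        prediction += alpha * output                          # weighted sum of outputs
        prev_hidden = hidden
    return prediction


rng = random.Random(0)
d, hidden_dim = 3, 4
# The first learner sees only x; later ones see x plus the previous hidden layer.
learners = [ShallowNet(d, hidden_dim, rng)]
learners += [ShallowNet(d + hidden_dim, hidden_dim, rng) for _ in range(2)]
alphas = [1.0, 1.0, 1.0]
y_hat = ensemble_forward(learners, alphas, [0.2, -0.1, 0.7])
```

Every learner after the first has an input width of `d + hidden_dim`, which is the only structural change the augmentation requires.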

Assume a dataset with m samples, each described by a d-dimensional feature vector:

$$\mathrm{T\:=\:\{(x_i,\:y_i)\:\mid\:x_i \in \mathbb{R}^d,\:y_i \in \mathbb{R},\:|T|\:=\:m\}}$$

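Assuming GrowNet runs for N boosting iterations, its output for sample $\mathrm{x_i}$ is the weighted sum

$$\mathrm{\hat{y}_i\:=\:\mathcal{E}(x_i)\:=\:\sum\limits_{k=0}^{N} \alpha_k\:f_k(x_i), \:\: f_k \in \mathcal{F}}$$

where $\mathrm{\mathcal{F}}$ is the space of multilayer perceptrons, $\mathrm{\alpha_k}$ is the step size (boosting rate), and each $\mathrm{f_k}$ is a shallow neural network with its own output layer.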

If l is a differentiable loss function, the objective is to minimize

$$\mathrm{L(\mathcal{E})\:=\:\sum_{i=1}^{m}l(y_i,\hat{y}_{i})}$$

A regularization term can also be added to this objective.

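Let $\mathrm{\hat{y}_{i}^{(t-1)}\:=\:\sum_{k=0}^{t-1}\alpha_kf_k(x_i)}$ be the GrowNet output at stage t-1 for sample $\mathrm{x_i}$. The objective at stage t is then

$$\mathrm{L^{(t)}\:=\:\sum_{i=1}^{m}l(y_i,\hat{y}_{i}^{(t-1)}+\alpha_tf_t(x_i))}$$

Taking a second-order Taylor expansion of l around $\mathrm{\hat{y}_{i}^{(t-1)}}$, the objective for the weak learner $\mathrm{f_t}$ becomes a weighted least-squares problem:

$$\mathrm{L^{(t)}\:=\:\sum_{i=1}^{m}h_i(\tilde{y}_{i}\:-\:\alpha_tf_t(x_i))^2}$$

where $\mathrm{g_i}$ and $\mathrm{h_i}$ are the first- and second-order gradients of l with respect to $\mathrm{\hat{y}_i}$, evaluated at $\mathrm{\hat{y}_{i}^{(t-1)}}$, and

$$\mathrm{\tilde{y}_{i}\:=\:-g_i/h_i}$$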
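The step from the stage objective to the least-squares form can be made explicit. As a sketch, write $\mathrm{a_i\:=\:\alpha_tf_t(x_i)}$ for the contribution of the new weak learner (a shorthand introduced here only for illustration), let $\mathrm{g_i}$ and $\mathrm{h_i}$ be the first and second derivatives of l with respect to the prediction, evaluated at $\mathrm{\hat{y}_i^{(t-1)}}$, and assume $\mathrm{h_i > 0}$:

$$\mathrm{l(y_i,\hat{y}_i^{(t-1)}+a_i)\:\approx\:l(y_i,\hat{y}_i^{(t-1)})\:+\:g_ia_i\:+\:\frac{1}{2}h_ia_i^2}$$

$$\mathrm{g_ia_i\:+\:\frac{1}{2}h_ia_i^2\:=\:\frac{h_i}{2}\left(a_i\:+\:\frac{g_i}{h_i}\right)^2\:-\:\frac{g_i^2}{2h_i}\:=\:\frac{h_i}{2}\left(\tilde{y}_i\:-\:a_i\right)^2\:-\:\frac{g_i^2}{2h_i}}$$

The terms $\mathrm{l(y_i,\hat{y}_i^{(t-1)})}$ and $\mathrm{g_i^2/(2h_i)}$ do not depend on the new learner, so minimizing the expansion amounts to a least-squares fit of $\mathrm{a_i}$ to the target $\mathrm{\tilde{y}_i\:=\:-g_i/h_i}$, with each sample weighted by $\mathrm{h_i}$.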

### Introduction of corrective step

At each boosting stage t, only the parameters of the t-th weak learner are updated, and the previous t-1 weak learners are left unchanged. Because of this greedy procedure, the model may get stuck in a local minimum during learning, and a fixed boosting rate $\mathrm{\alpha_k}$ makes the problem worse. To counter this, a corrective step is introduced: in each corrective step, the parameters of all previous t-1 weak learners are also updated through backpropagation.
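A corrective step can be sketched as one gradient-descent update applied to every learner at once. The example below is a simplified illustration, not the authors' implementation. It assumes scalar inputs, weak learners of the form `v * tanh(w * x + b)`, an MSE loss, and a boosting rate `alpha` per learner that is treated as a trainable parameter. All function and key names are hypothetical, and the gradients are written out by hand instead of using an autodiff library.

```python
import math


def predict(learners, x):
    """Ensemble output: sum of alpha_k * f_k(x), with f_k(x) = v * tanh(w * x + b)."""
    return sum(p["alpha"] * p["v"] * math.tanh(p["w"] * x + p["b"])
               for p in learners)


def mse(learners, xs, ys):
    return sum((predict(learners, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def corrective_step(learners, xs, ys, lr=0.05):
    """One gradient-descent step on the MSE loss over ALL learners and boosting rates."""
    grads = [{"alpha": 0.0, "v": 0.0, "w": 0.0, "b": 0.0} for _ in learners]
    for x, y in zip(xs, ys):
        err = 2.0 * (predict(learners, x) - y) / len(xs)  # dLoss / dPrediction
        for p, g in zip(learners, grads):
            t = math.tanh(p["w"] * x + p["b"])
            g["alpha"] += err * p["v"] * t
            g["v"] += err * p["alpha"] * t
            g["w"] += err * p["alpha"] * p["v"] * (1.0 - t * t) * x
            g["b"] += err * p["alpha"] * p["v"] * (1.0 - t * t)
    # Apply the update only after all gradients are accumulated.
    for p, g in zip(learners, grads):
        for name in p:
            p[name] -= lr * g[name]


xs = [-1.0, -0.5, 0.0, 0.5, 1.0]
ys = [math.sin(2.0 * x) for x in xs]
# Two already-trained learners (values chosen arbitrarily for the example).
learners = [
    {"alpha": 1.0, "v": 0.5, "w": 1.0, "b": 0.0},
    {"alpha": 1.0, "v": -0.3, "w": 0.5, "b": 0.2},
]
loss_before = mse(learners, xs, ys)
for _ in range(200):
    corrective_step(learners, xs, ys)
loss_after = mse(learners, xs, ys)
print(loss_before, loss_after)
```

In a full training loop, a step like this would run after each new weak learner has been fitted, so that earlier learners can adapt to the ones added later.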

## Applications of GrowNets

GrowNets can be used for both Regression and Classification.

### For Regression

The mean squared error (MSE) loss is used for regression. If l is the squared error, the first- and second-order gradients at stage t and the resulting target are

$$\mathrm{g_i\:=\:2(\hat{y}_i^{(t-1)}\:-\:y_i), \:\:\:h_i\:=\:2}$$

$$\mathrm{\tilde{y}_{i}\:=\:-g_i/h_i\:=\:y_i\:-\:\hat{y}_i^{(t-1)}}$$

The next weak learner $\mathrm{f_t}$ is then trained by least-squares regression on the pairs $\mathrm{(x_i, \tilde{y}_i)}$ for i = 1, 2, ..., m, and in the corrective step all model parameters in GrowNet are updated again using the MSE loss.
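For the squared-error case, the targets for the next weak learner are just the residuals. The short sketch below computes the gradients and targets for a handful of made-up values. The function name `mse_boosting_targets` and the sample numbers are illustrative only.

```python
def mse_boosting_targets(ys, prev_preds):
    """Return (g, h, y_tilde) per sample for the loss l = (y - y_hat)^2."""
    grads = [2.0 * (p - y) for y, p in zip(ys, prev_preds)]   # first-order gradient
    hess = [2.0 for _ in ys]                                  # second-order gradient
    targets = [-g / h for g, h in zip(grads, hess)]           # y_tilde = -g / h
    return grads, hess, targets


ys = [3.0, -1.0, 0.5]            # true labels
prev_preds = [2.5, 0.0, 0.5]     # ensemble output after stage t-1
grads, hess, targets = mse_boosting_targets(ys, prev_preds)
print(targets)
```

Because the second-order gradient is the same constant for every sample, the weighted least-squares fit reduces to an ordinary one.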

### For Classification

For binary classification, the binary cross-entropy loss is used, which is differentiable. With labels $\mathrm{y_i \in \{-1, +1\}}$, the loss can be written as $\mathrm{l(y_i,\hat{y}_i)\:=\:\ln(1\:+\:e^{-2y_i\hat{y}_i})}$, and the first- and second-order gradients at stage t are

$$\mathrm{g_i\:=\:\frac{-2y_i}{1\:+\:e^{2y_i\hat{y}_i^{(t-1)}}},\: \: h_i\:=\:\frac{4y_i^{2}e^{2y_i\hat{y}_i^{(t-1)}}}{(1\:+\:e^{2y_i\hat{y}_i^{(t-1)}})^2}}$$

$$\mathrm{\tilde{y}_i\:=\:-g_i/h_i\:=\:y_i(1\:+\:e^{-2y_i\hat{y}_i^{(t-1)}})/2}$$

The next weak learner is fitted by least-squares regression on these second-order targets $\mathrm{\tilde{y}_i}$. In the corrective step, the parameters of all weak learners are updated using the binary cross-entropy loss.
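The classification targets can be checked numerically. The sketch below assumes the loss takes the form ln(1 + exp(-2 * y * y_hat)) with labels in {-1, +1}, computes its first- and second-order gradients, and compares -g/h against the closed form y(1 + exp(-2 * y * y_hat))/2. The function name `bce_boosting_targets` and the sample values are made up for illustration.

```python
import math


def bce_boosting_targets(ys, prev_preds):
    """Return (g, h, y_tilde) per sample for l = ln(1 + exp(-2 * y * y_hat))."""
    grads, hess, targets = [], [], []
    for y, p in zip(ys, prev_preds):
        e = math.exp(2.0 * y * p)
        g = -2.0 * y / (1.0 + e)                   # first-order gradient
        h = 4.0 * y * y * e / (1.0 + e) ** 2       # second-order gradient
        grads.append(g)
        hess.append(h)
        targets.append(-g / h)                     # y_tilde = -g / h
    return grads, hess, targets


ys = [1, -1, 1]                   # labels in {-1, +1}
prev_preds = [0.0, 0.3, -1.2]     # ensemble output after stage t-1
grads, hess, targets = bce_boosting_targets(ys, prev_preds)

# Closed form for comparison.
closed_form = [y * (1.0 + math.exp(-2.0 * y * p)) / 2.0
               for y, p in zip(ys, prev_preds)]
print(targets, closed_form)
```

Unlike the regression case, the second-order gradient here varies per sample, so it also acts as a sample weight in the least-squares fit.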

## Conclusion

GrowNet applies gradient boosting to neural networks, giving one framework that is flexible enough to handle several machine learning tasks. It is proposed as an alternative to training a single deep neural network, and is claimed to give better performance with less training time.