**ensemble method**.

## Ensemble method

The ensemble method is not one specific algorithm, but a general idea for training machine learning models: train multiple models, then aggregate their results. From this idea the industry has derived three concrete techniques, namely Bagging, Boosting, and Stacking.

**Bagging** Bagging is short for bootstrap aggregating. In the Bagging method, we create K data sets by random sampling with replacement. In each data set some samples may appear more than once and others may never appear, but overall every sample has the same probability of being drawn. We then train K models on the K sampled data sets. The model type here is not restricted; we can use any machine learning model. The K models naturally produce K results, and we aggregate them by majority vote. For example, suppose K = 25 in a binary classification problem: if 10 models predict 0 and 15 models predict 1, the final prediction of the whole ensemble is 1, as if the K models had voted democratically. The famous random forest works this way.
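The bootstrap-and-vote loop described above can be sketched in a few lines. This is a toy illustration on synthetic data, assuming decision trees as the base model (Bagging allows any model):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

K = 25
models = []
for _ in range(K):
    # Sample with replacement: some rows repeat, some never appear.
    idx = rng.integers(0, len(X_train), size=len(X_train))
    models.append(DecisionTreeClassifier(random_state=0).fit(X_train[idx], y_train[idx]))

# Majority vote over the K predictions (binary 0/1 labels, so the mean works).
votes = np.mean([m.predict(X_test) for m in models], axis=0)
ensemble_pred = (votes >= 0.5).astype(int)
```

Because the K models are independent of one another, this loop could just as well be run in parallel.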

**Boosting** The idea of Boosting is very similar to Bagging, and the sampling logic is consistent. The difference is that in Boosting the K models are not trained simultaneously, but serially. Each model is trained based on the results of the previous model, and pays more attention to the samples that the previous model misclassified. Each sample also carries a weight, and samples that are misclassified more often receive larger weights. Likewise, each model is assigned a weight according to its ability, and in the end all models are combined by a weighted sum rather than a fair vote. This mechanism also affects training efficiency: because all the models in Bagging are entirely independent, they can be trained in a distributed fashion, while each model in Boosting depends on the previous model's result, so the models can only be trained in series.
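The reweighting step at the heart of Boosting can be illustrated on its own. In this toy sketch (the value of `alpha` is made up; in a real algorithm it would come from the model's error rate), the two samples that model 1 gets wrong end up with larger weights, so the next model in the sequence focuses on them:

```python
import numpy as np

y_true = np.array([1, 1, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 1])   # model 1's predictions: two mistakes
weights = np.full(5, 0.2)            # start with equal weights

alpha = 0.5                          # model weight (would depend on its error rate)
wrong = y_pred != y_true
# Scale up misclassified samples, scale down correct ones.
weights = weights * np.exp(np.where(wrong, alpha, -alpha))
weights /= weights.sum()             # renormalize to a distribution
```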

**Stacking** Stacking is a method often used in Kaggle competitions, and its idea is straightforward. We choose K different models and use cross-validation to train and predict on the training set, ensuring that each model produces a prediction for every training sample. For each training sample we then have K results. Next we build a second-layer model whose training features are these K results. In other words, Stacking uses a multi-layer model structure: the training features of each later layer are the predictions of the layer above it, and it is left to the model itself to learn which models' results are most worth adopting and how to combine the models' strengths. AdaBoost, which we introduce today, is, as the name suggests, a classic Boosting algorithm.
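The two-layer structure described above can be sketched with scikit-learn's `cross_val_predict`, which produces the out-of-fold predictions that become the second layer's features. This is a minimal illustration with two arbitrary first-layer models on synthetic data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)
first_layer = [DecisionTreeClassifier(random_state=0),
               KNeighborsClassifier()]

# Each column holds one model's out-of-fold predictions for every sample.
meta_features = np.column_stack([
    cross_val_predict(m, X, y, cv=5) for m in first_layer
])

# The second-layer model learns how to combine the first-layer results.
meta_model = LogisticRegression().fit(meta_features, y)
```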

## Modeling Ideas

The core idea of AdaBoost is to use the Boosting method to build a strong classifier out of weak classifiers. A strong classifier is easy to understand: a model with reliable performance. How, then, should a weak classifier be recognized? The strength of a model is defined relative to a random guess: the more a model outperforms a random result, the stronger it is. From this point of view, a weak classifier is a classifier that is only slightly better than random guessing. Our goal is to design the weights of the samples and the models so that the best decisions are made, and these weak classifiers combine into the effect of a strong classifier.

First, we assign a weight to each training sample; in the beginning, all weights are equal. A weak classifier is trained on the weighted samples, and its error rate epsilon is calculated: the total weight of the misclassified samples, which with equal weights equals the number of misclassified samples divided by the total number of samples. Each classifier is also assigned a weight alpha, computed from its error rate; the higher the weight, the greater its say in the final vote:

alpha = 1/2 * ln((1 - epsilon) / epsilon)

With alpha, we update the sample weights before training the weak classifier again on the same data set: a correctly classified sample's weight w is changed to w * e^(-alpha), and a misclassified sample's weight is changed to w * e^(alpha); the weights are then normalized so that they sum to 1. In this way all weights are updated, which completes one round of iteration.
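One round of these update rules can be checked numerically. Suppose a weak classifier misclassifies 3 of 10 equally weighted samples:

```python
import numpy as np

weights = np.full(10, 0.1)
wrong = np.zeros(10, dtype=bool)
wrong[:3] = True                                  # 3 of 10 samples misclassified

epsilon = weights[wrong].sum()                    # weighted error rate: 0.3
alpha = 0.5 * np.log((1 - epsilon) / epsilon)     # classifier weight, about 0.42

# Scale up the misclassified samples, scale down the correct ones, renormalize.
weights = weights * np.exp(np.where(wrong, alpha, -alpha))
weights /= weights.sum()
```

A nice property falls out of this choice of alpha: after the update, the misclassified samples hold exactly half of the total weight, so the previous classifier looks no better than random under the new distribution and the next classifier is forced to learn something new.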
AdaBoost repeats this iteration, adjusting the weights, until the training error rate reaches 0 or the number of weak classifiers reaches a threshold.

## Source Code

First, let’s get the data. Here we choose the breast cancer prediction data in the sklearn datasets. Like the previous example, we can import it directly and use it, which is very convenient:

```
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
breast = load_breast_cancer()
X, y = breast.data, breast.target
# reshape
y = y.reshape((-1, 1))
```
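As a quick sanity check on what we just loaded (this print is only for inspection): the data set has 569 samples with 30 numeric features each, and a binary target where 0 means malignant and 1 means benign.

```python
from sklearn.datasets import load_breast_cancer

breast = load_breast_cancer()
X, y = breast.data, breast.target
print(X.shape, y.shape)  # (569, 30) (569,)
```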

```
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=23)
```

```
def loss_error(y_pred, y, weight):
    # Weighted error: total weight of the misclassified samples.
    return float(weight.T.dot(y_pred != y))

def stump_classify(X, idx, threshold, comparator):
    # Predict with a single decision stump on feature `idx`.
    if comparator == 'lt':
        return X[:, idx] <= threshold
    else:
        return X[:, idx] > threshold

def get_thresholds(X, i):
    # Candidate split points: 10 evenly spaced values over the feature's range.
    min_val, max_val = X[:, i].min(), X[:, i].max()
    return np.linspace(min_val, max_val, 10)
```

```
def build_stump(X, y, weight):
    # Exhaustively search feature / threshold / direction for the stump
    # with the lowest weighted error under the current sample weights.
    m, n = X.shape
    ret_stump, ret_pred = None, []
    best_error = float('inf')
    for i in range(n):
        for j in get_thresholds(X, i):
            for c in ['lt', 'gt']:
                pred = stump_classify(X, i, j, c).reshape((-1, 1))
                err = loss_error(pred, y, weight)
                if err < best_error:
                    best_error, ret_pred = err, pred.copy()
                    ret_stump = {'idx': i, 'threshold': j, 'comparator': c}
    return ret_stump, best_error, ret_pred

def adaboost_train(X, y, num_stump):
    stumps = []
    m = X.shape[0]
    # Start with equal sample weights.
    weight = np.ones((m, 1)) / m
    for i in range(num_stump):
        best_stump, err, pred = build_stump(X, y, weight)
        # Classifier weight: alpha = 0.5 * ln((1 - err) / err).
        alpha = 0.5 * np.log((1.0 - err) / max(err, 1e-10))
        best_stump['alpha'] = alpha
        stumps.append(best_stump)
        # Scale down correctly classified samples, scale up misclassified ones.
        for j in range(m):
            weight[j] = weight[j] * (np.exp(-alpha) if pred[j] == y[j] else np.exp(alpha))
        weight = weight / weight.sum()
        if err < 1e-8:
            break
    return stumps
```

```
def adaboost_classify(X, stumps):
    # Weighted vote of all stumps; the labels here are 0/1, so we compare
    # the alpha-weighted average of the votes against 0.5.
    m = X.shape[0]
    pred = np.zeros((m, 1))
    alpha_sum = 0.0
    for stump in stumps:
        y_pred = stump_classify(X, stump['idx'], stump['threshold'], stump['comparator'])
        pred += y_pred.reshape((-1, 1)) * stump['alpha']
        alpha_sum += stump['alpha']
    pred /= alpha_sum
    return (pred > 0.5).astype(int)
```
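As a sanity check on the hand-rolled version, scikit-learn ships its own `AdaBoostClassifier`, which can be run on the same split for comparison. A minimal sketch (the hyperparameters here are just defaults, not tuned):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=23)

# 50 weak learners (decision stumps by default), combined by weighted vote.
clf = AdaBoostClassifier(n_estimators=50, random_state=0)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))
```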