XGBoost: Key Parameters, Methods, and Functions, with a Tuning Strategy (Examples Included)



2024-07-01 13:42 | Source: compiled from the web

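Contents

I. The native xgboost interface
  Key parameters
  Training parameters
  Prediction functions
  Plotting feature importance
  Classification example
  Regression example
II. The sklearn-style xgboost interface
  XGBClassifier: basic usage, example
  XGBRegressor: basic usage, example
III. A tuning strategy for xgboost

xgboost ships with two interfaces, a native one and a sklearn-style one, and both support classification and regression. For the theory behind the algorithm, see the earlier article "XGBoost算法的相关知识" (Background on the XGBoost Algorithm).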

I. The native xgboost interface

Key parameters

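1. booster

Selects the type of weak learner. The default is 'gbtree', which uses tree-based models; 'gblinear' uses linear models as the weak learners instead.

'gbtree' is the recommended setting, and every parameter described below assumes booster is set to 'gbtree'.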

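2. eta / learning_rate

As covered in the earlier article on the XGBoost algorithm, XGBoost borrows the idea of "shrinkage" to guard against overfitting: it does not fully trust the residual fitted by any single weak learner. The output of each weak learner is therefore multiplied by eta, which lies in (0, 1]. A smaller eta leaves more of the residual unexplained at each step, so more weak learners are needed to make up the difference. In the native interface eta defaults to 0.3.

In XGBClassifier and XGBRegressor the parameter is named learning_rate.

Recommended candidates: [0.01, 0.015, 0.025, 0.05, 0.1]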
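The trade-off between eta and the number of weak learners can be sketched in a few lines. This is an idealised illustration, not XGBoost's training loop: it assumes every weak learner predicts the current residual exactly, and the helper name rounds_to_fit is made up for this example.

```python
def rounds_to_fit(eta, tolerance=0.01):
    """Count boosting rounds until the remaining residual drops below `tolerance`.

    Idealised setting (an assumption of this sketch): each weak learner
    predicts the current residual exactly, and only a fraction `eta`
    of that prediction is added to the model.
    """
    residual = 1.0  # start with the whole target unexplained
    rounds = 0
    while residual > tolerance:
        residual -= eta * residual  # the model absorbs eta * residual
        rounds += 1
    return rounds


for eta in (1.0, 0.3, 0.1, 0.01):
    print(f"eta={eta:<5} rounds needed: {rounds_to_fit(eta)}")
```

Under these assumptions eta = 0.3 needs 13 rounds to explain 99% of the target, while eta = 0.1 needs 44, which is why a lower learning rate is usually paired with a larger number of boosting rounds.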

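3. gamma

The minimum loss reduction required to split a leaf node further. The default is 0; the larger the value, the more conservative the model.

Recommended candidates: [0, 0.05 ~ 0.1, 0.3, 0.5, 0.7, 0.9, 1]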
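To see how gamma acts as a threshold, here is a minimal sketch of the split-gain formula from the XGBoost paper. The function name split_gain and the gradient and hessian sums in the usage lines are invented for illustration; the real library computes these sums from the training data.

```python
def split_gain(g_left, h_left, g_right, h_right, reg_lambda=1.0, gamma=0.0):
    """Loss reduction from splitting one leaf into two, net of the gamma penalty.

    g_* and h_* are the sums of first- and second-order gradients of the
    samples that fall into the left and right child.
    """
    def score(g, h):
        return g * g / (h + reg_lambda)

    parent = score(g_left + g_right, h_left + h_right)
    return 0.5 * (score(g_left, h_left) + score(g_right, h_right) - parent) - gamma


# Made-up sums for one candidate split
print(split_gain(-4.0, 4.0, 3.0, 3.0, gamma=0.0))  # positive: the split is worth making
print(split_gain(-4.0, 4.0, 3.0, 3.0, gamma=3.0))  # negative: gamma vetoes the same split
```

With gamma = 0 the candidate split has a gain of 2.6625 and is kept; with gamma = 3 the net gain becomes negative, so the same split is rejected.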

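4. alpha / reg_alpha

Weight of the L1 regularization term on the leaf weights; the default is 0. Increasing it makes the model more conservative.

In XGBClassifier and XGBRegressor the parameter is named reg_alpha.

Recommended candidates: [0, 0.01~0.1, 1]

5. lambda / reg_lambda

Weight of the L2 regularization term on the leaf weights; the default is 1. Increasing it makes the model more conservative.

In XGBClassifier and XGBRegressor the parameter is named reg_lambda.

Recommended candidates: [0, 0.1, 0.5, 1]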
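The effect of alpha and lambda on a single leaf can be sketched as follows. This is a simplified illustration of the regularised optimal leaf weight (L1 as soft-thresholding of the gradient sum, L2 as an addition to the hessian sum); the function name leaf_weight and the numbers are assumptions made for this example.

```python
def leaf_weight(grad_sum, hess_sum, reg_alpha=0.0, reg_lambda=1.0):
    """Regularised optimal weight of a leaf.

    alpha soft-thresholds the gradient sum (pulling small sums to zero),
    lambda inflates the denominator (shrinking every weight).
    """
    if grad_sum > reg_alpha:
        shrunk = grad_sum - reg_alpha
    elif grad_sum < -reg_alpha:
        shrunk = grad_sum + reg_alpha
    else:
        shrunk = 0.0
    return -shrunk / (hess_sum + reg_lambda)


# Made-up leaf with gradient sum -6 and hessian sum 5
print(leaf_weight(-6.0, 5.0, reg_alpha=0.0, reg_lambda=0.0))   # no regularisation
print(leaf_weight(-6.0, 5.0, reg_alpha=0.0, reg_lambda=1.0))   # L2 only
print(leaf_weight(-6.0, 5.0, reg_alpha=2.0, reg_lambda=1.0))   # L1 and L2
print(leaf_weight(-6.0, 5.0, reg_alpha=10.0, reg_lambda=1.0))  # L1 large enough to zero the leaf
```

Both penalties move the leaf weight toward zero, which is the sense in which larger values make the model more conservative.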

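6. max_depth

The maximum depth of each tree; the default is 6. A sensible limit helps prevent overfitting.

Recommended candidates: [3, 5, 6, 7, 9, 12, 15, 17, 25]

7. min_child_weight

The minimum sum of instance weights (more precisely, of the second-order gradients, or hessians) required in a child node; the default is 1. If a split would produce a child whose sum falls below min_child_weight, the split is abandoned and that node stops growing.

Recommended candidates: [1, 3, 5, 7]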
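Because min_child_weight is compared against a sum of hessians, its meaning depends on the objective. The sketch below assumes unit sample weights and uses hypothetical helper names: with squared error each sample contributes 1, so the sum is just a sample count, while with logistic loss each sample contributes p * (1 - p).

```python
import math


def hessian_sum_squared_error(n_samples):
    """Squared error: the hessian is 1 per sample, so the sum is the sample count."""
    return float(n_samples)


def hessian_sum_logistic(raw_scores):
    """Logistic loss: each sample contributes p * (1 - p), with p = sigmoid(raw score)."""
    total = 0.0
    for score in raw_scores:
        p = 1.0 / (1.0 + math.exp(-score))
        total += p * (1.0 - p)
    return total


def child_allowed(hessian_sum, min_child_weight=1.0):
    """A child node is acceptable only if its hessian sum reaches the threshold."""
    return hessian_sum >= min_child_weight


# Five samples are enough for a regression leaf ...
print(child_allowed(hessian_sum_squared_error(5)))
# ... but five confidently classified samples (raw score 4.0) carry very little weight.
print(hessian_sum_logistic([4.0] * 5), child_allowed(hessian_sum_logistic([4.0] * 5)))
```

This is why the same min_child_weight can be far more restrictive in classification than in regression.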

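8. subsample

The fraction of training rows sampled to fit each weak learner, so each one sees roughly subsample * n_samples rows; the default is 1. Unlike random forests, which sample with replacement, this sampling is done without replacement. Valid values lie in (0, 1], and 1 means every weak learner is trained on all the data. A value below 1 reduces variance and so helps against overfitting, but it also increases bias, so it should not be set too low.

Recommended candidates: [0.6, 0.7, 0.8, 0.9, 1]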
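The difference between the two sampling schemes can be illustrated with the standard library alone. This is a conceptual sketch with made-up helper names; XGBoost's internal sampler may differ in detail.

```python
import random


def subsample_rows(n_samples, subsample, seed=0):
    """Boosting-style row sampling: draw a fraction of the rows WITHOUT replacement."""
    rng = random.Random(seed)
    k = int(subsample * n_samples)
    return rng.sample(range(n_samples), k)


def bootstrap_rows(n_samples, seed=0):
    """Random-forest-style bootstrap: draw n_samples rows WITH replacement."""
    rng = random.Random(seed)
    return rng.choices(range(n_samples), k=n_samples)


rows = subsample_rows(10, 0.8)
print(sorted(rows), "unique:", len(set(rows)))   # 8 distinct rows
boot = bootstrap_rows(10)
print(sorted(boot), "unique:", len(set(boot)))   # 10 draws, repeats are possible
```

Sampling without replacement never shows the same row twice to one weak learner, whereas a bootstrap sample typically repeats some rows and omits others.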

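9. colsample_bytree

The fraction of features randomly sampled when building each weak learner; the default is 1.

Recommended candidates: [0.6, 0.7, 0.8, 0.9, 1]

10. objective

Specifies the learning task and the corresponding objective. Commonly used values:

'reg:linear': linear regression with squared error (the default). Recent releases rename it to 'reg:squarederror' and deprecate the old name.

'reg:logistic': logistic regression.

'binary:logistic': logistic regression for binary classification; the output is a probability.

'multi:softmax': multiclass classification using the softmax function; num_class must also be set to the number of classes.

11. num_class

The number of classes in a multiclass problem.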
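For the multiclass case, the sketch below shows what the softmax step does to one row of raw scores, one score per class (hence num_class). The function names and scores are invented for illustration. XGBoost also offers a 'multi:softprob' objective, which returns the per-class probabilities instead of only the winning label.

```python
import math


def softmax(scores):
    """Turn one row of raw class scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_softprob(raw_scores):
    """What a softprob-style output looks like: one probability per class."""
    return [softmax(row) for row in raw_scores]


def predict_softmax(raw_scores):
    """What a softmax-style output looks like: only the most likely class index."""
    return [max(range(len(p)), key=p.__getitem__) for p in predict_softprob(raw_scores)]


raw = [[2.0, 1.0, 0.1], [0.1, 0.2, 3.0]]  # made-up scores for num_class = 3
print(predict_softprob(raw))
print(predict_softmax(raw))
```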

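12. eval_metric

The evaluation metric or metrics; a list of several metrics may be passed. Common choices:

'rmse': root mean squared error, for regression

'mlogloss': multiclass log loss, for multiclass classification

'error': classification error rate, for binary classification

'auc': area under the ROC curve, for binary classification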
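As a reference for what three of these metrics measure, here are plain-Python versions. They are simplified sketches written for this article (unweighted, no clipping of probabilities), not the library's implementations.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def error(y_true, prob_positive, threshold=0.5):
    """Binary error rate: predictions above the threshold count as class 1."""
    wrong = sum(int(p > threshold) != t for t, p in zip(y_true, prob_positive))
    return wrong / len(y_true)


def mlogloss(y_true, probs):
    """Multiclass log loss: mean negative log-probability of the true class."""
    return -sum(math.log(row[t]) for t, row in zip(y_true, probs)) / len(y_true)


print(rmse([1, 2, 3], [1, 2, 5]))
print(error([1, 0, 1, 0], [0.9, 0.2, 0.4, 0.6]))
print(mlogloss([0, 2], [[0.5, 0.25, 0.25], [0.1, 0.1, 0.8]]))
```

For 'rmse', 'error', and 'mlogloss' lower is better, whereas 'auc' should be maximized.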

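13. silent

A numeric flag that controls whether training messages are printed. The default, 0, prints them; 1 suppresses all output. Recent releases replace silent with verbosity (0 = silent, 1 = warning, 2 = info, 3 = debug).

The recommended setting is 0.

14. seed / random_state

The random seed.

In XGBClassifier and XGBRegressor the parameter is named random_state.

Training parameters

Training goes mainly through xgboost.train. Its parameters and their defaults are:

    xgboost.train(params, dtrain, num_boost_round=10, evals=(), obj=None, feval=None, maximize=False, early_stopping_rounds=None, evals_result=None, verbose_eval=True, xgb_model=None, callbacks=None)

1. params

A dict holding the booster parameters, for example: {'booster': 'gbtree', 'eta': 0.1}

2. dtrain

The training data, built by passing the features and labels to the following constructor:

    dtrain = xgb.DMatrix(data, label=label)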

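3. num_boost_round

The maximum number of boosting iterations; the default is 10.

4. evals

A list of (data, name) pairs to evaluate during training, for example: [(dtrain, 'train'), (dval, 'val')]

5. obj

A custom objective function. It must be twice differentiable, because XGBoost needs both its gradient and its hessian.

6. feval

A custom evaluation function.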

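7. maximize

Whether the evaluation metric should be maximized; the default is False.

8. early_stopping_rounds

Stops training when the evaluation metric has not improved for this many consecutive rounds. The default, None, disables early stopping. When it is set, the returned model gains three attributes:

best_score

best_iteration

best_ntree_limit (older releases only)

Note: evals must be non-empty for early stopping to take effect. If it contains several datasets, the last one is used.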
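The stopping rule itself is simple enough to sketch in plain Python. The helper below is an illustration written for this article, not library code; the list of scores stands in for the metric values recorded on the last dataset in evals.

```python
def early_stop(scores, early_stopping_rounds, maximize=False):
    """Return (best_iteration, best_score, stopped_at) for per-round validation scores."""
    best_iteration, best_score = 0, scores[0]
    for i, score in enumerate(scores):
        improved = score > best_score if maximize else score < best_score
        if i == 0 or improved:
            best_iteration, best_score = i, score
        elif i - best_iteration >= early_stopping_rounds:
            return best_iteration, best_score, i  # patience exhausted: stop here
    return best_iteration, best_score, len(scores) - 1


val_loss = [0.9, 0.7, 0.6, 0.62, 0.61, 0.65, 0.5]  # made-up validation losses
print(early_stop(val_loss, early_stopping_rounds=3))   # stops at round 5, best is round 2
print(early_stop(val_loss, early_stopping_rounds=10))  # never triggers, best is round 6
```

The first call shows the risk of a short patience: training stops before the later improvement at round 6 is ever seen.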

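9. verbose_eval

Either a bool or an integer. If it is an integer, evaluation results are printed once every verbose_eval iterations.

10. xgb_model

A previously trained xgb model to load and continue training from (incremental training).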

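Prediction functions

There are two main functions:

1. predict(data) returns the prediction for each sample.

2. predict_proba(data) returns the probability of each sample belonging to each class. It is provided by the sklearn-style XGBClassifier, not by the native Booster; with the native interface, per-class probabilities come from predict when the objective is 'multi:softprob'.

Note: with the native interface, data must first be wrapped in a DMatrix.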

Plotting feature importance

The code is as follows:

    import matplotlib.pyplot as plt
    from xgboost import plot_importance

    # Plot the important features; model is a trained xgb model
    plot_importance(model)
    plt.show()

Classification example

    from sklearn.datasets import load_iris
    import xgboost as xgb
    from xgboost import plot_importance
    import matplotlib.pyplot as plt
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score

    # Load the iris dataset
    iris = load_iris()
    X, y = iris.data, iris.target

    # Split into training and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123457)

    # Parameters
    params = {
        'booster': 'gbtree',
        'objective': 'multi:softmax',
        'num_class': 3,
        'gamma': 0.1,
        'max_depth': 6,
        'lambda': 2,
        'subsample': 0.7,
        'colsample_bytree': 0.7,
        'min_child_weight': 3,
        'verbosity': 0,  # suppress training messages ('silent': 1 in older releases)
        'eta': 0.1
    }

    # Build the training set
    dtrain = xgb.DMatrix(X_train, label=y_train)
    num_rounds = 500

    # Train the xgboost model
    model = xgb.train(params, dtrain, num_rounds)

    # Predict on the test set
    dtest = xgb.DMatrix(X_test)
    y_pred = model.predict(dtest)

    # Compute the accuracy
    accuracy = accuracy_score(y_test, y_pred)
    print('accuracy: %.2f%%' % (accuracy * 100))

    # Plot the important features
    plot_importance(model)
    plt.show()

Output:

    accuracy: 93.33%

[Figure: feature importance plot]

Regression example

    import xgboost as xgb
    from xgboost import plot_importance
    from matplotlib import pyplot as plt
    from sklearn.model_selection import train_test_split
    from sklearn.datasets import load_boston  # removed in scikit-learn 1.2; needs an older release
    from sklearn.metrics import mean_squared_error

    # Load the Boston housing dataset
    boston = load_boston()
    X, y = boston.data, boston.target

    # Split into training and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    # Parameters
    params = {
        'booster': 'gbtree',
        'objective': 'reg:gamma',
        'gamma': 0.1,
        'max_depth': 5,
        'lambda': 3,
        'subsample': 0.7,
        'colsample_bytree': 0.7,
        'min_child_weight': 3,
        'verbosity': 0,  # suppress training messages ('silent': 1 in older releases)
        'eta': 0.1,
        'seed': 1000,
        'nthread': 4,
    }

    dtrain = xgb.DMatrix(X_train, label=y_train)
    num_rounds = 300
    model = xgb.train(params, dtrain, num_rounds)

    # Predict on the test set
    dtest = xgb.DMatrix(X_test)
    ans = model.predict(dtest)
    print('mse:', mean_squared_error(y_test, ans))

    # Plot the important features
    plot_importance(model)
    plt.show()

Output:

    mse: 25.48099643587081

[Figure: feature importance plot]

II. The sklearn-style xgboost interface

XGBClassifier

Basic usage

XGBClassifier is imported as shown below, together with the default values of its key parameters:

    from xgboost import XGBClassifier

    # Key parameters:
    xgb_model = XGBClassifier(
        max_depth=3,
        learning_rate=0.1,
        n_estimators=100,  # number of weak learners
        objective='binary:logistic',
        booster='gbtree',
        gamma=0,
        min_child_weight=1,
        max_delta_step=0,
        subsample=1,
        colsample_bytree=1,
        reg_alpha=0,
        reg_lambda=1,
        random_state=0  # random seed
    )

Most of these parameters were explained above and are not repeated here.

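Unlike the native interface, XGBClassifier is trained with the fit method rather than train:

    xgb_model.fit(
        X,                           # array or DataFrame
        y,                           # array or Series
        eval_set=None,               # datasets to evaluate on, e.g. [(X_train, y_train), (X_test, y_test)]
        eval_metric=None,            # evaluation metric as a string, e.g. 'mlogloss'
        early_stopping_rounds=None,
        verbose=True,                # True prints every round; an integer n prints every n rounds
        xgb_model=None
    )

In recent releases, eval_metric and early_stopping_rounds are passed to the XGBClassifier constructor instead of fit.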

There are two prediction methods:

    xgb_model.predict(data)        # returns the predicted values
    xgb_model.predict_proba(data)  # returns each sample's probability for each class

Example

    from xgboost import XGBClassifier
    from sklearn.datasets import load_iris
    from xgboost import plot_importance
    import matplotlib.pyplot as plt
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score

    # Load the sample dataset
    iris = load_iris()
    X, y = iris.data, iris.target
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=12343)

    model = XGBClassifier(
        max_depth=3,
        learning_rate=0.1,
        n_estimators=100,  # number of weak learners
        objective='multi:softmax',
        num_class=3,
        booster='gbtree',
        gamma=0,
        min_child_weight=1,
        max_delta_step=0,
        subsample=1,
        colsample_bytree=1,
        reg_alpha=0,
        reg_lambda=1,
        random_state=0  # random seed
    )

    # eval_metric and early_stopping_rounds are fit arguments in the release used here;
    # recent releases expect them in the constructor
    model.fit(X_train, y_train,
              eval_set=[(X_train, y_train), (X_test, y_test)],
              eval_metric='mlogloss',
              verbose=50,
              early_stopping_rounds=50)

    # Predict on the test set
    y_pred = model.predict(X_test)

    # Compute the accuracy
    accuracy = accuracy_score(y_test, y_pred)
    print('accuracy:%.0f%%' % (accuracy * 100))

    # Plot the important features
    plot_importance(model)
    plt.show()

Output:

    [0] validation_0-mlogloss:0.967097 validation_1-mlogloss:0.971479
    Multiple eval metrics have been passed: 'validation_1-mlogloss' will be used for early stopping.

    Will train until validation_1-mlogloss hasn't improved in 50 rounds.
    [50] validation_0-mlogloss:0.035594 validation_1-mlogloss:0.204737
    Stopping. Best iteration:
    [32] validation_0-mlogloss:0.073909 validation_1-mlogloss:0.182504

    accuracy:97%

[Figure: feature importance plot]

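XGBRegressor

Basic usage

XGBRegressor is similar to XGBClassifier. It is imported as shown below, together with the default values of its key parameters:

    from xgboost import XGBRegressor

    # Key parameters
    xgb_model = XGBRegressor(
        max_depth=3,
        learning_rate=0.1,
        n_estimators=100,
        objective='reg:linear',  # this default differs from XGBClassifier; named 'reg:squarederror' in recent releases
        booster='gbtree',
        gamma=0,
        min_child_weight=1,
        subsample=1,
        colsample_bytree=1,
        reg_alpha=0,
        reg_lambda=1,
        random_state=0
    )

Its fit and predict methods are almost identical to those of XGBClassifier and are not described again.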

Example

    import xgboost as xgb
    from xgboost import plot_importance
    import matplotlib.pyplot as plt
    from sklearn.model_selection import train_test_split
    from sklearn.datasets import load_boston  # removed in scikit-learn 1.2; needs an older release
    from sklearn.metrics import mean_squared_error

    # Load the dataset
    boston = load_boston()
    X, y = boston.data, boston.target
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    model = xgb.XGBRegressor(
        max_depth=3,
        learning_rate=0.1,
        n_estimators=100,
        objective='reg:linear',  # this default differs from XGBClassifier; named 'reg:squarederror' in recent releases
        booster='gbtree',
        gamma=0,
        min_child_weight=1,
        subsample=1,
        colsample_bytree=1,
        reg_alpha=0,
        reg_lambda=1,
        random_state=0
    )

    # eval_metric and early_stopping_rounds are fit arguments in the release used here;
    # recent releases expect them in the constructor
    model.fit(X_train, y_train,
              eval_set=[(X_train, y_train), (X_test, y_test)],
              eval_metric='rmse',
              verbose=50,
              early_stopping_rounds=50)

    # Predict on the test set
    ans = model.predict(X_test)
    mse = mean_squared_error(y_test, ans)
    print('mse:', mse)

    # Plot the important features
    plot_importance(model)
    plt.show()

Output:

    [0] validation_0-rmse:21.687 validation_1-rmse:21.3558
    [50] validation_0-rmse:1.8122 validation_1-rmse:4.8143
    [99] validation_0-rmse:1.3396 validation_1-rmse:4.63377
    mse: 21.471843729261288

[Figure: feature importance plot]

III. A tuning strategy for xgboost

(1) Start with a relatively high learning rate, such as 0.1, to keep each tuning run short.

(2) Then tune max_depth, min_child_weight, gamma, subsample, and colsample_bytree. Suitable candidates for these parameters are:

max_depth: [3, 5, 6, 7, 9, 12, 15, 17, 25]

min_child_weight: [1, 3, 5, 7]

gamma: [0, 0.05 ~ 0.1, 0.3, 0.5, 0.7, 0.9, 1]

subsample: [0.6, 0.7, 0.8, 0.9, 1]

colsample_bytree: [0.6, 0.7, 0.8, 0.9, 1]

(3) Tune the regularization parameters lambda and alpha. Suitable candidates are:

alpha: [0, 0.01~0.1, 1]

lambda: [0, 0.1, 0.5, 1]

(4) Lower the learning rate and continue tuning. Suitable candidates for the learning rate are: [0.01, 0.015, 0.025, 0.05, 0.1]

