LightGBM parameter analysis and hyperparameter search for regression


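Reference: the LightGBM parameter documentation on GitHub: https://github.com/Microsoft/LightGBM/blob/master/docs/Parameters.rst

For the source of the code, see my other blog post: https://blog.csdn.net/ssswill/article/details/85217702

Grid search for hyperparameters: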

    import numpy as np
    import pandas as pd
    import lightgbm as lgb
    from sklearn.model_selection import GridSearchCV, KFold
    from sklearn.metrics import mean_squared_error

    # train_X, train_y and test_X are pandas objects prepared beforehand.

    # Search space for the grid search.
    hyper_space = {'n_estimators': [1000, 1500, 2000, 2500],
                   'max_depth': [4, 5, 8, -1],
                   'num_leaves': [15, 31, 63, 127],
                   'subsample': [0.6, 0.7, 0.8, 1.0],
                   'colsample_bytree': [0.6, 0.7, 0.8, 1.0],
                   'learning_rate': [0.01, 0.02, 0.03]}

    est = lgb.LGBMRegressor(n_jobs=-1, random_state=2018)
    gs = GridSearchCV(est, hyper_space, scoring='r2', cv=4, verbose=1)
    gs_results = gs.fit(train_X, train_y)
    print("BEST PARAMETERS: " + str(gs_results.best_params_))
    print("BEST CV SCORE: " + str(gs_results.best_score_))

    # Parameters for the native training API.
    # Note: subsample (bagging_fraction) only takes effect when bagging_freq > 0.
    lgb_params = {"objective": "regression", "metric": "rmse",
                  "max_depth": 7, "min_child_samples": 20,
                  "reg_alpha": 1, "reg_lambda": 1,
                  "num_leaves": 64, "learning_rate": 0.01,
                  "subsample": 0.8, "colsample_bytree": 0.8,
                  "verbosity": -1}

    FOLDs = KFold(n_splits=5, shuffle=True, random_state=42)
    oof_lgb = np.zeros(len(train_X))          # out-of-fold predictions
    predictions_lgb = np.zeros(len(test_X))   # test predictions averaged over folds
    features_lgb = list(train_X.columns)
    feature_importance_df_lgb = pd.DataFrame()

    for fold_, (trn_idx, val_idx) in enumerate(FOLDs.split(train_X)):
        trn_data = lgb.Dataset(train_X.iloc[trn_idx], label=train_y.iloc[trn_idx])
        val_data = lgb.Dataset(train_X.iloc[val_idx], label=train_y.iloc[val_idx])

        print("-" * 20 + "LGB Fold:" + str(fold_) + "-" * 20)
        num_round = 10000
        # LightGBM 4.0 removed the verbose_eval and early_stopping_rounds
        # arguments of lgb.train; the callbacks below replace them.
        clf = lgb.train(lgb_params, trn_data, num_round,
                        valid_sets=[trn_data, val_data],
                        callbacks=[lgb.early_stopping(stopping_rounds=50),
                                   lgb.log_evaluation(period=1000)])
        oof_lgb[val_idx] = clf.predict(train_X.iloc[val_idx],
                                       num_iteration=clf.best_iteration)

        # Collect the feature importances of this fold.
        fold_importance_df_lgb = pd.DataFrame()
        fold_importance_df_lgb["feature"] = features_lgb
        fold_importance_df_lgb["importance"] = clf.feature_importance()
        fold_importance_df_lgb["fold"] = fold_ + 1
        feature_importance_df_lgb = pd.concat(
            [feature_importance_df_lgb, fold_importance_df_lgb], axis=0)

        predictions_lgb += clf.predict(
            test_X, num_iteration=clf.best_iteration) / FOLDs.n_splits

    print("Best RMSE: ", np.sqrt(mean_squared_error(train_y, oof_lgb)))

Output: [image]

Let's get started and take the code apart piece by piece. First, the hyperparameter dictionary:

    hyper_space = {'n_estimators': [1000, 1500, 2000, 2500],
                   'max_depth': [4, 5, 8, -1],
                   'num_leaves': [15, 31, 63, 127],
                   'subsample': [0.6, 0.7, 0.8, 1.0],
                   'colsample_bytree': [0.6, 0.7, 0.8, 1.0],
                   'learning_rate': [0.01, 0.02, 0.03]}
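Before running a grid like this, it helps to know how many fits it implies. The short sketch below counts the combinations in the dictionary above and multiplies by the cv=4 used in the grid search; the variable names n_combinations and total_fits are mine, not part of scikit-learn.

```python
from math import prod

hyper_space = {'n_estimators': [1000, 1500, 2000, 2500],
               'max_depth': [4, 5, 8, -1],
               'num_leaves': [15, 31, 63, 127],
               'subsample': [0.6, 0.7, 0.8, 1.0],
               'colsample_bytree': [0.6, 0.7, 0.8, 1.0],
               'learning_rate': [0.01, 0.02, 0.03]}

# A full grid tries every combination: the product of the list lengths.
n_combinations = prod(len(values) for values in hyper_space.values())

# Each combination is fitted once per cross-validation fold.
cv_folds = 4
total_fits = n_combinations * cv_folds

print(n_combinations)  # 3072
print(total_fits)      # 12288
```

With at least 1000 trees per fit, this grid is expensive. Narrowing the lists, or sampling the space with RandomizedSearchCV, is a common way to cut the cost.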

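"n_estimators": [image]

The figure shows that n_estimators is an alias of num_iterations, with a default of 100. It is the number of boosting iterations, in other words the number of trees. The documentation adds a note: for multi-class classification, the number of trees actually built is the number of classes times the number of iterations you set.

"max_depth": the maximum depth of a tree. [image] -1 means no limit.

"num_leaves": the maximum number of leaves in one tree.

"subsample": an alias of bagging_fraction. The documentation lists its benefits: it speeds up training, helps against overfitting, and so on. It also says the parameter is similar to feature_fraction, except that it samples rows instead of features. Note that bagging only takes effect when bagging_freq (subsample_freq in the scikit-learn API) is set to a nonzero value. So what is feature_fraction? [image] As it turns out, it works much like the feature sampling in a random forest. Image from: https://www.cnblogs.com/harvey888/p/6512312.html

"colsample_bytree": [image] This is just an alias of feature_fraction, the parameter described above.

"learning_rate": the learning rate, which needs little explanation.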
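To make the two sampling parameters concrete, here is a conceptual sketch of what subsample and colsample_bytree select for a single tree: a random subset of rows and a random subset of feature columns. This only illustrates the idea and is not LightGBM's internal implementation; the function name sample_rows_and_columns and the data sizes are made up for the example.

```python
import random

def sample_rows_and_columns(n_rows, n_cols, subsample, colsample_bytree, seed):
    """Pick the rows and columns one tree would be trained on (conceptual only)."""
    rng = random.Random(seed)
    n_rows_kept = max(1, round(n_rows * subsample))
    n_cols_kept = max(1, round(n_cols * colsample_bytree))
    # Sampling without replacement, as the LightGBM docs describe for bagging.
    rows = sorted(rng.sample(range(n_rows), n_rows_kept))
    cols = sorted(rng.sample(range(n_cols), n_cols_kept))
    return rows, cols

# Hypothetical data set: 1000 rows, 20 features.
rows, cols = sample_rows_and_columns(1000, 20, subsample=0.8,
                                     colsample_bytree=0.6, seed=0)
print(len(rows), len(cols))  # 800 12
```

Because each tree sees only part of the data, the trees are less correlated with one another, which is where the resistance to overfitting comes from.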

Now let's look at these two lines of code:

    est = lgb.LGBMRegressor(n_jobs=-1, random_state=2018)
    gs = GridSearchCV(est, hyper_space, scoring='r2', cv=4, verbose=1)

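[image]

"n_jobs": the number of parallel jobs. This matters a great deal for ensemble algorithms, especially bagging, whose members can be trained in parallel to improve performance. Boosting is harder to parallelize, because each iteration depends on the ones before it. 1 means no parallelism, n means n parallel jobs, and -1 starts as many jobs as the CPU has cores.

"random_state": [image]

"verbose": controls how verbose the output is. There is no need to worry about it; the default is fine. It produces log output like this: [image]

The message means: because LightGBM grows trees leaf-wise, tree complexity is tuned with num_leaves rather than max_depth. The rough conversion is num_leaves = 2^(max_depth). The value should be set below 2^(max_depth), otherwise the model may overfit.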
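The rule of thumb above can be checked against the grid used in this post. The sketch below lists which (max_depth, num_leaves) pairs keep num_leaves below 2^(max_depth); the helper names are hypothetical, and max_depth = -1 is treated as "no depth limit, so no cap".

```python
def leaf_cap(max_depth):
    """Return 2**max_depth, or None when the depth is unlimited (<= 0)."""
    if max_depth <= 0:
        return None
    return 2 ** max_depth

def is_consistent(num_leaves, max_depth):
    """True if num_leaves stays below the cap implied by max_depth."""
    cap = leaf_cap(max_depth)
    return cap is None or num_leaves < cap

max_depths = [4, 5, 8, -1]
leaf_counts = [15, 31, 63, 127]

valid_pairs = [(d, n) for d in max_depths for n in leaf_counts
               if is_consistent(n, d)]

for d, n in valid_pairs:
    print(f"max_depth={d:>2}  num_leaves={n}")
print(len(valid_pairs), "of", len(max_depths) * len(leaf_counts), "pairs")
```

Of the 16 pairs in the grid, 11 pass the check. The remaining 5 (for example max_depth=4 with num_leaves=31) ask for more leaves than a tree of that depth can hold, so they spend search time without adding distinct models.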

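Below is a similar explanation, although it is not about LightGBM's parameters. [image]

As for the second block of code, the analysis of training, validation and plotting is left to the next post. This one is already long enough to be tiring to read.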
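Until then, the mechanics of the fold loop in the second block can be sketched without LightGBM. In the example below a stand-in "model" that predicts the mean of the training-fold targets replaces lgb.train, and the data are synthetic; the helper kfold_indices is a hypothetical, simplified substitute for sklearn's KFold. What it shows is how the out-of-fold array gets filled and how the test predictions are averaged across folds.

```python
import random
from statistics import mean

def kfold_indices(n_samples, n_splits, seed):
    """Yield (train_idx, val_idx) pairs from a shuffled K-fold split."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::n_splits] for i in range(n_splits)]
    for k in range(n_splits):
        val_idx = folds[k]
        trn_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield trn_idx, val_idx

# Synthetic targets; in the post these would be train_y.
train_y = [float(i % 7) for i in range(50)]
n_test = 10
n_splits = 5

oof = [None] * len(train_y)   # one out-of-fold prediction per training row
test_pred = [0.0] * n_test    # test predictions averaged over folds
fold_means = []

for trn_idx, val_idx in kfold_indices(len(train_y), n_splits, seed=42):
    # Stand-in model: predict the mean target of the training fold.
    fold_mean = mean(train_y[i] for i in trn_idx)
    fold_means.append(fold_mean)
    # Each row is predicted only by the model that did not see it.
    for i in val_idx:
        oof[i] = fold_mean
    # Every fold contributes 1/n_splits of the final test prediction.
    for j in range(n_test):
        test_pred[j] += fold_mean / n_splits

rmse = (sum((p - y) ** 2 for p, y in zip(oof, train_y)) / len(train_y)) ** 0.5
print(f"OOF RMSE: {rmse:.4f}")
```

The RMSE computed from the out-of-fold array is an estimate of generalization error, because no row was scored by a model trained on it.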

Addendum, 2018-12-27: [image]
