[Machine Learning] Tianchi Competition: Loan Default Prediction for Financial Risk Control


2023-12-31 17:44 | Source: compiled from the web | Views: 265

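1. Introduction

1.1 Competition background

The task is set in the context of personal credit in financial risk control. Participants must predict, from a loan applicant's data, whether the applicant is likely to default, and that prediction decides whether the loan is approved. This is a typical classification problem.

Task: predict whether a user will default on a loan.

Competition page: https://tianchi.aliyun.com/competition/entrance/531830/introduction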

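1.2 Competition data

The data comes from the loan records of a credit platform. There are more than 1.2 million records with 47 columns of variables, 15 of which are anonymized.

To keep the competition fair, 800,000 records are drawn as the training set, 200,000 as test set A and 200,000 as test set B. Fields such as employmentTitle, purpose, postCode and title are desensitized.

The dataset consists of three downloadable files:

- train.csv: training set
- test.csv: test set (read as testA.csv in the code below)
- sample_submit.csv: sample submission format

[Figures: train.csv data; testA.csv data]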

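Field table

| Field | Description |
|---|---|
| id | Unique identifier assigned to the loan listing |
| loanAmnt | Loan amount |
| term | Loan term (years) |
| interestRate | Loan interest rate |
| installment | Installment payment amount |
| grade | Loan grade |
| subGrade | Loan sub-grade |
| employmentTitle | Employment title |
| employmentLength | Employment length (years) |
| homeOwnership | Home-ownership status provided by the borrower at registration |
| annualIncome | Annual income |
| verificationStatus | Verification status |
| issueDate | Month in which the loan was issued |
| purpose | Loan purpose category provided by the borrower in the application |
| postCode | First 3 digits of the postal code provided by the borrower in the application |
| regionCode | Region code |
| dti | Debt-to-income ratio |
| delinquency_2years | Number of delinquency events 30+ days past due in the borrower's credit file over the past 2 years |
| ficoRangeLow | Lower bound of the borrower's FICO range at loan issuance |
| ficoRangeHigh | Upper bound of the borrower's FICO range at loan issuance |
| openAcc | Number of open credit lines in the borrower's credit file |
| pubRec | Number of derogatory public records |
| pubRecBankruptcies | Number of public-record bankruptcies |
| revolBal | Total revolving credit balance |
| revolUtil | Revolving line utilization rate, i.e. the amount of credit the borrower is using relative to all available revolving credit |
| totalAcc | Total number of credit lines currently in the borrower's credit file |
| initialListStatus | Initial listing status of the loan |
| applicationType | Whether the loan is an individual application or a joint application with two co-borrowers |
| earliesCreditLine | Month in which the borrower's earliest reported credit line was opened |
| title | Loan title provided by the borrower |
| policyCode | Publicly available: policy_code=1; new products not publicly available: policy_code=2 |
| n-series anonymous features | Anonymous features n0-n14, processed counts of borrower behaviour |

1.3 Evaluation metric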

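The submission gives, for each test sample, the probability that it is 1, i.e. the probability that y = 1.

Models are evaluated by AUC (the larger the better).

Note: AUC (Area Under Curve) is defined as the area enclosed by the ROC curve and the coordinate axes.

For details, see "[Machine Learning] Common Evaluation Metrics for Classification Algorithms" and "Machine Learning: Evaluation Metrics".

Besides the required metric, other evaluation metrics for binary classification include precision, recall, the ROC curve and the F-score.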
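The AUC definition can be checked numerically on a tiny example. The following is a minimal sketch, assuming hypothetical toy labels and scores: it uses the equivalent reading of AUC as the share of (positive, negative) pairs in which the positive sample receives the higher score, with ties counted as one half. The function name `auc_by_pairs` is illustrative and not part of the competition code.

```python
import numpy as np

def auc_by_pairs(y_true, y_score):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    # Compare every positive score with every negative score.
    diff = pos[:, None] - neg[None, :]
    correct = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return float(correct / (len(pos) * len(neg)))

# Toy data (hypothetical): 1 = default, 0 = no default.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
toy_auc = auc_by_pairs(y_true, y_score)
print(toy_auc)
```

The pairwise comparison needs memory proportional to the number of positives times the number of negatives, so it only suits small illustrations; on the full data, `sklearn.metrics.roc_auc_score` (used later in this article) is the practical choice.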

1.4 Overall workflow

The main steps of the analysis are as follows (the figure is not reproduced here).

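2. Exploratory Data Analysis (EDA)

Exploratory data analysis is a first pass over the data: understanding the features, checking the data types and examining the distributions. It matters a great deal for the feature engineering and modelling that follow.

For example:

- Analyse the meaning, distribution and missing values of each field: what the field represents, what its type is, what its value space is, and what each value means.
- Look at the overall distribution of each field and how it is distributed in the training set versus the test set.
- Look at the proportion of missing values per field, in the training set and in the test set.
- Analyse the relationship between each field and the competition label.
- Analyse the relationships between pairs of fields, or among several fields.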
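A few of these checks can be sketched with pandas. The frames below are hypothetical toy data that only borrow column names from the field table, and the helper names `missing_ratio` and `default_rate_by` are illustrative assumptions rather than functions from the original article.

```python
import numpy as np
import pandas as pd

def missing_ratio(df):
    """Share of missing values per column, largest first."""
    return df.isnull().mean().sort_values(ascending=False)

def default_rate_by(df, feature, label="isDefault"):
    """Mean of the binary label for each value of a feature."""
    return df.groupby(feature)[label].mean()

# Toy frames (hypothetical values).
toy_train = pd.DataFrame({
    "grade": ["A", "B", "B", "C"],
    "dti": [10.0, np.nan, 25.0, 30.0],
    "isDefault": [0, 0, 1, 1],
})
toy_test = pd.DataFrame({"grade": ["A", "C", "D"], "dti": [12.0, 18.0, np.nan]})

train_missing = missing_ratio(toy_train)
rate_by_grade = default_rate_by(toy_train, "grade")
# Values that occur in the test set but never in the training set.
unseen_grades = set(toy_test["grade"]) - set(toy_train["grade"])
print(train_missing, rate_by_grade, unseen_grades, sep="\n")
```

On the real data the same calls would be made on `train` and `test`; categories that appear only in the test set are worth noting before any encoding step.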

Image source: https://zhuanlan.zhihu.com/p/259788410 (the image is not reproduced here)

First, import the required modules:

    import warnings
    warnings.filterwarnings("ignore")

    import numpy as np
    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt
    import statsmodels.formula.api as smf

    from sklearn.preprocessing import LabelEncoder, MinMaxScaler
    from sklearn.feature_selection import SelectKBest, chi2
    from sklearn.model_selection import train_test_split
    from sklearn.model_selection import StratifiedKFold, KFold

    import xgboost as xgb
    import lightgbm as lgb
    from catboost import CatBoostRegressor

    # Evaluation metrics
    from sklearn.metrics import accuracy_score, f1_score, roc_auc_score, log_loss

    # Use a font that can render Chinese labels and display minus signs correctly
    plt.rcParams["font.sans-serif"] = ["SimHei"]
    plt.rcParams["axes.unicode_minus"] = False

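Read the data with pandas, both the training set and the test set.

Load the datasets (if they are too large, their memory footprint can be reduced first):

    train = pd.read_csv('train.csv')
    test = pd.read_csv('testA.csv')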
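The memory reduction mentioned for oversized datasets is not shown in the source. One common approach, sketched here as an assumption rather than the author's method, is to downcast numeric columns to the smallest dtype that still holds their values. The function name `reduce_mem_usage` and the toy frame are hypothetical.

```python
import numpy as np
import pandas as pd

def reduce_mem_usage(df):
    """Return a copy whose numeric columns are downcast to smaller dtypes."""
    out = df.copy()
    for col in out.select_dtypes(include=[np.integer]).columns:
        out[col] = pd.to_numeric(out[col], downcast="integer")
    for col in out.select_dtypes(include=[np.floating]).columns:
        # float32 is the smallest target; it keeps roughly 7 significant digits.
        out[col] = pd.to_numeric(out[col], downcast="float")
    return out

# Toy frame (hypothetical values, column names from the field table).
toy = pd.DataFrame({
    "term": np.array([3, 5, 3], dtype="int64"),
    "loanAmnt": np.array([35000.0, 18000.0, 12000.0]),
})
slim = reduce_mem_usage(toy)
print(toy.memory_usage(deep=True).sum(), "->", slim.memory_usage(deep=True).sum())
```

Downcasting floats trades precision for memory, so it is worth checking that columns such as identifiers or large monetary amounts are not affected before applying it to the full data.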

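Preview part of the data:

    train.head()

|   | id | loanAmnt | term | interestRate | installment | grade | subGrade | employmentTitle | employmentLength | homeOwnership | … | n5 | n6 | n7 | n8 | n9 | n10 | n11 | n12 | n13 | n14 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 35000.0 | 5 | 19.52 | 917.97 | E | E2 | 320.0 | 2 years | 2 | … | 9.0 | 8.0 | 4.0 | 12.0 | 2.0 | 7.0 | 0.0 | 0.0 | 0.0 | 2.0 |
| 1 | 1 | 18000.0 | 5 | 18.49 | 461.90 | D | D2 | 219843.0 | 5 years | 0 | … | NaN | NaN | NaN | NaN | NaN | 13.0 | NaN | NaN | NaN | NaN |
| 2 | 2 | 12000.0 | 5 | 16.99 | 298.17 | D | D3 | 31698.0 | 8 years | 0 | … | 0.0 | 21.0 | 4.0 | 5.0 | 3.0 | 11.0 | 0.0 | 0.0 | 0.0 | 4.0 |
| 3 | 3 | 11000.0 | 3 | 7.26 | 340.96 | A | A4 | 46854.0 | 10+ years | 1 | … | 16.0 | 4.0 | 7.0 | 21.0 | 6.0 | 9.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 4 | 4 | 3000.0 | 3 | 12.99 | 101.07 | C | C2 | 54.0 | NaN | 1 | … | 4.0 | 9.0 | 10.0 | 15.0 | 7.0 | 12.0 | 0.0 | 0.0 | 0.0 | 4.0 |

5 rows × 47 columns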

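2.1 Overall distribution

As mentioned earlier, the data consists of 800,000 training records, 200,000 records in test set A and 200,000 records in test set B.

In addition:

- The training set has 47 columns: 46 feature columns and 1 label column.
- The test set has only the 46 feature columns.

    # Number of samples and feature dimensions
    train.shape  # (800000, 47)
    test.shape   # (200000, 46)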

View the feature names:

    train.columns

    Index(['id', 'loanAmnt', 'term', 'interestRate', 'installment', 'grade',
           'subGrade', 'employmentTitle', 'employmentLength', 'homeOwnership',
           'annualIncome', 'verificationStatus', 'issueDate', 'isDefault',
           'purpose', 'postCode', 'regionCode', 'dti', 'delinquency_2years',
           'ficoRangeLow', 'ficoRangeHigh', 'openAcc', 'pubRec',
           'pubRecBankruptcies', 'revolBal', 'revolUtil', 'totalAcc',
           'initialListStatus', 'applicationType', 'earliesCreditLine', 'title',
           'policyCode', 'n0', 'n1', 'n2', 'n3', 'n4', 'n5', 'n6', 'n7', 'n8',
           'n9', 'n10', 'n11', 'n12', 'n13', 'n14'],
          dtype='object')

Next, look at some basic information about the dataset (missing values, types, ...):

    train.info()

    RangeIndex: 800000 entries, 0 to 799999
    Data columns (total 47 columns):
    id                    800000 non-null int64
    loanAmnt              800000 non-null float64
    term                  800000 non-null int64
    interestRate          800000 non-null float64
    installment           800000 non-null float64
    grade                 800000 non-null object
    subGrade              800000 non-null object
    employmentTitle       799999 non-null float64
    employmentLength      753201 non-null object
    homeOwnership         800000 non-null int64
    annualIncome          800000 non-null float64
    verificationStatus    800000 non-null int64
    issueDate             800000 non-null object
    isDefault             800000 non-null int64
    purpose               800000 non-null int64
    postCode              799999 non-null float64
    regionCode            800000 non-null int64
    dti                   799761 non-null float64
    delinquency_2years    800000 non-null float64
    ficoRangeLow          800000 non-null float64
    ficoRangeHigh         800000 non-null float64
    openAcc               800000 non-null float64
    pubRec                800000 non-null float64
    pubRecBankruptcies    799595 non-null float64
    revolBal              800000 non-null float64
    revolUtil             799469 non-null float64
    totalAcc              800000 non-null float64
    initialListStatus     800000 non-null int64
    applicationType       800000 non-null int64
    earliesCreditLine     800000 non-null object
    title                 799999 non-null float64
    policyCode            800000 non-null float64
    n0                    759730 non-null float64
    n1                    759730 non-null float64
    n2                    759730 non-null float64
    n3                    759730 non-null float64
    n4                    766761 non-null float64
    n5                    759730 non-null float64
    n6                    759730 non-null float64
    n7                    759730 non-null float64
    n8                    759729 non-null float64
    n9                    759730 non-null float64
    n10                   766761 non-null float64
    n11                   730248 non-null float64
    n12                   759730 non-null float64
    n13                   759730 non-null float64
    n14                   759730 non-null float64
    dtypes: float64(33), int64(9), object(5)
    memory usage: 286.9+ MB

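As shown, many features have missing values, and the column types are dtypes: float64(33), int64(9), object(5).

Missing-value handling and type conversion are covered in the feature engineering part.

Next, look at the descriptive statistics of the data:

- Descriptive statistics deepen the understanding of the data's distribution and structure.
- Look at the pairwise relationships between features.
- Check the number of nulls, zeros, and positive or negative values, along with the mean, variance, minimum, maximum, skewness, kurtosis and so on.

    train.describe()
    # train.describe().T

This gives a rough picture of the distribution and structure of the data and a quick check for abnormal feature values.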

|   | id | loanAmnt | term | interestRate | installment | employmentTitle | homeOwnership | annualIncome | verificationStatus | isDefault | … | n5 | n6 | n7 | n8 | n9 | n10 | n11 | n12 | n13 | n14 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 800000.000000 | 800000.000000 | 800000.000000 | 800000.000000 | 800000.000000 | 799999.000000 | 800000.000000 | 8.000000e+05 | 800000.000000 | 800000.000000 | … | 759730.000000 | 759730.000000 | 759730.000000 | 759729.000000 | 759730.000000 | 766761.000000 | 730248.000000 | 759730.000000 | 759730.000000 | 759730.000000 |
| mean | 399999.500000 | 14416.818875 | 3.482745 | 13.238391 | 437.947723 | 72005.351714 | 0.614213 | 7.613391e+04 | 1.009683 | 0.199513 | … | 8.107937 | 8.575994 | 8.282953 | 14.622488 | 5.592345 | 11.643896 | 0.000815 | 0.003384 | 0.089366 | 2.178606 |
| std | 230940.252015 | 8716.086178 | 0.855832 | 4.765757 | 261.460393 | 106585.640204 | 0.675749 | 6.894751e+04 | 0.782716 | 0.399634 | … | 4.799210 | 7.400536 | 4.561689 | 8.124610 | 3.216184 | 5.484104 | 0.030075 | 0.062041 | 0.509069 | 1.844377 |
| min | 0.000000 | 500.000000 | 3.000000 | 5.310000 | 15.690000 | 0.000000 | 0.000000 | 0.000000e+00 | 0.000000 | 0.000000 | … | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 199999.750000 | 8000.000000 | 3.000000 | 9.750000 | 248.450000 | 427.000000 | 0.000000 | 4.560000e+04 | 0.000000 | 0.000000 | … | 5.000000 | 4.000000 | 5.000000 | 9.000000 | 3.000000 | 8.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 |
| 50% | 399999.500000 | 12000.000000 | 3.000000 | 12.740000 | 375.135000 | 7755.000000 | 1.000000 | 6.500000e+04 | 1.000000 | 0.000000 | … | 7.000000 | 7.000000 | 7.000000 | 13.000000 | 5.000000 | 11.000000 | 0.000000 | 0.000000 | 0.000000 | 2.000000 |
| 75% | 599999.250000 | 20000.000000 | 3.000000 | 15.990000 | 580.710000 | 117663.500000 | 1.000000 | 9.000000e+04 | 2.000000 | 0.000000 | … | 11.000000 | 11.000000 | 10.000000 | 19.000000 | 7.000000 | 14.000000 | 0.000000 | 0.000000 | 0.000000 | 3.000000 |
| max | 799999.000000 | 40000.000000 | 5.000000 | 30.990000 | 1715.420000 | 378351.000000 | 5.000000 | 1.099920e+07 | 2.000000 | 1.000000 | … | 70.000000 | 132.000000 | 79.000000 | 128.000000 | 45.000000 | 82.000000 | 4.000000 | 4.000000 | 39.000000 | 30.000000 |

2.2 Data type analysis

2.2.1 Numerical types (continuous, discrete and single-valued variables)

This part draws on the article at https://blog.csdn.net/qq_43401035/article/details/108648912 (the figure is not reproduced here).

Numerical types

    # Numerical features
    numerical_feature = list(train.select_dtypes(exclude=['object']).columns)
    numerical_feature

    ['id', 'loanAmnt', 'term', 'interestRate', 'installment', 'employmentTitle',
     'homeOwnership', 'annualIncome', 'verificationStatus', 'isDefault', 'purpose',
     'postCode', 'regionCode', 'dti', 'delinquency_2years', 'ficoRangeLow',
     'ficoRangeHigh', 'openAcc', 'pubRec', 'pubRecBankruptcies', 'revolBal',
     'revolUtil', 'totalAcc', 'initialListStatus', 'applicationType', 'title',
     'policyCode', 'n0', 'n1', 'n2', 'n3', 'n4', 'n5', 'n6', 'n7', 'n8', 'n9',
     'n10', 'n11', 'n12', 'n13', 'n14']

There are 42 numerical variables in total (dtypes: float64(33), int64(9), object(5)).

    len(numerical_feature)  # 42

Numerical variables can be further divided into continuous, discrete and single-valued variables.

So the next step separates the continuous and the discrete variables among the numerical ones:

    # Continuous variables
    serial_feature = []
    # Discrete variables
    discrete_feature = []
    # Single-valued variables
    unique_feature = []

    for fea in numerical_feature:
        temp = train[fea].nunique()  # number of unique values
        if temp == 1:
            unique_feature.append(fea)
        # Custom rule: a variable with fewer than 10 distinct values is treated as discrete
        # (the source is cut off after "elif temp"; the comparison and the two
        # branches below are reconstructed from this comment and the list names)
        elif temp < 10:
            discrete_feature.append(fea)
        else:
            serial_feature.append(fea)

[The source text is truncated at this point. The rest of the data-type analysis and the sections that follow it, up to the correlation-based step of feature selection, are missing. The text resumes in the middle of a helper function, getHighRelatedFeatureDf, which lists highly correlated feature pairs.]

    # ... (the beginning of getHighRelatedFeatureDf is missing in the source)
        'level_0':'feature_x', 'level_1':'feature_y', 0:'corr'}, axis=1, inplace=True)
        highRelatedFeatureDf = highRelatedFeatureDf[highRelatedFeatureDf.feature_x != highRelatedFeatureDf.feature_y]
        highRelatedFeatureDf['feature_pair_key'] = highRelatedFeatureDf.loc[:, ['feature_x', 'feature_y']].apply(lambda r: '#'.join(np.sort(r.values)), axis=1)
        highRelatedFeatureDf.drop_duplicates(subset=['feature_pair_key'], inplace=True)
        highRelatedFeatureDf.drop(['feature_pair_key'], axis=1, inplace=True)
        return highRelatedFeatureDf

    getHighRelatedFeatureDf(train.corr(), 0.6)

                  feature_x           feature_y      corr
    2              loanAmnt         installment  0.953369
    5          interestRate               grade  0.953269
    6          interestRate            subGrade  0.970847
    11                grade            subGrade  0.993907
    22   delinquency_2years                 n13  0.658946
    24         ficoRangeLow       ficoRangeHigh  1.000000
    28              openAcc            totalAcc  0.700796
    29              openAcc                  n2  0.658807
    30              openAcc                  n3  0.658807
    31              openAcc                  n4  0.618207
    32              openAcc                  n7  0.830624
    33              openAcc                  n8  0.646342
    34              openAcc                  n9  0.660917
    35              openAcc                 n10  0.998717
    37               pubRec  pubRecBankruptcies  0.644402
    44             totalAcc                  n5  0.623639
    45             totalAcc                  n6  0.678482
    46             totalAcc                  n8  0.761854
    47             totalAcc                 n10  0.697192
    53                   n1                  n2  0.807789
    54                   n1                  n3  0.807789
    55                   n1                  n4  0.829016
    56                   n1                  n7  0.651852
    57                   n1                  n9  0.800925
    61                   n2                  n3  1.000000
    62                   n2                  n4  0.663186
    63                   n2                  n7  0.790337
    64                   n2                  n9  0.982015
    65                   n2                 n10  0.655296
    70                   n3                  n4  0.663186
    71                   n3                  n7  0.790337
    72                   n3                  n9  0.982015
    73                   n3                 n10  0.655296
    79                   n4                  n5  0.717936
    80                   n4                  n7  0.742157
    81                   n4                  n9  0.639867
    82                   n4                 n10  0.614658
    86                   n5                  n7  0.618970
    87                   n5                  n8  0.838066
    97                   n7                  n8  0.774955
    98                   n7                  n9  0.794465
    99                   n7                 n10  0.829799
    105                  n8                 n10  0.640729
    113                  n9                 n10  0.660395
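The helper that produced the list of highly correlated pairs survives only in part in this text. As a rough illustration, and not a recovery of the author's code, a function with the same output columns could be sketched as follows. The name `high_corr_pairs`, the use of the absolute correlation and the toy data are all assumptions.

```python
import numpy as np
import pandas as pd

def high_corr_pairs(corr_matrix, threshold):
    """List each feature pair whose absolute correlation exceeds threshold, once."""
    # Keep only the strict upper triangle so (a, b) and (b, a) are not both listed
    # and the diagonal (a feature with itself) is skipped.
    mask = np.triu(np.ones(corr_matrix.shape, dtype=bool), k=1)
    pairs = corr_matrix.where(mask).stack().reset_index()
    pairs.columns = ["feature_x", "feature_y", "corr"]
    # NaN entries, if any survive stacking, fail the comparison and are dropped.
    return pairs[pairs["corr"].abs() > threshold].reset_index(drop=True)

# Toy data (hypothetical): b follows a closely, c does not.
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 11],
    "c": [5, 1, 4, 2, 3],
})
toy_pairs = high_corr_pairs(toy.corr(), 0.6)
print(toy_pairs)
```

Unlike the surviving fragment, which removes duplicate pairs through a sorted pair key, this sketch avoids duplicates by masking the lower triangle; the row index therefore differs from the one shown in the output above.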

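Analysis of the results:

1) "loanAmnt" (loan amount) and "installment" (installment payment amount) have a correlation coefficient of 0.95.

2) "ficoRangeLow" (lower bound of the FICO range) and "ficoRangeHigh" (upper bound of the FICO range) have a correlation coefficient of 1.

3) "openAcc" (number of open credit lines) and "n10" have a correlation coefficient of about 1.00 (0.9987 in the output above).

4) "n3" and "n2" have a correlation coefficient of 1; "n3" and "n9" have a correlation coefficient of 0.98.

Based on these highly correlated pairs, and taking their correlation with the target into account, the features "installment", "ficoRangeHigh", "openAcc", "n3" and "n9" are dropped.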

    col = ['installment', 'ficoRangeHigh', 'openAcc', 'n3', 'n9']
    for data in [train, test]:
        data.drop(col, axis=1, inplace=True)
    train.shape  # (800000, 54)

The remaining highly correlated features could be reduced with PCA (reference: https://zhuanlan.zhihu.com/p/255105477?utm_source=wechat_session).

Note: this is not done here.
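Although the article skips this step, the PCA idea can be sketched with scikit-learn. The example below is an illustration on synthetic data, assuming three strongly correlated columns that merely borrow the names n1, n2 and n4; the output column name `n_pca_0` is hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a group of highly correlated features.
rng = np.random.default_rng(0)
base = rng.normal(size=200)
toy = pd.DataFrame({
    "n1": base + 0.1 * rng.normal(size=200),
    "n2": base + 0.1 * rng.normal(size=200),
    "n4": base + 0.1 * rng.normal(size=200),
})

# Standardise first so that no column dominates because of its scale.
scaled = StandardScaler().fit_transform(toy)
pca = PCA(n_components=1)
component = pca.fit_transform(scaled)

# Replace the three correlated columns with a single component.
toy_reduced = pd.DataFrame({"n_pca_0": component[:, 0]})
print(pca.explained_variance_ratio_)
```

In a real pipeline the scaler and the PCA would be fitted on the training set only and then applied to the test set, and missing values would have to be filled beforehand because PCA does not accept NaN.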

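4. Low-variance filtering

    # numeric_only=True keeps the call valid if any non-numeric column remains
    train.var(numeric_only=True).sort_values()

Taking the correlation analysis into account as well, the feature "applicationType", whose variance is below 0.1, is filtered out.

    col = ['applicationType']
    for data in [train, test]:
        data.drop(col, axis=1, inplace=True)
    train.shape  # (800000, 53)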

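Summary

Deleting highly correlated features and reducing dimensions with PCA may not be the most suitable treatment in feature selection. Filter, wrapper and embedded feature selection methods are also worth trying.

3.10 Handling class imbalance

If the number of samples per class differs too much in a classification problem, the classes are imbalanced. Imbalance makes it harder to train a sound model and to evaluate it sensibly.

    label.value_counts() / len(label)

    0    0.800488
    1    0.199513
    Name: isDefault, dtype: float64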

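1. Oversampling

Oversampling (also called up-sampling) balances the classes by increasing the number of minority-class samples. The most direct way is to copy minority-class samples so that they appear several times. The code below uses SMOTE from imbalanced-learn instead, which synthesises new minority samples by interpolating between existing ones.

Reference: https://zhuanlan.zhihu.com/p/255105477?utm_source=wechat_session

    import imblearn
    from imblearn.over_sampling import SMOTE

    over_samples = SMOTE(random_state=1234)
    # fit_resample replaces the older fit_sample, which newer imbalanced-learn releases no longer provide
    train_over, label_over = over_samples.fit_resample(train, label)
    train_over.to_csv('train_over.csv', index=False)
    label_over.to_csv('label_over.csv', index=False)
    print(label_over.value_counts() / len(label_over))
    print(train_over.shape)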
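The "most direct" form of oversampling, copying minority-class rows, can be sketched with pandas alone. This is an illustration on hypothetical toy data; the function name `random_oversample` is an assumption, and it expects a binary label and a unique row index shared by `X` and `y`.

```python
import pandas as pd

def random_oversample(X, y, random_state=1234):
    """Duplicate randomly chosen minority-class rows until both classes are equally large."""
    counts = y.value_counts()
    minority, majority = counts.idxmin(), counts.idxmax()
    n_extra = int(counts[majority] - counts[minority])
    # Draw minority rows with replacement; the same row may be copied several times.
    extra_idx = y[y == minority].sample(n=n_extra, replace=True, random_state=random_state).index
    X_over = pd.concat([X, X.loc[extra_idx]], ignore_index=True)
    y_over = pd.concat([y, y.loc[extra_idx]], ignore_index=True)
    return X_over, y_over

# Toy data (hypothetical): four non-defaults, one default.
toy_X = pd.DataFrame({"loanAmnt": [1000, 2000, 3000, 4000, 5000]})
toy_y = pd.Series([0, 0, 0, 0, 1], name="isDefault")
X_over, y_over = random_oversample(toy_X, toy_y)
print(y_over.value_counts())
```

Resampling is normally applied to the training folds only; the validation and test data keep their original class ratio so that the reported AUC stays meaningful.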

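2. Undersampling

Undersampling (also called down-sampling) balances the classes by reducing the number of majority-class samples. The most direct way is to randomly drop some majority-class samples to shrink the majority class.

Reference: https://zhuanlan.zhihu.com/p/255105477?utm_source=wechat_session

    from imblearn.under_sampling import RandomUnderSampler

    under_samples = RandomUnderSampler(random_state=1234)
    # fit_resample replaces the older fit_sample, which newer imbalanced-learn releases no longer provide
    train_under, label_under = under_samples.fit_resample(train, label)
    train_under.to_csv('train_under.csv', index=False)
    label_under.to_csv('label_under.csv', index=False)
    print(label_under.value_counts() / len(label_under))
    print(train_under.shape)

Note: no resampling is applied here. If it were, models could be trained separately on the oversampled data and on the undersampled data.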

4. Modelling

With feature processing done, the next step is modelling, followed by parameter tuning to obtain a stronger model.

4.1 LightGBM

Reference: https://zhuanlan.zhihu.com/p/256310383

    X = train.drop(['isDefault'], axis=1)
    y = train.loc[:, 'isDefault']
    kf = KFold(n_splits=5, shuffle=True, random_state=525)
    X_train_split, X_val, y_train_split, y_val = train_test_split(X, y, test_size=0.2)

Train and validate with 5-fold cross-validation:

    import lightgbm as lgb

    cv_scores = []
    for i, (train_index, val_index) in enumerate(kf.split(X, y)):
        X_train, y_train = X.iloc[train_index], y.iloc[train_index]
        X_val, y_val = X.iloc[val_index], y.iloc[val_index]

        train_matrix = lgb.Dataset(X_train, label=y_train)
        valid_matrix = lgb.Dataset(X_val, label=y_val)

        params = {
            'boosting_type': 'gbdt',
            'objective': 'binary',
            'learning_rate': 0.1,
            'metric': 'auc',
            'min_child_weight': 1e-3,
            'num_leaves': 31,
            'max_depth': -1,
            'seed': 525,
            'nthread': 8,
            'verbose': -1,  # replaces the former 'silent': True
        }

        model = lgb.train(
            params,
            train_set=train_matrix,
            num_boost_round=20000,
            valid_sets=valid_matrix,
            # Callbacks replace the verbose_eval / early_stopping_rounds arguments,
            # which recent LightGBM releases no longer accept in lgb.train
            callbacks=[lgb.early_stopping(stopping_rounds=200), lgb.log_evaluation(period=1000)],
        )
        val_pred = model.predict(X_val, num_iteration=model.best_iteration)
        cv_scores.append(roc_auc_score(y_val, val_pred))
        print(cv_scores)

    print("lgb_score_list:{}".format(cv_scores))
    print("lgb_score_mean:{}".format(np.mean(cv_scores)))
    print("lgb_score_std:{}".format(np.std(cv_scores)))

    lgb_score_list:[0.7303837315833632, 0.7258692125145638, 0.7305149209921737, 0.7296117869375041, 0.7294438695369077]
    lgb_score_mean:0.7291647043129024
    lgb_score_std:0.0016998349834934656

ROC curve

    from sklearn import metrics
    from sklearn.metrics import roc_auc_score

    # Predictions of the most recently trained model on its validation split
    val_pre_lgb = model.predict(X_val, num_iteration=model.best_iteration)
    fpr, tpr, threshold = metrics.roc_curve(y_val, val_pre_lgb)
    roc_auc = metrics.auc(fpr, tpr)
    print('AUC: {}'.format(roc_auc))

    plt.figure(figsize=(8, 8))
    plt.title('ROC')
    plt.plot(fpr, tpr, 'b', label='Val AUC = %0.4f' % roc_auc)
    plt.ylim(0, 1)
    plt.xlim(0, 1)
    plt.legend(loc='best')
    plt.ylabel('True Positive Rate')
    plt.xlabel('False Positive Rate')
    # Draw the diagonal
    plt.plot([0, 1], [0, 1], 'r--')
    plt.show()

The AUC score is 0.7338.

4.2 XGBoost

    X = train.drop(['isDefault'], axis=1)
    y = train.loc[:, 'isDefault']
    Xtrain, Xtest, Ytrain, Ytest = train_test_split(X, y, test_size=0.3)

Run an XGBClassifier model (for the specific xgboost parameter settings, see the official documentation):

    from xgboost.sklearn import XGBClassifier

    clf1 = XGBClassifier(n_jobs=-1)
    clf1.fit(Xtrain, Ytrain)
    clf1.score(Xtest, Ytest)  # accuracy on the held-out split

    0.8068791666666667

Compute the model's AUC:

    from sklearn.metrics import roc_curve, auc

    predict_proba = clf1.predict_proba(Xtest)
    false_positive_rate, true_positive_rate, thresholds = roc_curve(Ytest, predict_proba[:, 1])
    auc(false_positive_rate, true_positive_rate)

    0.7326304866618416

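4.3 Comparing three models

    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.model_selection import cross_val_score
    from xgboost import XGBClassifier
    from lightgbm import LGBMClassifier

    gra = GradientBoostingClassifier()
    # Named xgb_clf / lgb_clf so that the xgb and lgb module aliases are not overwritten
    xgb_clf = XGBClassifier()
    lgb_clf = LGBMClassifier()
    models = [gra, xgb_clf, lgb_clf]
    model_names = ["gra", "xgb", "lgb"]

    # Cross-validate to compare the scores of the three algorithms
    for i, model in enumerate(models):
        score = cross_val_score(model, X, y, cv=5, scoring="accuracy", n_jobs=-1)
        print(model_names[i], np.array(score).round(3), round(score.mean(), 3))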

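Other modelling approaches

See:

"Financial Risk Control: Loan Default Prediction, Data Competition Primer, Part 4: Modelling and Parameter Tuning"

and

"Data Mining Practice (Financial Risk Control: Loan Default Prediction) (4): Modelling and Parameter Tuning"

Try a variety of models (the figure is not reproduced here).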

5. Model tuning

5.1 Tuning methods

(1) Greedy tuning

Reference: https://www.jianshu.com/p/cdf0a9ffec6f (the figure is not reproduced here)
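Greedy tuning is usually understood as tuning one parameter at a time and freezing its best value before moving to the next. The sketch below follows that reading on synthetic data; the function name `greedy_tune`, the decision-tree model and the search space are illustrative assumptions, not the article's setup.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

def greedy_tune(make_model, search_space, X, y, cv=3, scoring="roc_auc"):
    """Tune one parameter at a time, freezing the best value before moving on."""
    best_params = {}
    best_score = None
    for name, candidates in search_space.items():
        scores = {}
        for value in candidates:
            # Parameters tuned earlier stay fixed; later ones keep their defaults.
            model = make_model(**best_params, **{name: value})
            scores[value] = cross_val_score(model, X, y, cv=cv, scoring=scoring).mean()
        best_value = max(scores, key=scores.get)
        best_params[name] = best_value
        best_score = scores[best_value]
    return best_params, best_score

# Synthetic data standing in for the competition features.
X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=0)
space = {"max_depth": [2, 4, 6], "min_samples_leaf": [1, 5, 20]}
params, score = greedy_tune(
    lambda **kw: DecisionTreeClassifier(random_state=0, **kw), space, X_toy, y_toy
)
print(params, score)
```

The number of fits grows with the sum of the candidate counts rather than their product, which is why greedy tuning is cheap; the price is that interactions between parameters can be missed.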

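(2) Grid search

Reference: https://www.jianshu.com/p/cdf0a9ffec6f

sklearn provides GridSearchCV for grid search: given the model and the candidate parameter values, it returns the best score and the best parameters. Because it evaluates every combination, grid search usually finds better settings than greedy tuning, but it is only practical on small datasets. Once the data volume grows, it becomes very hard to obtain a result.

(3) Bayesian tuning

Reference: https://www.jianshu.com/p/cdf0a9ffec6f

The main idea of Bayesian tuning is this: given an objective function to optimise (a function in the broad sense; only its inputs and outputs need to be specified, not its internal structure or mathematical properties), keep adding sample points to update the posterior distribution of the objective function (a Gaussian process), until the posterior essentially matches the true distribution. Put simply, it takes the information from previous parameter settings into account in order to adjust the current parameters better.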

5.2 XGBoost tuning

Reference: https://zhuanlan.zhihu.com/p/255105477?utm_source=wechat_session

1. Optimise max_depth and min_child_weight

    from xgboost.sklearn import XGBClassifier
    from sklearn.model_selection import GridSearchCV

    # Parameters held fixed
    other_params = {'learning_rate': 0.1, 'n_estimators': 100, 'max_depth': 5, 'min_child_weight': 1,
                    'seed': 0, 'subsample': 0.8, 'colsample_bytree': 0.8, 'gamma': 0,
                    'reg_alpha': 0, 'reg_lambda': 1}
    # Parameters to tune
    param_test1 = {
        'max_depth': list(range(4, 9, 2)),
        'min_child_weight': list(range(1, 6, 2))
    }
    xgb1 = XGBClassifier(**other_params)
    # Grid search
    gs1 = GridSearchCV(xgb1, param_test1, cv=5, scoring='roc_auc', n_jobs=-1, verbose=2)
    best_model1 = gs1.fit(Xtrain, Ytrain)
    print('Best parameters:', best_model1.best_params_)
    print('Best model score:', best_model1.best_score_)

    Best parameters: {'max_depth': 4, 'min_child_weight': 5}
    Best model score: 0.7185495198261862

2. Optimise the gamma parameter

    other_params = {'learning_rate': 0.1, 'n_estimators': 100, 'max_depth': 4, 'min_child_weight': 5,
                    'seed': 0, 'subsample': 0.8, 'colsample_bytree': 0.8, 'gamma': 0,
                    'reg_alpha': 0, 'reg_lambda': 1}
    param_test = {
        'gamma': [0, 0.05, 0.1, 0.2, 0.3]
    }
    # Named xgb_model so that the xgb module alias is not overwritten
    xgb_model = XGBClassifier(**other_params)
    gs = GridSearchCV(xgb_model, param_test, cv=5, scoring='roc_auc', n_jobs=-1, verbose=2)
    best_model = gs.fit(Xtrain, Ytrain)
    print('Best parameters:', best_model.best_params_)
    print('Best model score:', best_model.best_score_)

    Best parameters: {'gamma': 0}
    Best model score: 0.7185495198261862

3. subsample and colsample_bytree

    other_params = {'learning_rate': 0.1, 'n_estimators': 100, 'max_depth': 4, 'min_child_weight': 5,
                    'seed': 0, 'subsample': 0.8, 'colsample_bytree': 0.8, 'gamma': 0,
                    'reg_alpha': 0, 'reg_lambda': 1}
    param_test = {
        'subsample': [0.6, 0.7, 0.8, 0.9],
        'colsample_bytree': [0.6, 0.7, 0.8, 0.9]
    }
    # Named xgb_model so that the xgb module alias is not overwritten
    xgb_model = XGBClassifier(**other_params)
    gs = GridSearchCV(xgb_model, param_test, cv=5, scoring='roc_auc', n_jobs=-1, verbose=2)
    best_model = gs.fit(Xtrain, Ytrain)
    print('Best parameters:', best_model.best_params_)
    print('Best model score:', best_model.best_score_)

    Best parameters: {'colsample_bytree': 0.7, 'subsample': 0.7}
    Best model score: 0.7187964885978947

4. reg_alpha and reg_lambda

    other_params = {'learning_rate': 0.1, 'n_estimators': 100, 'max_depth': 4, 'min_child_weight': 5,
                    'seed': 0, 'subsample': 0.7, 'colsample_bytree': 0.7, 'gamma': 0,
                    'reg_alpha': 0, 'reg_lambda': 1}
    param_test = {
        'reg_alpha': [4, 5, 6, 7],
        'reg_lambda': [0, 0.01, 0.05, 0.1]
    }
    # Named xgb_model so that the xgb module alias is not overwritten
    xgb_model = XGBClassifier(**other_params)
    gs = GridSearchCV(xgb_model, param_test, cv=5, scoring='roc_auc', n_jobs=-1, verbose=2)
    best_model = gs.fit(Xtrain, Ytrain)
    print('Best parameters:', best_model.best_params_)
    print('Best model score:', best_model.best_score_)

    Best parameters: {'reg_alpha': 5, 'reg_lambda': 0.01}
    Best model score: 0.7194153615536154

5. learning_rate and n_estimators

    other_params = {'learning_rate': 0.1, 'n_estimators': 100, 'max_depth': 4, 'min_child_weight': 5,
                    'seed': 0, 'subsample': 0.7, 'colsample_bytree': 0.7, 'gamma': 0,
                    'reg_alpha': 5, 'reg_lambda': 0.01}
    param_test = {
        'learning_rate': [0.01, 0.05, 0.07, 0.1, 0.2],
        'n_estimators': [100, 200, 300, 400, 500]
    }
    # Named xgb_model so that the xgb module alias is not overwritten
    xgb_model = XGBClassifier(**other_params)
    gs = GridSearchCV(xgb_model, param_test, cv=5, scoring='roc_auc', n_jobs=-1, verbose=2)
    best_model = gs.fit(Xtrain, Ytrain)
    print('Best parameters:', best_model.best_params_)
    print('Best model score:', best_model.best_score_)

    Best parameters: {'learning_rate': 0.05, 'n_estimators': 400}
    Best model score: 0.7207082359918353

The final model after tuning:

    from xgboost.sklearn import XGBClassifier

    clf = XGBClassifier(
        learning_rate=0.05,
        n_estimators=400,
        max_depth=4,
        min_child_weight=5,
        seed=0,
        subsample=0.7,
        colsample_bytree=0.7,
        gamma=0,
        reg_alpha=5,
        reg_lambda=0.01,
        n_jobs=-1)
    clf.fit(Xtrain, Ytrain)
    clf.score(Xtest, Ytest)  # accuracy on the held-out split

    0.80934521

AUC:

    from sklearn.metrics import roc_curve, auc

    predict_proba = clf.predict_proba(Xtest)
    false_positive_rate, true_positive_rate, thresholds = roc_curve(Ytest, predict_proba[:, 1])
    auc(false_positive_rate, true_positive_rate)

    0.74512067

With the model fitted, feature importance can also be obtained:

    from xgboost import plot_importance

    plot_importance(clf)
    plt.show()

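Summary

Key points from the tuning process:

(1) "n_estimators": the more base learners, the smaller the bias, but time is limited; 30 can be chosen as a starting value here.

(2) "max_depth": the larger it is, the smaller the bias and the larger the variance; training time and goodness of fit have to be weighed together.

(3) "learning_rate": a smaller learning rate is generally better, but training takes longer.

(4) "subsample": a sampling ratio within [0.5, 0.8] generally works well.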

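6. Model ensembling

Model ensembling is an important way to gain score in the later stage of a competition, and combining models often improves the result noticeably. The ensembling approaches are listed below.

"[Machine Learning] Overview of Model Ensembling Methods"

1) Averaging: for regression problems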
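Averaging carries over directly to predicted probabilities, which is what this competition asks for. A minimal sketch with NumPy, assuming hypothetical predictions from two models; the function name `weighted_average` and the weights are illustrative.

```python
import numpy as np

def weighted_average(predictions, weights=None):
    """Average several models' predictions; weights=None gives the simple mean."""
    preds = np.asarray(predictions, dtype=float)  # shape: (n_models, n_samples)
    return np.average(preds, axis=0, weights=weights)

# Hypothetical default probabilities for three samples from two models.
lgb_pred = [0.10, 0.60, 0.80]
xgb_pred = [0.20, 0.40, 0.90]

simple = weighted_average([lgb_pred, xgb_pred])
weighted = weighted_average([lgb_pred, xgb_pred], weights=[3, 1])
print(simple, weighted)
```

The weights are typically chosen from validation scores, giving the stronger model the larger weight.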


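2) Voting: for classification problems

Simple voting

Weighted voting

Hard voting

Model 1: A - 99%, B - 1% means that model 1 considers the probability that the sample is of type A to be 99% and the probability that it is of type B to be 1%.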

Python implementation

    from xgboost import XGBClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.ensemble import VotingClassifier
    from sklearn.svm import SVC
    from sklearn.model_selection import train_test_split, cross_val_score  # data splitting and cross-validation

    clf1 = XGBClassifier(learning_rate=0.1, n_estimators=150, max_depth=3, min_child_weight=2,
                         subsample=0.7, colsample_bytree=0.6, objective='binary:logistic')
    clf2 = RandomForestClassifier(n_estimators=50, max_depth=1, min_samples_split=4,
                                  min_samples_leaf=63, oob_score=True)
    clf3 = SVC(C=0.1)

    # Hard voting
    eclf = VotingClassifier(estimators=[('xgb', clf1), ('rf', clf2), ('svc', clf3)], voting='hard')

    # Compare the individual models with the ensemble
    for clf, name in zip([clf1, clf2, clf3, eclf], ['XGBBoosting', 'Random Forest', 'SVM', 'Ensemble']):
        scores = cross_val_score(clf, X, y, cv=5, scoring='accuracy')
        print("Accuracy: %0.2f (+/- %0.2f) [%s]" % (scores.mean(), scores.std(), name))

Soft voting

The average of the probabilities that all models assign to a class is used as the criterion.

    from xgboost import XGBClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.ensemble import VotingClassifier
    from sklearn.svm import SVC
    from sklearn.model_selection import train_test_split, cross_val_score  # data splitting and cross-validation

    clf1 = XGBClassifier(learning_rate=0.1, n_estimators=150, max_depth=3, min_child_weight=2,
                         subsample=0.7, colsample_bytree=0.6, objective='binary:logistic')
    clf2 = RandomForestClassifier(n_estimators=50, max_depth=1, min_samples_split=4,
                                  min_samples_leaf=63, oob_score=True)
    # Soft voting needs predict_proba, which SVC only provides with probability=True
    clf3 = SVC(C=0.1, probability=True)

    # Soft voting
    eclf = VotingClassifier(estimators=[('xgb', clf1), ('rf', clf2), ('svc', clf3)],
                            voting='soft', weights=[2, 1, 1])

    # Compare the individual models with the ensemble
    for clf, name in zip([clf1, clf2, clf3, eclf], ['XGBBoosting', 'Random Forest', 'SVM', 'Ensemble']):
        scores = cross_val_score(clf, X, y, cv=5, scoring='accuracy')
        print("Accuracy: %0.2f (+/- %0.2f) [%s]" % (scores.mean(), scores.std(), name))
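The difference between hard and soft voting shows up on a single sample. In the sketch below, model 1 uses the probabilities quoted in the text (A 99%, B 1%), while models 2 and 3 are hypothetical additions made for the sake of the example.

```python
import numpy as np

classes = np.array(["A", "B"])
# Rows are models, columns are P(A) and P(B) for one sample.
# Model 1 comes from the text; models 2 and 3 are hypothetical.
proba = np.array([
    [0.99, 0.01],
    [0.49, 0.51],
    [0.49, 0.51],
])

# Hard voting: each model votes for its most likely class and the majority wins.
votes = proba.argmax(axis=1)
hard_winner = classes[np.bincount(votes, minlength=len(classes)).argmax()]

# Soft voting: average the probabilities first, then pick the most likely class.
soft_winner = classes[proba.mean(axis=0).argmax()]
print(hard_winner, soft_winner)
```

Two lukewarm votes for B outvote one very confident vote for A under hard voting, whereas soft voting lets the confident model carry the decision.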

3) Combined methods

Rank fusion

Log fusion
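The source only names these methods. One common reading of rank fusion, given here as an assumption, is to replace each model's scores by their ranks, scale the ranks to [0, 1] and average them; since AUC depends only on the ordering of the scores, this puts models with differently scaled outputs on an equal footing. The function name `rank_average` and the toy scores are hypothetical.

```python
import numpy as np
import pandas as pd

def rank_average(predictions):
    """Average the normalised ranks of several models' scores (needs at least 2 samples)."""
    frame = pd.DataFrame(predictions)
    ranks = frame.rank(axis=0, method="average")  # 1 = lowest score within each model
    normalised = (ranks - 1) / (len(frame) - 1)   # scale ranks to the range [0, 1]
    return normalised.mean(axis=1).to_numpy()

# Hypothetical scores from two models whose outputs live on very different scales.
toy_scores = {
    "lgb": [0.10, 0.60, 0.80],
    "xgb": [0.02, 0.03, 0.01],
}
fused = rank_average(toy_scores)
print(fused)
```

The fused values are no longer probabilities, which is acceptable for a ranking metric such as AUC but not for metrics that need calibrated outputs, such as log loss.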

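4) stacking/blending:

- stacking: build models in several layers and fit a further model on the predictions of the previous layer.
- blending: train on part of the data, use the resulting predictions as new features, and feed them into the remaining data for prediction. Blending has only one layer, while stacking has several.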
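Blending as described can be sketched with a hold-out split: the base models are trained on one part of the data, and a meta-model is trained on their predictions for the held-out part. The example uses synthetic data and scikit-learn models chosen for illustration; none of the names or settings come from the original article.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the competition features.
X_toy, y_toy = make_classification(n_samples=600, n_features=10, random_state=0)
X_rest, X_test, y_rest, y_test = train_test_split(X_toy, y_toy, test_size=0.25, random_state=0)
# Split the remaining data: one part for the base models, a hold-out part for the meta-model.
X_base, X_hold, y_base, y_hold = train_test_split(X_rest, y_rest, test_size=0.3, random_state=0)

base_models = [
    GradientBoostingClassifier(n_estimators=50, random_state=0),
    RandomForestClassifier(n_estimators=50, random_state=0),
]
for m in base_models:
    m.fit(X_base, y_base)

def meta_features(models, X):
    """One column per base model: its predicted probability of class 1."""
    return np.column_stack([m.predict_proba(X)[:, 1] for m in models])

# The meta-model only sees predictions on data the base models were not trained on.
meta_model = LogisticRegression()
meta_model.fit(meta_features(base_models, X_hold), y_hold)

blend_pred = meta_model.predict_proba(meta_features(base_models, X_test))[:, 1]
blend_auc = roc_auc_score(y_test, blend_pred)
print(blend_auc)
```

Stacking differs mainly in that the meta-features come from out-of-fold predictions over the whole training set instead of a single hold-out part, so no training data is set aside.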


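5) boosting/bagging

Boosting methods built on many trees, already used in xgboost, AdaBoost and GBDT.

With these methods introduced, back to the competition.

Here gradient boosting, XGBoost and LightGBM classifiers serve as base classifiers and logistic regression as the meta-classifier for stacking:

    from mlxtend.classifier import StackingClassifier
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_curve, auc
    from xgboost import XGBClassifier
    from lightgbm import LGBMClassifier

    gra = GradientBoostingClassifier()
    # Named xgb_clf / lgb_clf so that the xgb and lgb module aliases are not overwritten
    xgb_clf = XGBClassifier()
    lgb_clf = LGBMClassifier()
    lr = LogisticRegression()

    sclf = StackingClassifier(classifiers=[gra, xgb_clf, lgb_clf],
                              use_probas=True,
                              meta_classifier=lr)
    sclf.fit(Xtrain, Ytrain)
    pre = sclf.predict_proba(Xtest)[:, 1]
    fpr, tpr, thresholds = roc_curve(Ytest, pre)
    score = auc(fpr, tpr)
    print(score)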

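Summary

Simple averaging and weighted averaging are two commonly used ways of ensembling models in competitions. Their advantage is that they are fast and simple. Stacking is very slow, and the gain from multi-layer stacking does not offset its cost in time and memory, so it remains hard to apply in practice.

7. Deploying the results

a) Predict on the evaluation dataset (use the validation data to verify the optimised model)
b) Build the model on the whole dataset (use the entire dataset to produce the model)
c) Serialise the model (so that it can be used to predict new data)

When new data arrives, this model can be used to predict it.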
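Step c) can be sketched with the standard-library `pickle` module. The tiny logistic regression below is a stand-in for the tuned model, fitted on hypothetical toy data; in practice the serialised bytes would be written to a file and loaded by the prediction service.

```python
import pickle

import numpy as np
from sklearn.linear_model import LogisticRegression

# Stand-in model trained on toy data (hypothetical).
X_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_toy = np.array([0, 0, 1, 1])
model = LogisticRegression().fit(X_toy, y_toy)

# Serialise to bytes; the same bytes could be written to disk with open(path, "wb").
blob = pickle.dumps(model)

# Later, possibly in another process: restore the model and predict new data.
restored = pickle.loads(blob)
same = np.allclose(model.predict_proba(X_toy), restored.predict_proba(X_toy))
print(same)
```

Pickled models should only be loaded from trusted sources and with the same library versions that produced them, since unpickling executes code and the format is not guaranteed to be stable across releases.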

Completed on 2021.3.29
