scikit

2024-07-05 03:35:33 | Source: compiled from the web

Nested versus non-nested cross-validation

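This example compares non-nested and nested cross-validation strategies on a classifier of the iris data set. Nested cross-validation (CV) is often used to train a model whose hyperparameters also need to be optimized. Nested CV estimates the generalization error of the underlying model together with its (hyper)parameter search. Choosing the parameters that maximize the non-nested CV score biases the model toward the dataset and yields an overly optimistic score.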

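Model selection without nested CV uses the same data both to tune the model parameters and to evaluate model performance. Information can therefore "leak" into the model, which then overfits the data. The magnitude of this effect depends primarily on the size of the dataset and the stability of the model. See Cawley and Talbot [1] for an analysis of these issues.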
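The optimistic bias described above can be illustrated without any real model. The following is a minimal simulation sketch, not part of the original example: it assumes every candidate hyperparameter setting has the same true accuracy and that each CV estimate is that accuracy plus Gaussian noise. The constants `TRUE_ACCURACY`, `NOISE_STD`, and `N_REPEATS` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

TRUE_ACCURACY = 0.95  # assumption: all candidates generalize equally well
N_CANDIDATES = 6      # same size as a 3 x 2 hyperparameter grid
NOISE_STD = 0.02      # assumption: spread of one CV estimate around the truth
N_REPEATS = 10_000    # number of simulated model-selection runs

# Each row holds one noisy CV estimate per candidate setting.
noise = NOISE_STD * rng.standard_normal((N_REPEATS, N_CANDIDATES))
cv_estimates = TRUE_ACCURACY + noise

# Non-nested CV reports the selected candidate's own CV score,
# i.e. the maximum of the noisy estimates in each run.
selected_scores = cv_estimates.max(axis=1)
optimism = selected_scores.mean() - TRUE_ACCURACY

print(f"Mean reported score of the selected candidate: {selected_scores.mean():.4f}")
print(f"Optimistic bias relative to the true accuracy: {optimism:.4f}")
```

Even though no candidate is actually better than another here, the reported score of the "winner" sits above the true accuracy on average, because taking a maximum over noisy estimates selects for favorable noise. A larger grid or a noisier estimate (for example, from a smaller dataset) would be expected to widen the gap.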

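To avoid this problem, nested CV effectively uses a series of train/validation/test set splits. In the inner loop (here executed by GridSearchCV), the score is approximately maximized by fitting a model to each training set, and then directly maximized when selecting (hyper)parameters on the validation set. In the outer loop (here cross_val_score), the generalization error is estimated by averaging test set scores over several dataset splits.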
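The two loops described above can also be written out by hand. The sketch below is one possible reading of what the combination of GridSearchCV and cross_val_score does, not the library's internal implementation; details such as tie-breaking between equally good candidates may differ. The grid and fold settings mirror the ones used in this example, and the fixed `random_state=0` is an arbitrary choice.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, ParameterGrid, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
p_grid = {"C": [1, 10, 100], "gamma": [0.01, 0.1]}

inner_cv = KFold(n_splits=4, shuffle=True, random_state=0)
outer_cv = KFold(n_splits=4, shuffle=True, random_state=0)

outer_scores = []
for train_idx, test_idx in outer_cv.split(X):
    X_train, y_train = X[train_idx], y[train_idx]
    X_test, y_test = X[test_idx], y[test_idx]

    # Inner loop: score every candidate using only the outer training set.
    best_params, best_inner_score = None, -np.inf
    for params in ParameterGrid(p_grid):
        candidate = SVC(kernel="rbf", **params)
        inner_score = cross_val_score(
            candidate, X_train, y_train, cv=inner_cv
        ).mean()
        if inner_score > best_inner_score:
            best_params, best_inner_score = params, inner_score

    # Refit the winning setting on the whole outer training set, then
    # score it on the test fold that played no part in the selection.
    model = SVC(kernel="rbf", **best_params).fit(X_train, y_train)
    outer_scores.append(model.score(X_test, y_test))

nested_score = float(np.mean(outer_scores))
print(f"Hand-rolled nested CV accuracy: {nested_score:.4f}")
```

The key point is that each outer test fold is never seen while hyperparameters are being chosen, so the averaged outer score estimates the performance of the whole procedure (search plus fit) rather than of one tuned model.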

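The example below uses a support vector classifier with a non-linear kernel and builds a model with optimized hyperparameters by grid search. We compare the non-nested and nested CV strategies by taking the difference between their scores.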

See Also:

Cross-validation: evaluating estimator performance

Tuning the hyper-parameters of an estimator

References:

[1]

Cawley, G.C.; Talbot, N.L.C. On over-fitting in model selection and subsequent selection bias in performance evaluation. J. Mach. Learn. Res. 2010, 11, 2079-2107.

from sklearn.datasets import load_iris
from matplotlib import pyplot as plt
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, cross_val_score, KFold
import numpy as np

# Number of random trials
NUM_TRIALS = 30

# Load the dataset
iris = load_iris()
X_iris = iris.data
y_iris = iris.target

# Set up possible values of parameters to optimize over
p_grid = {"C": [1, 10, 100], "gamma": [0.01, 0.1]}

# We will use a support vector classifier with an "rbf" kernel
svm = SVC(kernel="rbf")

# Arrays to store scores
non_nested_scores = np.zeros(NUM_TRIALS)
nested_scores = np.zeros(NUM_TRIALS)

# Loop over the trials
for i in range(NUM_TRIALS):
    # Choose cross-validation techniques for the inner and outer loops,
    # independently of the dataset.
    # E.g. "GroupKFold", "LeaveOneOut", "LeaveOneGroupOut", etc.
    inner_cv = KFold(n_splits=4, shuffle=True, random_state=i)
    outer_cv = KFold(n_splits=4, shuffle=True, random_state=i)

    # Non-nested parameter search and scoring
    clf = GridSearchCV(estimator=svm, param_grid=p_grid, cv=outer_cv)
    clf.fit(X_iris, y_iris)
    non_nested_scores[i] = clf.best_score_

    # Nested CV with parameter optimization
    clf = GridSearchCV(estimator=svm, param_grid=p_grid, cv=inner_cv)
    nested_score = cross_val_score(clf, X=X_iris, y=y_iris, cv=outer_cv)
    nested_scores[i] = nested_score.mean()

score_difference = non_nested_scores - nested_scores

print(
    "Average difference of {:6f} with std. dev. of {:6f}.".format(
        score_difference.mean(), score_difference.std()
    )
)

# Plot scores on each trial for nested and non-nested CV
plt.figure()
plt.subplot(211)
(non_nested_scores_line,) = plt.plot(non_nested_scores, color="r")
(nested_line,) = plt.plot(nested_scores, color="b")
plt.ylabel("score", fontsize="14")
plt.legend(
    [non_nested_scores_line, nested_line],
    ["Non-Nested CV", "Nested CV"],
    bbox_to_anchor=(0, 0.4, 0.5, 0),
)
plt.title(
    "Non-Nested and Nested Cross Validation on Iris Dataset",
    x=0.5,
    y=1.1,
    fontsize="15",
)

# Plot bar chart of the difference.
plt.subplot(212)
difference_plot = plt.bar(range(NUM_TRIALS), score_difference)
plt.xlabel("Individual Trial #")
plt.legend(
    [difference_plot],
    ["Non-Nested CV - Nested CV Score"],
    bbox_to_anchor=(0, 1, 0.8, 0),
)
plt.ylabel("score difference", fontsize="14")

plt.show()

Average difference of 0.007581 with std. dev. of 0.007833.
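Because the nested score evaluates the search procedure as a whole, each outer fold may end up selecting different hyperparameters. As a small supplementary sketch (not part of the original example), `cross_validate` with `return_estimator=True` keeps the fitted search object from every outer fold so the selected settings can be inspected. The single fixed `random_state=0` is an arbitrary choice for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, KFold, cross_validate
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
p_grid = {"C": [1, 10, 100], "gamma": [0.01, 0.1]}

inner_cv = KFold(n_splits=4, shuffle=True, random_state=0)
outer_cv = KFold(n_splits=4, shuffle=True, random_state=0)

search = GridSearchCV(estimator=SVC(kernel="rbf"), param_grid=p_grid, cv=inner_cv)

# return_estimator=True keeps the GridSearchCV fitted on each outer
# training split, alongside the score on the matching test fold.
results = cross_validate(search, X, y, cv=outer_cv, return_estimator=True)

for fold, (fitted_search, score) in enumerate(
    zip(results["estimator"], results["test_score"])
):
    print(
        f"outer fold {fold}: best_params={fitted_search.best_params_}, "
        f"test score={score:.4f}"
    )
```

If the selected settings vary a lot between folds, that is a hint that the search is sensitive to the particular training split, which is the kind of model instability the discussion above links to a larger optimistic bias.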

Total running time of the script: (0 minutes 3.999 seconds)
