scikit

2024-03-09 06:43

Introduction

This post shows how to use scikit-learn to transform data so that its distribution looks closer to a normal distribution.

PowerTransformer

PowerTransformer supports two transformations: Yeo-Johnson and Box-Cox.


Yeo-Johnson

    pt = PowerTransformer(method='yeo-johnson')
    pt.fit(df[cols].values)
    df[cols] = pt.transform(df[cols].values)

Box-Cox

Note that Box-Cox only works on strictly positive data; it cannot handle zero or negative values.
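The restriction above can be checked directly. The following is a minimal sketch on a made-up toy column (the array `X` is an assumption for illustration, not data from this post): fitting Box-Cox on it should raise a `ValueError`, while Yeo-Johnson accepts the same input.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

# Hypothetical toy column containing a negative value and a zero.
X = np.array([[-1.0], [0.0], [1.0], [2.0], [5.0]])

# Box-Cox rejects non-positive input when fitting.
try:
    PowerTransformer(method='box-cox').fit(X)
    box_cox_ok = True
except ValueError as err:
    box_cox_ok = False
    print('Box-Cox failed:', err)

# Yeo-Johnson can be fitted on the same data.
X_yj = PowerTransformer(method='yeo-johnson').fit_transform(X)
print(X_yj.ravel())
```

If a column may contain zeros or negative values, Yeo-Johnson is the safer default; otherwise the offending rows have to be filtered out or shifted before Box-Cox is applied.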

    pt = PowerTransformer(method='box-cox')
    pt.fit(df[cols].values)
    df[cols] = pt.transform(df[cols].values)

QuantileTransformer

QuantileTransformer uses quantiles to map the data to a normal distribution.


The number of quantiles can be set with n_quantiles.
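As a rough illustration of what n_quantiles controls, the sketch below fits the transformer on synthetic right-skewed data (an assumption; it is not the Titanic data) and inspects the fitted `n_quantiles_` and `quantiles_` attributes. It also shows that when fewer samples than `n_quantiles` are available, scikit-learn falls back to the number of samples and emits a warning, which is silenced here.

```python
import warnings

import numpy as np
from sklearn.preprocessing import QuantileTransformer

# Synthetic right-skewed data, used only for illustration.
rng = np.random.RandomState(0)
X = rng.lognormal(size=(500, 1))

qt = QuantileTransformer(n_quantiles=100, output_distribution='normal', random_state=42)
X_qt = qt.fit_transform(X)

print(qt.n_quantiles_)      # number of quantiles actually used
print(qt.quantiles_.shape)  # (n_quantiles_, n_features)

# With only 50 samples, 100 quantiles cannot be estimated;
# the transformer uses the number of samples instead.
with warnings.catch_warnings():
    warnings.simplefilter('ignore')
    qt_small = QuantileTransformer(n_quantiles=100, output_distribution='normal',
                                   random_state=42).fit(X[:50])
print(qt_small.n_quantiles_)
```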

    qt = QuantileTransformer(n_quantiles=100, output_distribution='normal', random_state=42)
    qt.fit(df[cols].values)
    df[cols] = qt.transform(df[cols].values)

Trying it out

Let's apply each of these transformations to the Kaggle Titanic data.

Titanic - Machine Learning from Disaster | Kaggle (www.kaggle.com)

Here we transform the Age and Fare columns.

First, load the data. Since this is only an experiment, we simply drop the rows with missing values and the rows whose Fare is not positive.

    from sklearn.preprocessing import PowerTransformer, QuantileTransformer
    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt

    sns.set()

    # Columns to transform (must be defined before they are used below)
    cols = ['Age', 'Fare']

    df = pd.read_csv('../input/titanic/train.csv')
    df = df.dropna(subset=cols)  # drop rows with a missing Age or Fare
    df = df[df['Fare'] > 0]      # Box-Cox requires strictly positive values

Original distribution

The original distributions of the two columns are shown below.

    fig, axes = plt.subplots(1, 2, figsize=(20, 10))
    axes = axes.ravel()
    for col, ax in zip(cols, axes):
        sns.histplot(df[col], ax=ax)
    plt.show()

PowerTransformer(Yeo-Johnson)

Transform the data with PowerTransformer (Yeo-Johnson).

    yj_cols = ['Age_yj', 'Fare_yj']
    pt = PowerTransformer(method='yeo-johnson')
    pt.fit(df[cols].values)
    df[yj_cols] = pt.transform(df[cols].values)

After the transformation, the distributions look like this.

    fig, axes = plt.subplots(1, 2, figsize=(20, 10))
    axes = axes.ravel()
    for col, ax in zip(yj_cols, axes):
        sns.histplot(df[col], ax=ax)
    plt.show()
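Besides plotting, it can help to look at what the transformer actually learned. The sketch below uses synthetic two-column data as a stand-in (an assumption, since the Titanic CSV may not be at hand) and prints the fitted `lambdas_`, one power parameter per column. It also checks that `inverse_transform` recovers the original values up to floating-point error.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

# Synthetic stand-in for two skewed columns such as Age and Fare.
rng = np.random.RandomState(0)
X = rng.lognormal(size=(300, 2))

pt = PowerTransformer(method='yeo-johnson')
X_yj = pt.fit_transform(X)

# One fitted lambda per input column.
print(pt.lambdas_)

# Map the transformed values back to the original scale.
X_back = pt.inverse_transform(X_yj)
print(np.allclose(X, X_back, rtol=1e-4, atol=1e-6))
```

Being able to invert the transformation is useful when, for example, a model is trained on the transformed values and its predictions need to be reported in the original units.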

PowerTransformer(Box-Cox)

Transform the data with PowerTransformer (Box-Cox).

    bc_cols = ['Age_bc', 'Fare_bc']
    pt = PowerTransformer(method='box-cox')
    pt.fit(df[cols].values)
    df[bc_cols] = pt.transform(df[cols].values)

After the transformation, the distributions look like this.

    fig, axes = plt.subplots(1, 2, figsize=(20, 10))
    axes = axes.ravel()
    for col, ax in zip(bc_cols, axes):
        sns.histplot(df[col], ax=ax)
    plt.show()
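One detail that is easy to miss when reading the histograms: PowerTransformer standardizes its output by default (`standardize=True`), so the transformed columns end up with zero mean and unit variance. The sketch below, again on synthetic strictly positive data (an assumption for illustration), compares the default with `standardize=False`.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

# Synthetic strictly positive data, as Box-Cox requires.
rng = np.random.RandomState(0)
X = rng.lognormal(size=(300, 1))

# Default: power transform followed by standardization.
X_std = PowerTransformer(method='box-cox').fit_transform(X)

# Power transform only, without rescaling.
X_raw = PowerTransformer(method='box-cox', standardize=False).fit_transform(X)

print('standardized:', X_std.mean(), X_std.std())
print('raw         :', X_raw.mean(), X_raw.std())
```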

QuantileTransformer

Transform the data with QuantileTransformer.

    qt_cols = ['Age_qt', 'Fare_qt']
    qt = QuantileTransformer(n_quantiles=100, output_distribution='normal', random_state=42)
    qt.fit(df[cols].values)
    df[qt_cols] = qt.transform(df[cols].values)

After the transformation, the distributions look like this.

    fig, axes = plt.subplots(1, 2, figsize=(20, 10))
    axes = axes.ravel()
    for col, ax in zip(qt_cols, axes):
        sns.histplot(df[col], ax=ax)
    plt.show()
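Histograms give a visual impression; a single number such as the sample skewness can complement them. The following is a minimal sketch that compares the three transformers on synthetic log-normal data (an assumption, not the Titanic columns) using `scipy.stats.skew`. A value near zero indicates a roughly symmetric distribution. Skewness is only one rough indicator of normality, chosen here for brevity.

```python
import numpy as np
from scipy.stats import skew
from sklearn.preprocessing import PowerTransformer, QuantileTransformer

# Synthetic right-skewed, strictly positive data for illustration.
rng = np.random.RandomState(0)
X = rng.lognormal(size=(1000, 1))

transformers = {
    'yeo-johnson': PowerTransformer(method='yeo-johnson'),
    'box-cox': PowerTransformer(method='box-cox'),
    'quantile': QuantileTransformer(n_quantiles=100, output_distribution='normal',
                                    random_state=42),
}

# Skewness before and after each transformation.
skews = {'original': float(skew(X.ravel()))}
for name, tf in transformers.items():
    skews[name] = float(skew(tf.fit_transform(X).ravel()))

for name, value in skews.items():
    print(f'{name:12s} skewness = {value:+.3f}')
```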

References

sklearn.preprocessing.PowerTransformer (scikit-learn 1.1.2 documentation)
sklearn.preprocessing.QuantileTransformer (scikit-learn 1.1.2 documentation)
