Column Transformer with Mixed Types

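This example illustrates how to apply different preprocessing and feature extraction pipelines to different subsets of features, using ColumnTransformer. This is particularly handy for datasets that contain heterogeneous data types, since we may want to scale the numeric features and one-hot encode the categorical ones.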

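In this example, the numeric data is standard-scaled after median imputation. The categorical data is one-hot encoded via OneHotEncoder, which creates a new category for missing values. We further reduce the dimensionality by selecting categories using a chi-squared test.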

In addition, we show two different ways to dispatch the columns to a particular preprocessor: by column names and by column data types.

Finally, the preprocessing pipeline is integrated into a full prediction pipeline using Pipeline, together with a simple classification model.

# Authors: The scikit-learn developers
# SPDX-License-Identifier: BSD-3-Clause

import numpy as np

from sklearn.compose import ColumnTransformer
from sklearn.datasets import fetch_openml
from sklearn.feature_selection import SelectPercentile, chi2
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

np.random.seed(0)

Load data from https://www.openml.org/d/40945

X, y = fetch_openml("titanic", version=1, as_frame=True, return_X_y=True)

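# Alternatively, X and y can be obtained from the frame attribute of the
# Bunch that fetch_openml returns when return_X_y is left at its default:
# titanic = fetch_openml("titanic", version=1, as_frame=True)
# X = titanic.frame.drop("survived", axis=1)
# y = titanic.frame["survived"]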

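Use ColumnTransformer by selecting columns by name

We will train our classifier with the following features:

Numeric features:

age: float;
fare: float.

Categorical features:

embarked: categories encoded as strings {'C', 'S', 'Q'};
sex: categories encoded as strings {'female', 'male'};
pclass: ordinal integers {1, 2, 3}.

We create the preprocessing pipelines for both numeric and categorical data. Note that pclass could be treated as either a categorical or a numeric feature.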

numeric_features = ["age", "fare"]
numeric_transformer = Pipeline(
    steps=[("imputer", SimpleImputer(strategy="median")), ("scaler", StandardScaler())]
)

categorical_features = ["embarked", "sex", "pclass"]
categorical_transformer = Pipeline(
    steps=[
        ("encoder", OneHotEncoder(handle_unknown="ignore")),
        ("selector", SelectPercentile(chi2, percentile=50)),
    ]
)
preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features),
    ]
)

Append the classifier to the preprocessing pipeline. We now have a full prediction pipeline.

clf = Pipeline(
    steps=[("preprocessor", preprocessor), ("classifier", LogisticRegression())]
)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

clf.fit(X_train, y_train)
print("model score: %.3f" % clf.score(X_test, y_test))

model score: 0.798

HTML representation of Pipeline (display diagram)

When the Pipeline is printed out in a Jupyter notebook, an HTML representation of the estimator is displayed:

clf

Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('num',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer(strategy='median')),
                                                                  ('scaler',
                                                                   StandardScaler())]),
                                                  ['age', 'fare']),
                                                 ('cat',
                                                  Pipeline(steps=[('encoder',
                                                                   OneHotEncoder(handle_unknown='ignore')),
                                                                  ('selector',
                                                                   SelectPercentile(percentile=50,
                                                                                    score_func=<function chi2 at 0x7fad23c2dc60>))]),
                                                  ['embarked', 'sex',
                                                   'pclass'])])),
                ('classifier', LogisticRegression())])
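Outside a notebook, only the plain-text representation above is shown. As a hedged sketch, the same diagram can be obtained as an HTML string with sklearn.utils.estimator_html_repr; the small pipeline below (demo_pipe) is a stand-in invented for illustration, not the clf object of this example.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.utils import estimator_html_repr

# Stand-in pipeline; any estimator can be passed to estimator_html_repr.
demo_pipe = Pipeline(
    steps=[("scaler", StandardScaler()), ("classifier", LogisticRegression())]
)

# Returns the diagram as a string of HTML that could be written to a file
# and opened in a browser.
html_text = estimator_html_repr(demo_pipe)
print(len(html_text))
```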

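Use ColumnTransformer by selecting columns by data type

When dealing with a cleaned dataset, the preprocessing can be automated by using the data types of the columns to decide whether to treat a column as a numeric or a categorical feature. sklearn.compose.make_column_selector provides this possibility. First, we select only a subset of columns to simplify the example.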

subset_feature = ["embarked", "sex", "pclass", "age", "fare"]
X_train, X_test = X_train[subset_feature], X_test[subset_feature]

Then, we inspect the data type of each column.

X_train.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1047 entries, 1118 to 684
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   embarked  1045 non-null   category
 1   sex       1047 non-null   category
 2   pclass    1047 non-null   int64
 3   age       841 non-null    float64
 4   fare      1046 non-null   float64
dtypes: category(2), float64(2), int64(1)
memory usage: 35.0 KB

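We can observe that the embarked and sex columns were tagged as category columns when the data was loaded with fetch_openml. Therefore, we can use this information to dispatch the categorical columns to the categorical_transformer and the remaining columns to the numeric_transformer.

Note

In practice, you will have to handle the column data types yourself. If you want some columns to be treated as category, you will have to convert them into categorical columns. If you are using pandas, you can refer to its documentation on categorical data.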
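As a minimal sketch of that conversion, assuming a small hypothetical frame (raw) rather than the Titanic data, the columns can be cast with DataFrame.astype and the selectors then pick them up by dtype:

```python
import pandas as pd
from sklearn.compose import make_column_selector

# Hypothetical frame: 'sex' holds strings and 'pclass' holds integers.
raw = pd.DataFrame(
    {
        "sex": ["male", "female", "female"],
        "pclass": [3, 1, 2],
        "age": [22.0, 38.0, 26.0],
    }
)

# Explicitly mark the columns that should be treated as categorical.
typed = raw.astype({"sex": "category", "pclass": "category"})

cat_cols = make_column_selector(dtype_include="category")(typed)
num_cols = make_column_selector(dtype_exclude="category")(typed)
print(cat_cols, num_cols)
```

Casting pclass this way is one option for making a dtype-based selector treat it as categorical; whether that is desirable depends on the modelling choice discussed earlier.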

from sklearn.compose import make_column_selector as selector

preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, selector(dtype_exclude="category")),
        ("cat", categorical_transformer, selector(dtype_include="category")),
    ]
)
clf = Pipeline(
    steps=[("preprocessor", preprocessor), ("classifier", LogisticRegression())]
)


clf.fit(X_train, y_train)
print("model score: %.3f" % clf.score(X_test, y_test))
clf

model score: 0.798

Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('num',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer(strategy='median')),
                                                                  ('scaler',
                                                                   StandardScaler())]),
                                                  <sklearn.compose._column_transformer.make_column_selector object at 0x7facefad5090>),
                                                 ('cat',
                                                  Pipeline(steps=[('encoder',
                                                                   OneHotEncoder(handle_unknown='ignore')),
                                                                  ('selector',
                                                                   SelectPercentile(percentile=50,
                                                                                    score_func=<function chi2 at 0x7fad23c2dc60>))]),
                                                  <sklearn.compose._column_transformer.make_column_selector object at 0x7facefad41c0>)])),
                ('classifier', LogisticRegression())])

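The dtype-based selector treats the pclass column as a numeric feature instead of a categorical one as before, so this pipeline is not identical to the previous one and its score can differ, although here both round to 0.798.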

selector(dtype_exclude="category")(X_train)

['pclass', 'age', 'fare']

selector(dtype_include="category")(X_train)

['embarked', 'sex']

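Using the prediction pipeline in a grid search

A hyperparameter search can also be performed over the preprocessing steps defined in the ColumnTransformer object, together with the classifier's hyperparameters, as part of the Pipeline. Using RandomizedSearchCV, we will search over the imputation strategy of the numeric preprocessing, the percentile kept by the categorical feature selector, and the regularization parameter of the logistic regression. This search randomly samples a fixed number of parameter settings, configured by n_iter. Alternatively, one can use GridSearchCV, but it evaluates the full Cartesian product of the parameter space.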
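The parameter names used in such a search are built by joining the step names with double underscores, from the outermost Pipeline down to the parameter itself. A minimal sketch of that addressing scheme, using a reduced stand-in pipeline (sketch_clf, an illustrative name) that is never fitted:

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

sketch_numeric = Pipeline(
    steps=[("imputer", SimpleImputer(strategy="median")), ("scaler", StandardScaler())]
)
sketch_clf = Pipeline(
    steps=[
        ("preprocessor", ColumnTransformer([("num", sketch_numeric, ["age", "fare"])])),
        ("classifier", LogisticRegression()),
    ]
)

# 'preprocessor' -> 'num' -> 'imputer' -> 'strategy'
sketch_clf.set_params(preprocessor__num__imputer__strategy="mean", classifier__C=10)

strategy = sketch_clf.get_params()["preprocessor__num__imputer__strategy"]
print(strategy, sketch_clf.named_steps["classifier"].C)
```

Calling get_params() on the full pipeline lists every name that can appear as a key in the search space.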

param_grid = {
    "preprocessor__num__imputer__strategy": ["mean", "median"],
    "preprocessor__cat__selector__percentile": [10, 30, 50, 70],
    "classifier__C": [0.1, 1.0, 10, 100],
}

search_cv = RandomizedSearchCV(clf, param_grid, n_iter=10, random_state=0)
search_cv
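To put n_iter=10 in perspective, the size of the full grid can be counted with ParameterGrid, and the random draws can be previewed with ParameterSampler. This is only a sketch: sketch_grid repeats the search space above under a separate name, and the settings it samples are not claimed to match those drawn inside search_cv.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

sketch_grid = {
    "preprocessor__num__imputer__strategy": ["mean", "median"],
    "preprocessor__cat__selector__percentile": [10, 30, 50, 70],
    "classifier__C": [0.1, 1.0, 10, 100],
}

# Cartesian product that GridSearchCV would evaluate: 2 * 4 * 4 settings.
n_grid = len(ParameterGrid(sketch_grid))

# A fixed number of settings sampled from the same space.
sampled = list(ParameterSampler(sketch_grid, n_iter=10, random_state=0))
print(n_grid, len(sampled))
```

With the default cross-validation, each candidate is fitted once per fold, so evaluating 10 candidates instead of 32 reduces the number of fits by roughly a factor of three.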

Calling fit triggers the cross-validated search for the best hyperparameter combination:

search_cv.fit(X_train, y_train)

print("Best params:")
print(search_cv.best_params_)

Best params:
{'preprocessor__num__imputer__strategy': 'mean', 'preprocessor__cat__selector__percentile': 30, 'classifier__C': 100}

The internal cross-validation score obtained with those parameters is:

print(f"Internal CV score: {search_cv.best_score_:.3f}")

Internal CV score: 0.786

We can also inspect the top search results as a pandas DataFrame:

import pandas as pd

cv_results = pd.DataFrame(search_cv.cv_results_)
cv_results = cv_results.sort_values("mean_test_score", ascending=False)
cv_results[
    [
        "mean_test_score",
        "std_test_score",
        "param_preprocessor__num__imputer__strategy",
        "param_preprocessor__cat__selector__percentile",
        "param_classifier__C",
    ]
].head(5)

	mean_test_score	std_test_score	param_preprocessor__num__imputer__strategy	param_preprocessor__cat__selector__percentile	param_classifier__C
7	0.786015	0.031020	mean	30	100.0
0	0.785063	0.030498	median	30	1.0
4	0.785063	0.030498	mean	10	10.0
2	0.785063	0.030498	mean	30	1.0
3	0.783149	0.030462	mean	30	0.1

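The best hyperparameters have been used to refit a final model on the full training set. We can evaluate that final model on held-out test data that was not used for hyperparameter tuning.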

print(
    "accuracy of the best model from randomized search: "
    f"{search_cv.score(X_test, y_test):.3f}"
)

accuracy of the best model from randomized search: 0.798
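The refitting step can be checked on synthetic data. In this sketch (all names and the data are invented for illustration), the score reported by the search object matches the score of its best_estimator_, which is the model refitted on the whole training split with the best parameters.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic stand-in for a train/test split.
syn_X, syn_y = make_classification(n_samples=200, n_features=5, random_state=0)
syn_X_train, syn_X_test, syn_y_train, syn_y_test = train_test_split(
    syn_X, syn_y, test_size=0.2, random_state=0
)

syn_search = RandomizedSearchCV(
    LogisticRegression(), {"C": [0.1, 1.0, 10, 100]}, n_iter=3, random_state=0
)
syn_search.fit(syn_X_train, syn_y_train)

# With refit=True (the default), scoring the search object delegates to
# the refitted best estimator.
via_search = syn_search.score(syn_X_test, syn_y_test)
via_best = syn_search.best_estimator_.score(syn_X_test, syn_y_test)
print(via_search, via_best)
```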
