Combine predictors using stacking#

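Stacking is a method for blending estimators. In this strategy, several base estimators are fitted independently on the training data, and a final estimator is then trained on the stacked predictions of those base estimators.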

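In this example, we illustrate the use case in which different regressors are stacked together and a final regularized linear regressor outputs the prediction. We compare the performance of each individual regressor with that of the stacking strategy. Stacking slightly improves the overall performance.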
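To make the idea of "training on stacked predictions" concrete, here is a minimal hand-rolled sketch. It assumes synthetic data from `make_regression` and two arbitrarily chosen base estimators; the variable names (`X_demo`, `base_estimators`, `stacked_features`) are illustrative only and are not part of this example's code. `StackingRegressor` automates these steps and is what the rest of the example relies on.

```python
import numpy as np

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, RidgeCV
from sklearn.model_selection import cross_val_predict

# Synthetic data stands in for a real dataset (assumption for illustration).
X_demo, y_demo = make_regression(
    n_samples=200, n_features=10, noise=10.0, random_state=0
)

# Two arbitrary base estimators (illustrative choice).
base_estimators = [
    RandomForestRegressor(n_estimators=20, random_state=0),
    Lasso(alpha=1.0),
]

# Each base estimator contributes one column of out-of-fold predictions.
stacked_features = np.column_stack(
    [cross_val_predict(est, X_demo, y_demo, cv=5) for est in base_estimators]
)

# The final estimator is trained on the stacked predictions, not on X_demo.
final_estimator = RidgeCV().fit(stacked_features, y_demo)

print(stacked_features.shape)  # (200, 2): one column per base estimator
```

Using out-of-fold predictions keeps the final estimator from being trained on predictions the base estimators made for samples they had already seen.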

# Authors: The scikit-learn developers
# SPDX-License-Identifier: BSD-3-Clause

Download the dataset#

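We will use the Ames Housing dataset, which was first compiled by Dean De Cock and became better known after it was used in a Kaggle challenge. It describes 1460 residential homes in Ames, Iowa, each with 80 features. We will use it to predict the logarithm of the final sale price of each house. In this example, we keep only the 20 most interesting features, chosen with GradientBoostingRegressor(), and limit the number of entries to 600 (we do not go into the details of how the most interesting features were selected).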

The Ames Housing dataset is not shipped with scikit-learn, so we will fetch it from OpenML.

import numpy as np

from sklearn.datasets import fetch_openml
from sklearn.utils import shuffle


def load_ames_housing():
    df = fetch_openml(name="house_prices", as_frame=True)
    X = df.data
    y = df.target

    features = [
        "YrSold",
        "HeatingQC",
        "Street",
        "YearRemodAdd",
        "Heating",
        "MasVnrType",
        "BsmtUnfSF",
        "Foundation",
        "MasVnrArea",
        "MSSubClass",
        "ExterQual",
        "Condition2",
        "GarageCars",
        "GarageType",
        "OverallQual",
        "TotalBsmtSF",
        "BsmtFinSF1",
        "HouseStyle",
        "MiscFeature",
        "MoSold",
    ]

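    # Keep only the selected features, then shuffle and keep the first 600 rows.
    X = X.loc[:, features]
    X, y = shuffle(X, y, random_state=0)

    X = X.iloc[:600]
    y = y.iloc[:600]
    # The target is the logarithm of the sale price.
    return X, np.log(y)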


X, y = load_ames_housing()

Measure and plot the results#

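Now we can use the Ames Housing dataset to make predictions. We check the performance of each individual predictor as well as that of the stack of regressors.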

import time

import matplotlib.pyplot as plt

from sklearn.metrics import PredictionErrorDisplay
from sklearn.model_selection import cross_val_predict, cross_validate

fig, axs = plt.subplots(2, 2, figsize=(9, 7))
axs = np.ravel(axs)

for ax, (name, est) in zip(
    axs, estimators + [("Stacking Regressor", stacking_regressor)]
):
    scorers = {"R2": "r2", "MAE": "neg_mean_absolute_error"}

    start_time = time.time()
    scores = cross_validate(
        est, X, y, scoring=list(scorers.values()), n_jobs=-1, verbose=0
    )
    elapsed_time = time.time() - start_time

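    y_pred = cross_val_predict(est, X, y, n_jobs=-1, verbose=0)
    # Summarize each metric as "mean +- std" across folds. np.abs flips the
    # sign of the negated MAE returned by the "neg_mean_absolute_error" scorer.
    scores = {
        key: (
            f"{np.abs(np.mean(scores[f'test_{value}'])):.2f} +- "
            f"{np.std(scores[f'test_{value}']):.2f}"
        )
        for key, value in scorers.items()
    }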

    display = PredictionErrorDisplay.from_predictions(
        y_true=y,
        y_pred=y_pred,
        kind="actual_vs_predicted",
        ax=ax,
        scatter_kwargs={"alpha": 0.2, "color": "tab:blue"},
        line_kwargs={"color": "tab:red"},
    )
    ax.set_title(f"{name}\nEvaluation in {elapsed_time:.2f} seconds")

    # Use a distinct loop variable so the estimator `name` is not shadowed.
    for metric_name, score in scores.items():
        ax.plot([], [], " ", label=f"{metric_name}: {score}")
    ax.legend(loc="upper left")

plt.suptitle("Single predictors versus stacked predictors")
plt.tight_layout()
plt.subplots_adjust(top=0.9)
plt.show()
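The score formatting in the loop above depends on how `cross_validate` names and signs its results. The short sketch below, which assumes synthetic data and a plain `Ridge` model chosen only for illustration, shows that each scorer string yields a `test_<scorer>` key and that the `neg_mean_absolute_error` values are never positive, which is why the loop takes an absolute value before printing.

```python
import numpy as np

from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_validate

# Synthetic data and a simple model (assumptions for illustration).
X_demo, y_demo = make_regression(
    n_samples=100, n_features=5, noise=5.0, random_state=0
)

cv_results = cross_validate(
    Ridge(), X_demo, y_demo, scoring=["r2", "neg_mean_absolute_error"]
)

# Keys: fit_time, score_time, test_neg_mean_absolute_error, test_r2
print(sorted(cv_results))

# Negate the scorer output to recover the usual (non-negative) MAE per fold.
mae_per_fold = -cv_results["test_neg_mean_absolute_error"]
print(f"MAE: {mae_per_fold.mean():.2f} +- {mae_per_fold.std():.2f}")
```

Note that applying `np.abs` to the mean R2 as well would hide a negative R2; that does not matter as long as every model scores above zero.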

[Figure: Single predictors versus stacked predictors. Panels: Random Forest (evaluation in 1.06 seconds), Lasso (evaluation in 0.24 seconds), Gradient Boosting (evaluation in 0.46 seconds), Stacking Regressor (evaluation in 8.94 seconds).]

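The stacked regressor combines the strengths of the different regressors. However, we also see that training the stacked regressor is much more computationally expensive.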
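One way to see where the extra cost comes from is to count how often the base estimators are fitted. The sketch below uses a hypothetical toy class, `CountingRegressor`, that only predicts the training mean and records each call to `fit`; it is not part of scikit-learn or of this example. With `cv=5`, each base estimator is fitted once on the full data and once per fold to produce the cross-validated predictions that feed the final estimator.

```python
import numpy as np

from sklearn.base import BaseEstimator, RegressorMixin
from sklearn.datasets import make_regression
from sklearn.ensemble import StackingRegressor
from sklearn.linear_model import RidgeCV

fit_calls = []  # one entry per call to CountingRegressor.fit


class CountingRegressor(RegressorMixin, BaseEstimator):
    """Hypothetical toy regressor: predicts the training mean, logs each fit."""

    def __init__(self, label="base"):
        self.label = label

    def fit(self, X, y):
        fit_calls.append(self.label)
        self.mean_ = float(np.mean(y))
        return self

    def predict(self, X):
        return np.full(len(X), self.mean_)


# Synthetic data (assumption for illustration).
X_demo, y_demo = make_regression(n_samples=100, n_features=5, random_state=0)

stack = StackingRegressor(
    estimators=[("a", CountingRegressor("a")), ("b", CountingRegressor("b"))],
    final_estimator=RidgeCV(),
    cv=5,
)
stack.fit(X_demo, y_demo)

# Each base estimator: 1 fit on all data + 5 cross-validation fits = 6 fits.
print(len(fit_calls))  # 12
```

When the stacking regressor is itself evaluated with cross-validation, as in the loop above, this cost is multiplied again by the number of outer folds.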

Total running time of the script: (0 minutes 21.747 seconds)
