Release Highlights for scikit-learn 1.2

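We are pleased to announce the release of scikit-learn 1.2! This release includes many bug fixes and improvements, as well as some new key features. We detail below a few of the major features of this release. **For an exhaustive list of all the changes**, please refer to the release notes.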

To install the latest version (with pip):

pip install --upgrade scikit-learn

or with conda:

conda install -c conda-forge scikit-learn

Pandas output with the `set_output` API

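scikit-learn's transformers now support pandas output with the set_output API. To learn more about the set_output API, see the example Introducing the set_output API and this video, Pandas DataFrame output for scikit-learn transformers (some examples).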

import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler, KBinsDiscretizer
from sklearn.compose import ColumnTransformer

X, y = load_iris(as_frame=True, return_X_y=True)
sepal_cols = ["sepal length (cm)", "sepal width (cm)"]
petal_cols = ["petal length (cm)", "petal width (cm)"]

preprocessor = ColumnTransformer(
    [
        ("scaler", StandardScaler(), sepal_cols),
        ("kbin", KBinsDiscretizer(encode="ordinal"), petal_cols),
    ],
    verbose_feature_names_out=False,
).set_output(transform="pandas")

X_out = preprocessor.fit_transform(X)
X_out.sample(n=5, random_state=0)

	sepal length (cm)	sepal width (cm)	petal length (cm)	petal width (cm)
114	-0.052506	-0.592373	3.0	4.0
62	0.189830	-1.973554	2.0	1.0
33	-0.416010	2.630382	0.0	1.0
107	1.765012	-0.362176	4.0	3.0
7	-1.021849	0.788808	1.0	1.0

Interaction constraints in Histogram-based Gradient Boosting Trees

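HistGradientBoostingRegressor and HistGradientBoostingClassifier now support interaction constraints with the interaction_cst parameter. For details, see the User Guide. In the following example, features are not allowed to interact.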

from sklearn.datasets import load_diabetes
from sklearn.ensemble import HistGradientBoostingRegressor

X, y = load_diabetes(return_X_y=True, as_frame=True)

hist_no_interact = HistGradientBoostingRegressor(
    interaction_cst=[[i] for i in range(X.shape[1])], random_state=0
)
hist_no_interact.fit(X, y)

HistGradientBoostingRegressor(interaction_cst=[[0], [1], [2], [3], [4], [5],
                                               [6], [7], [8], [9]],
                              random_state=0)

New and enhanced displays

PredictionErrorDisplay provides a way to analyze regression models in a qualitative manner.

import matplotlib.pyplot as plt
from sklearn.metrics import PredictionErrorDisplay

fig, axs = plt.subplots(nrows=1, ncols=2, figsize=(12, 5))
_ = PredictionErrorDisplay.from_estimator(
    hist_no_interact, X, y, kind="actual_vs_predicted", ax=axs[0]
)
_ = PredictionErrorDisplay.from_estimator(
    hist_no_interact, X, y, kind="residual_vs_predicted", ax=axs[1]
)

LearningCurveDisplay is now available to plot results from learning_curve.

from sklearn.model_selection import LearningCurveDisplay

_ = LearningCurveDisplay.from_estimator(
    hist_no_interact, X, y, cv=5, n_jobs=2, train_sizes=np.linspace(0.1, 1, 5)
)

PartialDependenceDisplay exposes a new parameter categorical_features to display partial dependence for categorical features using bar plots and heatmaps.

from sklearn.datasets import fetch_openml

X, y = fetch_openml(
    "titanic", version=1, as_frame=True, return_X_y=True, parser="pandas"
)
X = X.select_dtypes(["number", "category"]).drop(columns=["body"])

from sklearn.preprocessing import OrdinalEncoder
from sklearn.pipeline import make_pipeline

categorical_features = ["pclass", "sex", "embarked"]
model = make_pipeline(
    ColumnTransformer(
        transformers=[("cat", OrdinalEncoder(), categorical_features)],
        remainder="passthrough",
    ),
    HistGradientBoostingRegressor(random_state=0),
).fit(X, y)

from sklearn.inspection import PartialDependenceDisplay

fig, ax = plt.subplots(figsize=(14, 4), constrained_layout=True)
_ = PartialDependenceDisplay.from_estimator(
    model,
    X,
    features=["age", "sex", ("pclass", "sex")],
    categorical_features=categorical_features,
    ax=ax,
)

Faster parser in `fetch_openml`

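fetch_openml now supports a new "pandas" parser that is more memory and CPU efficient. In v1.4, the default will change to parser="auto", which will automatically use the "pandas" parser for dense data and "liac-arff" for sparse data.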

X, y = fetch_openml(
    "titanic", version=1, as_frame=True, return_X_y=True, parser="pandas"
)
X.head()

	pclass	name	sex	age	sibsp	parch	ticket	fare	cabin	embarked	boat	body	home.dest
0	1	Allen, Miss. Elisabeth Walton	female	29.0000	0	0	24160	211.3375	B5	S	2	NaN	St Louis, MO
1	1	Allison, Master. Hudson Trevor	male	0.9167	1	2	113781	151.5500	C22 C26	S	11	NaN	Montreal, PQ / Chesterville, ON
2	1	Allison, Miss. Helen Loraine	female	2.0000	1	2	113781	151.5500	C22 C26	S	NaN	NaN	Montreal, PQ / Chesterville, ON
3	1	Allison, Mr. Hudson Joshua Creighton	male	30.0000	1	2	113781	151.5500	C22 C26	S	NaN	135.0	Montreal, PQ / Chesterville, ON
4	1	Allison, Mrs. Hudson J C (Bessie Waldo Daniels)	female	25.0000	1	2	113781	151.5500	C22 C26	S	NaN	NaN	Montreal, PQ / Chesterville, ON

Experimental Array API support in `LinearDiscriminantAnalysis`

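Experimental support for the Array API specification was added to LinearDiscriminantAnalysis. The estimator can now run on any Array API compliant library, such as CuPy, a GPU-accelerated array library. For details, see the User Guide.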

Improved efficiency of many estimators

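In version 1.1, the efficiency of many estimators relying on the computation of pairwise distances (essentially estimators related to clustering, manifold learning, and neighbors search algorithms) was greatly improved for float64 dense input. The improvements came mainly from a reduced memory footprint and much better scalability on multi-core machines. In version 1.2, the efficiency of these estimators was further improved for all combinations of dense and sparse inputs on float32 and float64 datasets, except the sparse-dense and dense-sparse combinations for the Euclidean and Squared Euclidean distance metrics. A detailed list of the impacted estimators can be found in the changelog.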
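As an illustration of the kind of input this paragraph refers to, the sketch below runs a brute-force neighbors search on sparse float32 data with the Manhattan metric. The data, estimator, and metric are assumptions chosen for the example; whether the optimized code path is actually taken depends on the scikit-learn version, the algorithm, and the metric, so the changelog remains the reference.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

rng = np.random.RandomState(0)

# Synthetic float32 data in which roughly 70% of the entries are zero.
X_dense = rng.rand(100, 20).astype(np.float32)
X_dense[X_dense < 0.7] = 0.0
X_sparse = csr_matrix(X_dense)

# Brute-force neighbors search relies on pairwise distance computations.
nn = NearestNeighbors(n_neighbors=3, algorithm="brute", metric="manhattan")
nn.fit(X_sparse)

# Query with the first five training rows (sparse-sparse combination).
distances, indices = nn.kneighbors(X_sparse[:5])
print(distances.shape, indices.shape)
```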
