注意

转到结尾下载完整的示例代码。或通过JupyterLite或Binder在浏览器中运行此示例

scikit-learn 0.22 的发行亮点#

我们高兴地宣布发布scikit-learn 0.22，其中包含许多错误修复和新功能！我们在下面详细介绍了此版本的一些主要功能。有关所有更改的详尽列表，请参阅发行说明。

要安装最新版本（使用pip）

pip install --upgrade scikit-learn

或使用conda

conda install -c conda-forge scikit-learn

# Authors: The scikit-learn developers
# SPDX-License-Identifier: BSD-3-Clause

新的绘图API#

一个新的绘图API可用于创建可视化。这个新的API允许快速调整绘图的可视效果，而无需任何重新计算。也可以将不同的绘图添加到同一个图形中。以下示例说明了plot_roc_curve，但也支持其他绘图工具，例如plot_partial_dependence、plot_precision_recall_curve和plot_confusion_matrix。在用户指南中阅读有关此新API的更多信息。

import matplotlib
import matplotlib.pyplot as plt

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# from sklearn.metrics import plot_roc_curve
from sklearn.metrics import RocCurveDisplay
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.utils.fixes import parse_version

X, y = make_classification(random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

svc = SVC(random_state=42)
svc.fit(X_train, y_train)
rfc = RandomForestClassifier(random_state=42)
rfc.fit(X_train, y_train)

# plot_roc_curve has been removed in version 1.2. From 1.2, use RocCurveDisplay instead.
# svc_disp = plot_roc_curve(svc, X_test, y_test)
# rfc_disp = plot_roc_curve(rfc, X_test, y_test, ax=svc_disp.ax_)
svc_disp = RocCurveDisplay.from_estimator(svc, X_test, y_test)
rfc_disp = RocCurveDisplay.from_estimator(rfc, X_test, y_test, ax=svc_disp.ax_)
rfc_disp.figure_.suptitle("ROC curve comparison")

plt.show()

堆叠分类器和回归器#

StackingClassifier和StackingRegressor允许您拥有一个带有最终分类器或回归器的估计器堆栈。堆叠泛化包括堆叠单个估计器的输出，并使用分类器来计算最终预测。堆叠允许通过使用其输出作为最终估计器的输入来利用每个单个估计器的优势。基础估计器在完整的X上拟合，而最终估计器使用cross_val_predict训练使用基础估计器的交叉验证预测。

在用户指南中阅读更多信息。

from sklearn.datasets import load_iris
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)
estimators = [
    ("rf", RandomForestClassifier(n_estimators=10, random_state=42)),
    ("svr", make_pipeline(StandardScaler(), LinearSVC(dual="auto", random_state=42))),
]
clf = StackingClassifier(estimators=estimators, final_estimator=LogisticRegression())
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)
clf.fit(X_train, y_train).score(X_test, y_test)

0.9473684210526315

基于排列的特征重要性#

可以使用inspection.permutation_importance来获取任何已拟合估计器的每个特征重要性的估计值。

import matplotlib.pyplot as plt
import numpy as np

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

X, y = make_classification(random_state=0, n_features=5, n_informative=3)
feature_names = np.array([f"x_{i}" for i in range(X.shape[1])])

rf = RandomForestClassifier(random_state=0).fit(X, y)
result = permutation_importance(rf, X, y, n_repeats=10, random_state=0, n_jobs=2)

fig, ax = plt.subplots()
sorted_idx = result.importances_mean.argsort()

# `labels` argument in boxplot is deprecated in matplotlib 3.9 and has been
# renamed to `tick_labels`. The following code handles this, but as a
# scikit-learn user you probably can write simpler code by using `labels=...`
# (matplotlib < 3.9) or `tick_labels=...` (matplotlib >= 3.9).
tick_labels_parameter_name = (
    "tick_labels"
    if parse_version(matplotlib.__version__) >= parse_version("3.9")
    else "labels"
)
tick_labels_dict = {tick_labels_parameter_name: feature_names[sorted_idx]}
ax.boxplot(result.importances[sorted_idx].T, vert=False, **tick_labels_dict)
ax.set_title("Permutation Importance of each feature")
ax.set_ylabel("Features")
fig.tight_layout()
plt.show()

梯度提升对缺失值的原生支持#

ensemble.HistGradientBoostingClassifier和ensemble.HistGradientBoostingRegressor现在对缺失值（NaN）具有原生支持。这意味着在训练或预测时无需估算数据。

from sklearn.ensemble import HistGradientBoostingClassifier

X = np.array([0, 1, 2, np.nan]).reshape(-1, 1)
y = [0, 0, 1, 1]

gbdt = HistGradientBoostingClassifier(min_samples_leaf=1).fit(X, y)
print(gbdt.predict(X))

[0 0 1 1]

预计算的稀疏最近邻图#

大多数基于最近邻图的估计器现在接受预计算的稀疏图作为输入，以便对多个估计器拟合重用相同的图。要在管道中使用此功能，可以使用memory参数，以及两个新的转换器之一：neighbors.KNeighborsTransformer和neighbors.RadiusNeighborsTransformer。预计算也可以由自定义估计器执行，以使用替代实现，例如近似最近邻方法。在用户指南中查看更多详细信息。

from tempfile import TemporaryDirectory

from sklearn.manifold import Isomap
from sklearn.neighbors import KNeighborsTransformer
from sklearn.pipeline import make_pipeline

X, y = make_classification(random_state=0)

with TemporaryDirectory(prefix="sklearn_cache_") as tmpdir:
    estimator = make_pipeline(
        KNeighborsTransformer(n_neighbors=10, mode="distance"),
        Isomap(n_neighbors=10, metric="precomputed"),
        memory=tmpdir,
    )
    estimator.fit(X)

    # We can decrease the number of neighbors and the graph will not be
    # recomputed.
    estimator.set_params(isomap__n_neighbors=5)
    estimator.fit(X)

基于KNN的插补#

我们现在支持使用k-最近邻来完成缺失值的插补。

每个样本的缺失值使用训练集中找到的n_neighbors个最近邻的平均值来插补。如果两个样本都不是缺失的特征接近，则这两个样本接近。默认情况下，使用支持缺失值的欧几里得距离度量nan_euclidean_distances来查找最近邻。

在用户指南中阅读更多信息。

from sklearn.impute import KNNImputer

X = [[1, 2, np.nan], [3, 4, 3], [np.nan, 6, 5], [8, 8, 7]]
imputer = KNNImputer(n_neighbors=2)
print(imputer.fit_transform(X))

[[1.  2.  4. ]
 [3.  4.  3. ]
 [5.5 6.  5. ]
 [8.  8.  7. ]]

树剪枝#

现在可以在构建树后修剪大多数基于树的估计器。修剪基于最小成本复杂度。在用户指南中阅读更多详细信息。

X, y = make_classification(random_state=0)

rf = RandomForestClassifier(random_state=0, ccp_alpha=0).fit(X, y)
print(
    "Average number of nodes without pruning {:.1f}".format(
        np.mean([e.tree_.node_count for e in rf.estimators_])
    )
)

rf = RandomForestClassifier(random_state=0, ccp_alpha=0.05).fit(X, y)
print(
    "Average number of nodes with pruning {:.1f}".format(
        np.mean([e.tree_.node_count for e in rf.estimators_])
    )
)

Average number of nodes without pruning 22.3
Average number of nodes with pruning 6.4

从OpenML检索数据框#

datasets.fetch_openml现在可以返回pandas数据框，从而正确处理具有异构数据的dataset。

from sklearn.datasets import fetch_openml

titanic = fetch_openml("titanic", version=1, as_frame=True, parser="pandas")
print(titanic.data.head()[["pclass", "embarked"]])

   pclass embarked
     1        S
     1        S
     1        S
     1        S
     1        S

检查估计器的scikit-learn兼容性#

开发人员可以使用check_estimator检查其scikit-learn兼容估计器的兼容性。例如，check_estimator(LinearSVC())通过。

我们现在提供了一个pytest专用装饰器，它允许pytest独立运行所有检查并报告失败的检查。

..注意：: 此条目在0.24版中进行了略微更新，其中不再支持传递类：改为传递实例。

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.utils.estimator_checks import parametrize_with_checks


@parametrize_with_checks([LogisticRegression(), DecisionTreeRegressor()])
def test_sklearn_compatible_estimator(estimator, check):
    check(estimator)

ROC AUC现在支持多类分类#

roc_auc_score 函数也可用于多类别分类。目前支持两种平均策略：一对一算法计算成对ROC AUC分数的平均值，一对多算法计算每个类别相对于所有其他类别的ROC AUC分数的平均值。在这两种情况下，多类别ROC AUC分数都是根据模型计算出的样本属于特定类别的概率估计值计算得出的。OvO和OvR算法支持统一加权（average='macro'）和按流行度加权（average='weighted'）。

阅读更多内容，请参见用户指南。

from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.svm import SVC

X, y = make_classification(n_classes=4, n_informative=16)
clf = SVC(decision_function_shape="ovo", probability=True).fit(X, y)
print(roc_auc_score(y, clf.predict_proba(X), multi_class="ovo"))

0.9914160256410255

脚本总运行时间：（0分钟1.423秒）