Release Highlights for scikit-learn 0.23

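We are pleased to announce the release of scikit-learn 0.23! It includes many bug fixes and improvements, as well as some new key features. A few of the major features of this release are detailed below. For an exhaustive list of all the changes, please refer to the release notes.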

To install the latest version with pip:

pip install --upgrade scikit-learn

or with conda:

conda install -c conda-forge scikit-learn
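
After upgrading, it can be worth confirming which version the interpreter actually picks up. A minimal check, assuming scikit-learn is importable in the active environment:

```python
import sklearn

# Version string of the installed package; it should read 0.23 or later.
print(sklearn.__version__)

# Print the versions of Python, scikit-learn and its main dependencies.
sklearn.show_versions()
```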

Generalized Linear Models, and Poisson loss for gradient boosting

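Generalized Linear Models with non-normal loss functions are now available. In particular, three new regressors were implemented: PoissonRegressor, GammaRegressor, and TweedieRegressor. The Poisson regressor can be used to model positive integer counts or relative frequencies. Read more in the User Guide. Additionally, HistGradientBoostingRegressor supports a new "poisson" loss as well.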

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import PoissonRegressor
from sklearn.ensemble import HistGradientBoostingRegressor

n_samples, n_features = 1000, 20
rng = np.random.RandomState(0)
X = rng.randn(n_samples, n_features)
# positive integer target correlated with X[:, 5] with many zeros:
y = rng.poisson(lam=np.exp(X[:, 5]) / 2)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=rng)
glm = PoissonRegressor()
gbdt = HistGradientBoostingRegressor(loss="poisson", learning_rate=0.01)
glm.fit(X_train, y_train)
gbdt.fit(X_train, y_train)
print(glm.score(X_test, y_test))
print(gbdt.score(X_test, y_test))
0.3577618906572577
0.42425183539869404
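
The other two regressors follow the same estimator API. The sketch below is not part of the original example: it fits a GammaRegressor on a synthetic, strictly positive target (the data-generating process and the alpha value are assumptions chosen for illustration) and checks that the predictions stay positive, as expected with a log link.

```python
import numpy as np
from sklearn.linear_model import GammaRegressor
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X = rng.randn(1000, 5)
# Strictly positive, right-skewed target whose mean depends on X[:, 0]
# (synthetic data, made up for this illustration).
mean = np.exp(0.5 * X[:, 0])
y = rng.gamma(shape=2.0, scale=mean / 2.0)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
gamma_glm = GammaRegressor(alpha=0.1).fit(X_train, y_train)

pred = gamma_glm.predict(X_test)
print(pred.min() > 0)  # the log link keeps predictions positive
print(gamma_glm.score(X_test, y_test))  # D^2, fraction of deviance explained
```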

Rich visual representation of estimators

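Estimators can now be visualized in notebooks by enabling the display='diagram' option. This is particularly useful for summarizing the structure of pipelines and other composite estimators, and the diagram is interactive: clicking a pipeline element in the rendered output expands it to show its details. See Visualizing Composite Estimators for how to use this feature.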

from sklearn import set_config
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LogisticRegression

set_config(display="diagram")

num_proc = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())

cat_proc = make_pipeline(
    SimpleImputer(strategy="constant", fill_value="missing"),
    OneHotEncoder(handle_unknown="ignore"),
)

preprocessor = make_column_transformer(
    (num_proc, ("feat1", "feat3")), (cat_proc, ("feat0", "feat2"))
)

clf = make_pipeline(preprocessor, LogisticRegression())
clf
Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('pipeline-1',
                                                  Pipeline(steps=[('simpleimputer',
                                                                   SimpleImputer(strategy='median')),
                                                                  ('standardscaler',
                                                                   StandardScaler())]),
                                                  ('feat1', 'feat3')),
                                                 ('pipeline-2',
                                                  Pipeline(steps=[('simpleimputer',
                                                                   SimpleImputer(fill_value='missing',
                                                                                 strategy='constant')),
                                                                  ('onehotencoder',
                                                                   OneHotEncoder(handle_unknown='ignore'))]),
                                                  ('feat0', 'feat2'))])),
                ('logisticregression', LogisticRegression())])
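
Outside a notebook, the same diagram can be obtained as an HTML string. A minimal sketch, assuming the estimator_html_repr helper from sklearn.utils is available in the installed version; the small pipeline used here is an arbitrary placeholder:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.utils import estimator_html_repr

pipe = make_pipeline(StandardScaler(), LogisticRegression())

# Build the HTML diagram as a string; the estimator need not be fitted.
html = estimator_html_repr(pipe)
print(len(html))
```

The resulting string can be written to a file and opened in a browser, or embedded in a report.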


Scalability and stability improvements to KMeans

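The KMeans estimator was entirely reworked, and it is now significantly faster and more stable. In addition, the Elkan algorithm is now compatible with sparse matrices. The estimator uses OpenMP-based parallelism instead of relying on joblib, so the n_jobs parameter no longer has any effect. For more details on how to control the number of threads, please refer to our Parallelism notes.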

import scipy.sparse
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import completeness_score

rng = np.random.RandomState(0)
X, y = make_blobs(random_state=rng)
X = scipy.sparse.csr_matrix(X)
X_train, X_test, _, y_test = train_test_split(X, y, random_state=rng)
kmeans = KMeans(n_init="auto").fit(X_train)
print(completeness_score(kmeans.predict(X_test), y_test))
0.740575686247884
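
Because n_jobs no longer applies, the number of OpenMP threads has to be limited by other means. One option is sketched below; it assumes the threadpoolctl package (a scikit-learn dependency) is installed, and the cluster count and thread limit are arbitrary. Setting the OMP_NUM_THREADS environment variable before starting Python is another option.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from threadpoolctl import threadpool_limits

X, _ = make_blobs(n_samples=1000, centers=3, random_state=0)

# Restrict OpenMP-based code to a single thread inside this block only.
with threadpool_limits(limits=1, user_api="openmp"):
    kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

print(kmeans.cluster_centers_.shape)  # (3, 2): three centers in 2-D
```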

Improvements to the histogram-based Gradient Boosting estimators

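Various improvements were made to HistGradientBoostingClassifier and HistGradientBoostingRegressor. On top of the Poisson loss mentioned above, these estimators now support sample weights. An automatic early-stopping criterion was also added: early stopping is enabled by default when the number of samples exceeds 10k. Finally, users can now define monotonic constraints to constrain the predictions based on the variations of specific features. In the following example, we construct a target that is generally positively correlated with the first feature, with some noise. Applying monotonic constraints allows the prediction to capture the global effect of the first feature instead of fitting the noise. For a use case example, see Features in Histogram Gradient Boosting Trees.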

import numpy as np
from matplotlib import pyplot as plt
from sklearn.model_selection import train_test_split

# from sklearn.inspection import plot_partial_dependence
from sklearn.inspection import PartialDependenceDisplay
from sklearn.ensemble import HistGradientBoostingRegressor

n_samples = 500
rng = np.random.RandomState(0)
X = rng.randn(n_samples, 2)
noise = rng.normal(loc=0.0, scale=0.01, size=n_samples)
y = 5 * X[:, 0] + np.sin(10 * np.pi * X[:, 0]) - noise

gbdt_no_cst = HistGradientBoostingRegressor().fit(X, y)
gbdt_cst = HistGradientBoostingRegressor(monotonic_cst=[1, 0]).fit(X, y)

# plot_partial_dependence has been removed in version 1.2. From 1.2, use
# PartialDependenceDisplay instead.
# disp = plot_partial_dependence(
disp = PartialDependenceDisplay.from_estimator(
    gbdt_no_cst,
    X,
    features=[0],
    feature_names=["feature 0"],
    line_kw={"linewidth": 4, "label": "unconstrained", "color": "tab:blue"},
)
# plot_partial_dependence(
PartialDependenceDisplay.from_estimator(
    gbdt_cst,
    X,
    features=[0],
    line_kw={"linewidth": 4, "label": "constrained", "color": "tab:orange"},
    ax=disp.axes_,
)
disp.axes_[0, 0].plot(
    X[:, 0], y, "o", alpha=0.5, zorder=-1, label="samples", color="tab:green"
)
disp.axes_[0, 0].set_ylim(-3, 3)
disp.axes_[0, 0].set_xlim(-1, 1)
plt.legend()
plt.show()

Sample-weight support for Lasso and ElasticNet

The two linear regressors Lasso and ElasticNet now support sample weights.

from sklearn.model_selection import train_test_split
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
import numpy as np

n_samples, n_features = 1000, 20
rng = np.random.RandomState(0)
X, y = make_regression(n_samples, n_features, random_state=rng)
sample_weight = rng.rand(n_samples)
X_train, X_test, y_train, y_test, sw_train, sw_test = train_test_split(
    X, y, sample_weight, random_state=rng
)
reg = Lasso()
reg.fit(X_train, y_train, sample_weight=sw_train)
print(reg.score(X_test, y_test, sw_test))
0.999791942438998
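
ElasticNet accepts sample weights in the same way. As an illustration of what they can be used for, the sketch below down-weights a set of deliberately corrupted targets; the data, the corruption, and the weight values are all assumptions made up for this example.

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.RandomState(0)
n_samples = 500
X = rng.randn(n_samples, 3)
y = 3 * X[:, 0] + 0.1 * rng.randn(n_samples)

# Corrupt the first 50 targets (synthetic outliers for this illustration).
y[:50] += 100.0

# Down-weight the corrupted samples instead of removing them.
sample_weight = np.ones(n_samples)
sample_weight[:50] = 0.01

unweighted = ElasticNet(alpha=0.01).fit(X, y)
weighted = ElasticNet(alpha=0.01).fit(X, y, sample_weight=sample_weight)

# The outliers pull the unweighted intercept away from zero; the weighted
# fit is much less affected.
print(unweighted.intercept_, weighted.intercept_)
```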

