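Release Highlights for scikit-learn 0.23#

We are pleased to announce the release of scikit-learn 0.23! It brings many bug fixes and improvements, as well as some new key features. We detail a few of the major features of this release below. **For an exhaustive list of all the changes**, please refer to the release notes.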

To install the latest version (with pip):

pip install --upgrade scikit-learn

or with conda:

conda install -c conda-forge scikit-learn
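
After upgrading, it can be useful to confirm which version the active Python environment actually imports. The short check below is only a convenience sketch and is not part of the original highlights:

```python
import sklearn

# Print the version of scikit-learn that this interpreter imports.
installed_version = sklearn.__version__
print(installed_version)
```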

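Generalized Linear Models, and Poisson loss for gradient boosting#

The long-awaited Generalized Linear Models with non-normal loss functions are now available. In particular, three new regressors were implemented: PoissonRegressor, GammaRegressor, and TweedieRegressor. The Poisson regressor can be used to model positive integer counts or relative frequencies. Read more in the User Guide. Additionally, HistGradientBoostingRegressor supports a new "poisson" loss as well.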

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import PoissonRegressor
from sklearn.ensemble import HistGradientBoostingRegressor

n_samples, n_features = 1000, 20
rng = np.random.RandomState(0)
X = rng.randn(n_samples, n_features)
# positive integer target correlated with X[:, 5] with many zeros:
y = rng.poisson(lam=np.exp(X[:, 5]) / 2)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=rng)
glm = PoissonRegressor()
gbdt = HistGradientBoostingRegressor(loss="poisson", learning_rate=0.01)
glm.fit(X_train, y_train)
gbdt.fit(X_train, y_train)
print(glm.score(X_test, y_test))
print(gbdt.score(X_test, y_test))
0.35776189065725783
0.42425183539869415
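
The two scores printed above are not directly comparable: PoissonRegressor.score reports the fraction of deviance explained (D²), while HistGradientBoostingRegressor.score reports R². One way to put count models on a common scale is the mean Poisson deviance. The sketch below is an illustration rather than part of the original example. It regenerates similar synthetic data with a fixed integer seed (an assumption, so the numbers will differ from those above) and compares the GLM against a constant mean baseline:

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import PoissonRegressor
from sklearn.metrics import mean_poisson_deviance
from sklearn.model_selection import train_test_split

# Synthetic count data similar to the example above (seed chosen for this sketch).
rng = np.random.RandomState(0)
X = rng.randn(1000, 20)
y = rng.poisson(lam=np.exp(X[:, 5]) / 2)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

glm = PoissonRegressor().fit(X_train, y_train)
# Baseline that always predicts the mean of the training target.
baseline = DummyRegressor(strategy="mean").fit(X_train, y_train)

# Lower deviance is better; predictions must be strictly positive.
dev_glm = mean_poisson_deviance(y_test, glm.predict(X_test))
dev_baseline = mean_poisson_deviance(y_test, baseline.predict(X_test))
print(dev_glm, dev_baseline)
```

The same metric can be applied to the predictions of a HistGradientBoostingRegressor fitted with loss="poisson", since that loss also yields strictly positive predictions.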

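Rich visual representation of estimators#

Estimators can now be visualized in notebooks by enabling the display='diagram' option. This is particularly useful for summarising the structure of pipelines and other composite estimators, with interactivity to provide detail. Click on the example image below to expand Pipeline elements. See Visualizing Composite Estimators for how you can use this feature.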

from sklearn import set_config
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LogisticRegression

set_config(display="diagram")

num_proc = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())

cat_proc = make_pipeline(
    SimpleImputer(strategy="constant", fill_value="missing"),
    OneHotEncoder(handle_unknown="ignore"),
)

preprocessor = make_column_transformer(
    (num_proc, ("feat1", "feat3")), (cat_proc, ("feat0", "feat2"))
)

clf = make_pipeline(preprocessor, LogisticRegression())
clf
Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('pipeline-1',
                                                  Pipeline(steps=[('simpleimputer',
                                                                   SimpleImputer(strategy='median')),
                                                                  ('standardscaler',
                                                                   StandardScaler())]),
                                                  ('feat1', 'feat3')),
                                                 ('pipeline-2',
                                                  Pipeline(steps=[('simpleimputer',
                                                                   SimpleImputer(fill_value='missing',
                                                                                 strategy='constant')),
                                                                  ('onehotencoder',
                                                                   OneHotEncoder(handle_unknown='ignore'))]),
                                                  ('feat0', 'feat2'))])),
                ('logisticregression', LogisticRegression())])
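
Outside a notebook, the same diagram can be obtained as an HTML string. The sketch below assumes the estimator_html_repr helper from sklearn.utils, which was introduced alongside this feature; the two-step pipeline is a simplified stand-in for the one above:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.utils import estimator_html_repr

# A small pipeline used only for illustration.
clf = make_pipeline(StandardScaler(), LogisticRegression())

# Build the HTML diagram as a string; it can be written to a file
# or embedded in a report and opened in any browser.
html = estimator_html_repr(clf)
print(len(html))
```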

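Scalability and stability improvements to KMeans#

The KMeans estimator was entirely re-worked, and it is now significantly faster and more stable. In addition, the Elkan algorithm is now compatible with sparse matrices. The estimator uses OpenMP-based parallelism instead of relying on joblib, so the n_jobs parameter no longer has any effect. For more details on how to control the number of threads, please refer to our Parallelism notes.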

import scipy
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import completeness_score

rng = np.random.RandomState(0)
X, y = make_blobs(random_state=rng)
X = scipy.sparse.csr_matrix(X)
X_train, X_test, _, y_test = train_test_split(X, y, random_state=rng)
kmeans = KMeans(n_init="auto").fit(X_train)
print(completeness_score(kmeans.predict(X_test), y_test))
0.6684259852425617
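
Because n_jobs no longer applies, the number of threads has to be limited at the OpenMP level. One possible approach, sketched here under the assumption that the threadpoolctl package is installed (it is a scikit-learn dependency), is to wrap the fit in a threadpool_limits context. The dataset and parameter values are chosen only for illustration:

```python
from threadpoolctl import threadpool_limits
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# Restrict OpenMP-based code (such as KMeans) to a single thread
# for the duration of the "with" block.
with threadpool_limits(limits=1, user_api="openmp"):
    kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

print(kmeans.cluster_centers_.shape)
```

Setting the OMP_NUM_THREADS environment variable before starting Python is another common way to achieve a similar effect for the whole process.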

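Improvements to the histogram-based Gradient Boosting estimators#

Various improvements were made to HistGradientBoostingClassifier and HistGradientBoostingRegressor. On top of the Poisson loss mentioned above, these estimators now support sample weights. Also, an automatic early-stopping criterion was added: early stopping is enabled by default when the number of samples exceeds 10k. Finally, users can now define monotonic constraints to constrain the predictions based on the variations of specific features. In the following example, we construct a target that is generally positively correlated with the first feature, with some noise. Applying monotonic constraints allows the prediction to capture the global effect of the first feature, instead of fitting the noise. For a use case example, see Features in Histogram Gradient Boosting Trees.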

import numpy as np
from matplotlib import pyplot as plt
from sklearn.model_selection import train_test_split

# from sklearn.inspection import plot_partial_dependence
from sklearn.inspection import PartialDependenceDisplay
from sklearn.ensemble import HistGradientBoostingRegressor

n_samples = 500
rng = np.random.RandomState(0)
X = rng.randn(n_samples, 2)
noise = rng.normal(loc=0.0, scale=0.01, size=n_samples)
y = 5 * X[:, 0] + np.sin(10 * np.pi * X[:, 0]) - noise

gbdt_no_cst = HistGradientBoostingRegressor().fit(X, y)
gbdt_cst = HistGradientBoostingRegressor(monotonic_cst=[1, 0]).fit(X, y)

# plot_partial_dependence has been removed in version 1.2. From 1.2, use
# PartialDependenceDisplay instead.
# disp = plot_partial_dependence(
disp = PartialDependenceDisplay.from_estimator(
    gbdt_no_cst,
    X,
    features=[0],
    feature_names=["feature 0"],
    line_kw={"linewidth": 4, "label": "unconstrained", "color": "tab:blue"},
)
# plot_partial_dependence(
PartialDependenceDisplay.from_estimator(
    gbdt_cst,
    X,
    features=[0],
    line_kw={"linewidth": 4, "label": "constrained", "color": "tab:orange"},
    ax=disp.axes_,
)
disp.axes_[0, 0].plot(
    X[:, 0], y, "o", alpha=0.5, zorder=-1, label="samples", color="tab:green"
)
disp.axes_[0, 0].set_ylim(-3, 3)
disp.axes_[0, 0].set_xlim(-1, 1)
plt.legend()
plt.show()

Sample-weight support for Lasso and ElasticNet#

The two linear regressors Lasso and ElasticNet now support sample weights.

from sklearn.model_selection import train_test_split
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
import numpy as np

n_samples, n_features = 1000, 20
rng = np.random.RandomState(0)
X, y = make_regression(n_samples, n_features, random_state=rng)
sample_weight = rng.rand(n_samples)
X_train, X_test, y_train, y_test, sw_train, sw_test = train_test_split(
    X, y, sample_weight, random_state=rng
)
reg = Lasso()
reg.fit(X_train, y_train, sample_weight=sw_train)
print(reg.score(X_test, y_test, sw_test))
0.999791942438998
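
One way to build intuition for sample weights is that an integer weight should behave like repeating the corresponding sample that many times. The sketch below checks this for ElasticNet. The data, the alpha and l1_ratio values, and the tight tolerance are choices made for this illustration, and the agreement is only expected up to solver precision:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet

X, y = make_regression(n_samples=200, n_features=10, noise=1.0, random_state=0)
rng = np.random.RandomState(0)
# Integer weights in {1, 2, 3}, one per sample.
sample_weight = rng.randint(1, 4, size=X.shape[0])

# A tight tolerance so that both fits converge to nearly the same solution.
params = dict(alpha=0.1, l1_ratio=0.5, tol=1e-10, max_iter=100_000)

# Fit once using the weights directly...
reg_weighted = ElasticNet(**params).fit(X, y, sample_weight=sample_weight)

# ...and once on a dataset where each sample is repeated "weight" times.
X_rep = np.repeat(X, sample_weight, axis=0)
y_rep = np.repeat(y, sample_weight)
reg_repeated = ElasticNet(**params).fit(X_rep, y_rep)

max_diff = np.max(np.abs(reg_weighted.coef_ - reg_repeated.coef_))
print(max_diff)
```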

Total running time of the script: (0 minutes 0.621 seconds)
