Release Highlights for scikit-learn 0.23#
We are happy to announce the release of scikit-learn 0.23! Many bug fixes and improvements were added, as well as some new key features. We detail below a few of the major features of this release. For an exhaustive list of all the changes, please refer to the release notes.
To install the latest version (with pip):
pip install --upgrade scikit-learn
or with conda:
conda install -c conda-forge scikit-learn
Generalized Linear Models, and Poisson loss for gradient boosting#
Generalized Linear Models with non-normal loss functions are now available. In particular, three new regressors were implemented: PoissonRegressor, GammaRegressor, and TweedieRegressor. The Poisson regressor can be used to model positive integer counts, or relative frequencies. Read more in the User Guide. Additionally, HistGradientBoostingRegressor supports a new "poisson" loss as well.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import PoissonRegressor
from sklearn.ensemble import HistGradientBoostingRegressor
n_samples, n_features = 1000, 20
rng = np.random.RandomState(0)
X = rng.randn(n_samples, n_features)
# positive integer target correlated with X[:, 5] with many zeros:
y = rng.poisson(lam=np.exp(X[:, 5]) / 2)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=rng)
glm = PoissonRegressor()
gbdt = HistGradientBoostingRegressor(loss="poisson", learning_rate=0.01)
glm.fit(X_train, y_train)
gbdt.fit(X_train, y_train)
print(glm.score(X_test, y_test))
print(gbdt.score(X_test, y_test))
0.3577618906572577
0.42425183539869404
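The other two new regressors are not shown above. As a minimal sketch (constructed for this note, not taken from the release examples), TweedieRegressor with a power parameter between 1 and 2 selects a compound Poisson-gamma distribution, which is suited to non-negative targets with many exact zeros, such as insurance claim amounts:

```python
import numpy as np
from sklearn.linear_model import TweedieRegressor

rng = np.random.RandomState(0)
n_samples, n_features = 1000, 20
X = rng.randn(n_samples, n_features)
# non-negative target with many exact zeros and a continuous positive part
y = rng.poisson(lam=np.exp(X[:, 5]) / 2) * rng.gamma(
    shape=2.0, scale=1.0, size=n_samples
)

# power=1.5 selects a compound Poisson-gamma distribution; the log link
# guarantees strictly positive predictions
glm = TweedieRegressor(power=1.5, link="log")
glm.fit(X, y)
print(glm.score(X, y))  # D^2, the fraction of deviance explained
```

Setting power=1 recovers the Poisson case and power=2 the gamma case.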
Rich visual representation of estimators#
Estimators can now be visualized in notebooks by enabling the display='diagram' option. This is particularly useful to summarize the structure of pipelines and other composite estimators, with interactivity to provide detail. Click on the example image below to expand Pipeline elements. See Visualizing Composite Estimators for how to use this feature.
from sklearn import set_config
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LogisticRegression
set_config(display="diagram")
num_proc = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())
cat_proc = make_pipeline(
SimpleImputer(strategy="constant", fill_value="missing"),
OneHotEncoder(handle_unknown="ignore"),
)
preprocessor = make_column_transformer(
(num_proc, ("feat1", "feat3")), (cat_proc, ("feat0", "feat2"))
)
clf = make_pipeline(preprocessor, LogisticRegression())
clf
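The same diagram can also be produced as standalone HTML, for example to embed in a report. A small sketch using a simpler pipeline than the one above (the file name is arbitrary):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.utils import estimator_html_repr

clf = make_pipeline(StandardScaler(), LogisticRegression())

# estimator_html_repr returns the diagram as an HTML string
html = estimator_html_repr(clf)
with open("pipeline_diagram.html", "w") as f:
    f.write(html)
```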
Scalability and stability improvements to KMeans#
The KMeans estimator was entirely re-worked, and it is now significantly faster and more stable. In addition, the Elkan algorithm is now compatible with sparse matrices. The estimator uses OpenMP based parallelism instead of relying on joblib, so the n_jobs parameter has no effect anymore. For more details on how to control the number of threads, please refer to our Parallelism notes.
import scipy
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import completeness_score
rng = np.random.RandomState(0)
X, y = make_blobs(random_state=rng)
X = scipy.sparse.csr_matrix(X)
X_train, X_test, _, y_test = train_test_split(X, y, random_state=rng)
kmeans = KMeans(n_init="auto").fit(X_train)
print(completeness_score(kmeans.predict(X_test), y_test))
0.740575686247884
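Since n_jobs no longer applies, the OpenMP thread count is controlled via the environment (e.g. OMP_NUM_THREADS) or, from Python, with the threadpoolctl package that scikit-learn depends on. A sketch of the latter:

```python
import numpy as np
from threadpoolctl import threadpool_limits
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# limit KMeans to at most 2 OpenMP threads inside this context
with threadpool_limits(limits=2):
    kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

print(kmeans.cluster_centers_.shape)
```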
Improvements to the histogram-based Gradient Boosting estimators#
Various improvements were made to HistGradientBoostingClassifier and HistGradientBoostingRegressor. On top of the Poisson loss mentioned above, these estimators now support sample weights. Also, an automatic early-stopping criterion was added: early stopping is enabled by default when the number of samples exceeds 10k. Finally, users can now define monotonic constraints to constrain the predictions based on the variations of specific features. In the following example, we construct a target that is generally positively correlated with the first feature, with some noise. Applying monotonic constraints allows the prediction to capture the global effect of the first feature, instead of fitting the noise. For a usecase example, see Features in Histogram Gradient Boosting Trees.
import numpy as np
from matplotlib import pyplot as plt
from sklearn.model_selection import train_test_split
# from sklearn.inspection import plot_partial_dependence
from sklearn.inspection import PartialDependenceDisplay
from sklearn.ensemble import HistGradientBoostingRegressor
n_samples = 500
rng = np.random.RandomState(0)
X = rng.randn(n_samples, 2)
noise = rng.normal(loc=0.0, scale=0.01, size=n_samples)
y = 5 * X[:, 0] + np.sin(10 * np.pi * X[:, 0]) - noise
gbdt_no_cst = HistGradientBoostingRegressor().fit(X, y)
gbdt_cst = HistGradientBoostingRegressor(monotonic_cst=[1, 0]).fit(X, y)
# plot_partial_dependence has been removed in version 1.2. From 1.2, use
# PartialDependenceDisplay instead.
# disp = plot_partial_dependence(
disp = PartialDependenceDisplay.from_estimator(
gbdt_no_cst,
X,
features=[0],
feature_names=["feature 0"],
line_kw={"linewidth": 4, "label": "unconstrained", "color": "tab:blue"},
)
# plot_partial_dependence(
PartialDependenceDisplay.from_estimator(
gbdt_cst,
X,
features=[0],
line_kw={"linewidth": 4, "label": "constrained", "color": "tab:orange"},
ax=disp.axes_,
)
disp.axes_[0, 0].plot(
X[:, 0], y, "o", alpha=0.5, zorder=-1, label="samples", color="tab:green"
)
disp.axes_[0, 0].set_ylim(-3, 3)
disp.axes_[0, 0].set_xlim(-1, 1)
plt.legend()
plt.show()
Sample weight support for Lasso and ElasticNet#
The two linear regressors Lasso and ElasticNet now support sample weights.
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
import numpy as np
n_samples, n_features = 1000, 20
rng = np.random.RandomState(0)
X, y = make_regression(n_samples, n_features, random_state=rng)
sample_weight = rng.rand(n_samples)
X_train, X_test, y_train, y_test, sw_train, sw_test = train_test_split(
X, y, sample_weight, random_state=rng
)
reg = Lasso()
reg.fit(X_train, y_train, sample_weight=sw_train)
print(reg.score(X_test, y_test, sw_test))
0.999791942438998
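ElasticNet accepts sample weights in exactly the same way; a brief sketch (constructed for this note, with an arbitrary alpha):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet

rng = np.random.RandomState(0)
X, y = make_regression(n_samples=200, n_features=10, random_state=rng)
sample_weight = rng.rand(200)

reg = ElasticNet(alpha=0.1)
reg.fit(X, y, sample_weight=sample_weight)
print(reg.score(X, y, sample_weight=sample_weight))
```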
Total running time of the script: (0 minutes 0.714 seconds)