Lasso on dense and sparse data#

We show that linear_model.Lasso gives the same results on dense and sparse data, and that it runs faster when sparse data is stored in a sparse format.

# Authors: The scikit-learn developers
# SPDX-License-Identifier: BSD-3-Clause

from time import time

from scipy import linalg, sparse

from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

Comparing the two Lasso implementations on dense data#

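We create a linear regression problem that is suitable for the Lasso, that is, one with more features than samples. We then store the data matrix in both dense (the usual) and sparse formats, and train a Lasso on each. We time both fits and check that they learned the same model by computing the Euclidean norm of the difference between the coefficients they learned. Because the data is dense, we expect the dense data format to give the better runtime.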

X, y = make_regression(n_samples=200, n_features=5000, random_state=0)
# create a copy of X in sparse format
X_sp = sparse.coo_matrix(X)

alpha = 1
sparse_lasso = Lasso(alpha=alpha, fit_intercept=False, max_iter=1000)
dense_lasso = Lasso(alpha=alpha, fit_intercept=False, max_iter=1000)

t0 = time()
sparse_lasso.fit(X_sp, y)
print(f"Sparse Lasso done in {(time() - t0):.3f}s")

t0 = time()
dense_lasso.fit(X, y)
print(f"Dense Lasso done in {(time() - t0):.3f}s")

# compare the regression coefficients
coeff_diff = linalg.norm(sparse_lasso.coef_ - dense_lasso.coef_)
print(f"Distance between coefficients : {coeff_diff:.2e}")

Sparse Lasso done in 0.110s
Dense Lasso done in 0.038s
Distance between coefficients : 1.01e-13
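
To see why a sparse format offers no advantage on dense data, it can help to compare storage costs. The following is a minimal sketch and not part of the original example: it assumes standard-normal features drawn directly with NumPy as a stand-in for make_regression, and the names X_dense, X_coo, dense_bytes and coo_bytes are illustrative.

```python
import numpy as np
from scipy import sparse

# Assumption: standard-normal features stand in for the make_regression data.
rng = np.random.default_rng(0)
X_dense = rng.standard_normal((200, 5000))

# Same values stored in COO format: one value plus a (row, col) pair per entry.
X_coo = sparse.coo_matrix(X_dense)

dense_bytes = X_dense.nbytes
coo_bytes = X_coo.data.nbytes + X_coo.row.nbytes + X_coo.col.nbytes

print(f"Stored entries : {X_coo.nnz} of {X_dense.size}")
print(f"Dense storage  : {dense_bytes / 1e6:.1f} MB")
print(f"COO storage    : {coo_bytes / 1e6:.1f} MB")
```

When every entry is non-zero, the COO format stores each value together with its row and column index, so it needs more memory than the plain array. This extra index handling is one plausible contributor to the slower sparse fit reported above, although the sketch itself does not measure runtime.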

Comparing the two Lasso implementations on sparse data#

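We make the previous problem sparse by replacing all small values with 0 and run the same comparison as above. Because the data is now sparse, we expect the implementation that uses the sparse data format to be faster.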
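
As a small illustration of the thresholding step, the sketch below applies the same rule to a hand-written 3 x 4 array and inspects the resulting CSC (compressed sparse column) representation. The array X_small and the variable names are made up for this illustration and do not appear in the original example.

```python
import numpy as np
from scipy import sparse

# Hypothetical toy matrix, chosen so that only a few entries survive.
X_small = np.array(
    [
        [3.0, 0.5, -1.0, 2.7],
        [0.1, 2.6, 0.0, -3.0],
        [2.9, 1.0, 4.0, 0.2],
    ]
)

# Same rule as in the example: every value below 2.5 becomes 0.
# Note that this also zeroes large negative values such as -3.0.
Xs_small = X_small.copy()
Xs_small[Xs_small < 2.5] = 0.0

# CSC stores the non-zero values column by column.
Xs_small_csc = sparse.csc_matrix(Xs_small)

print("data   :", Xs_small_csc.data)     # non-zero values, column by column
print("indices:", Xs_small_csc.indices)  # row index of each stored value
print("indptr :", Xs_small_csc.indptr)   # where each column starts in data
print(f"density: {Xs_small_csc.nnz / Xs_small.size:.3f}")
```

Only the five values that are at least 2.5 are stored, and indptr marks where each column begins, so a column can be read without scanning the zeros.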

# make a copy of the previous data
Xs = X.copy()
# make Xs sparse by replacing the values lower than 2.5 with 0s
Xs[Xs < 2.5] = 0.0
# create a copy of Xs in sparse format
Xs_sp = sparse.coo_matrix(Xs)
Xs_sp = Xs_sp.tocsc()

# compute the proportion of non-zero entries in the data matrix
print(f"Matrix density : {(Xs_sp.nnz / float(X.size) * 100):.3f}%")

alpha = 0.1
sparse_lasso = Lasso(alpha=alpha, fit_intercept=False, max_iter=10000)
dense_lasso = Lasso(alpha=alpha, fit_intercept=False, max_iter=10000)

t0 = time()
sparse_lasso.fit(Xs_sp, y)
print(f"Sparse Lasso done in {(time() - t0):.3f}s")

t0 = time()
dense_lasso.fit(Xs, y)
print(f"Dense Lasso done in  {(time() - t0):.3f}s")

# compare the regression coefficients
coeff_diff = linalg.norm(sparse_lasso.coef_ - dense_lasso.coef_)
print(f"Distance between coefficients : {coeff_diff:.2e}")

Matrix density : 0.626%
Sparse Lasso done in 0.200s
Dense Lasso done in  0.742s
Distance between coefficients : 8.65e-12
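
The reported matrix density can be sanity-checked with a short calculation. The sketch below assumes that the features are independent standard-normal draws (an assumption about make_regression's default behaviour, not something stated in this example), in which case the fraction of entries that survive the 2.5 threshold is the upper-tail probability of a standard normal. The variable names are illustrative.

```python
import math

import numpy as np

threshold = 2.5

# Upper-tail probability P(Z >= 2.5) for a standard normal Z,
# written with the complementary error function.
expected_density = 0.5 * math.erfc(threshold / math.sqrt(2.0))

# Empirical check on a fresh sample of the same shape as the data matrix.
rng = np.random.default_rng(0)
sample = rng.standard_normal((200, 5000))
empirical_density = np.mean(sample >= threshold)

print(f"Expected density  : {expected_density * 100:.3f}%")
print(f"Empirical density : {empirical_density * 100:.3f}%")
```

Under that assumption the expected density is roughly 0.62%, which is in line with the 0.626% printed above. Only the upper tail counts, because the thresholding sets all negative values to 0 as well.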

Total running time of the script: (0 minutes 1.159 seconds)

Related examples

L1-based models for Sparse Signals

Joint feature selection with multi-task Lasso

Lasso, Lasso-LARS, and Elastic Net paths

Lasso model selection: AIC-BIC / cross-validation
