Lasso on dense and sparse data
We show that linear_model.Lasso gives the same results on dense and sparse data, and that it runs faster when the data is sparse and stored in a sparse format.
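For reference, the objective that Lasso minimizes, as given in the scikit-learn documentation, can be written as below. The symbols are chosen for this sketch: $n$ stands for the number of samples, $w$ for the coefficient vector, and $\alpha$ for the `alpha` parameter. The intercept is left out to match the `fit_intercept=False` setting used in the code of this example.

```latex
\min_{w} \; \frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_1
```

The $\ell_1$ penalty drives many coefficients to exactly zero, which is why the method is a natural fit for problems with more features than samples.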
# Authors: The scikit-learn developers
# SPDX-License-Identifier: BSD-3-Clause
from time import time
from scipy import linalg, sparse
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
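Comparing the two Lasso implementations on dense data
We create a linear regression problem that is suitable for the Lasso, that is, one with more features than samples. We then store the data matrix in both dense (the usual) and sparse format, and train a Lasso on each. We measure the runtime of both fits and check that they learned the same model by computing the Euclidean norm of the difference between their coefficients. Because the data is dense, we expect the dense data format to give the better runtime.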
X, y = make_regression(n_samples=200, n_features=5000, random_state=0)
# create a copy of X in sparse format
X_sp = sparse.coo_array(X)
alpha = 1
sparse_lasso = Lasso(alpha=alpha, fit_intercept=False, max_iter=1000)
dense_lasso = Lasso(alpha=alpha, fit_intercept=False, max_iter=1000)
t0 = time()
sparse_lasso.fit(X_sp, y)
print(f"Sparse Lasso done in {(time() - t0):.3f}s")
t0 = time()
dense_lasso.fit(X, y)
print(f"Dense Lasso done in {(time() - t0):.3f}s")
# compare the regression coefficients
coeff_diff = linalg.norm(sparse_lasso.coef_ - dense_lasso.coef_)
print(f"Distance between coefficients : {coeff_diff:.2e}")
Sparse Lasso done in 0.109s
Dense Lasso done in 0.040s
Distance between coefficients : 5.23e-14
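The same equivalence check can be sketched on a much smaller problem that runs almost instantly. The names `X_small`, `y_small`, `dense_model`, and `sparse_model` are hypothetical and not part of the original example, and the problem size is an arbitrary choice. `sparse.csc_matrix` is used here only as one widely available sparse container; the example above uses `sparse.coo_array` instead.

```python
import numpy as np
from scipy import sparse
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# A small "more features than samples" problem (sizes are illustrative).
X_small, y_small = make_regression(n_samples=50, n_features=200, random_state=0)

# Same values as X_small, stored in a sparse container.
X_small_sp = sparse.csc_matrix(X_small)

# Two estimators with identical hyperparameters, one per storage format.
dense_model = Lasso(alpha=1.0, fit_intercept=False, max_iter=1000)
sparse_model = Lasso(alpha=1.0, fit_intercept=False, max_iter=1000)
dense_model.fit(X_small, y_small)
sparse_model.fit(X_small_sp, y_small)

# Euclidean distance between the two coefficient vectors.
coef_distance = np.linalg.norm(sparse_model.coef_ - dense_model.coef_)

# Number of features kept by the L1 penalty in the dense fit.
n_selected = np.count_nonzero(dense_model.coef_)

print(f"Distance between coefficients : {coef_distance:.2e}")
print(f"Non-zero coefficients : {n_selected} of {dense_model.coef_.size}")
```

A distance close to floating-point round-off suggests that both code paths reach the same solution; the exact value depends on the library versions and the platform.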
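Comparing the two Lasso implementations on sparse data
We make the previous problem sparse by replacing all small values with 0 and run the same comparison as above. Because the data is now sparse, we expect the implementation that uses the sparse data format to be faster.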
# make a copy of the previous data
Xs = X.copy()
# make Xs sparse by replacing the values lower than 2.5 with 0s
Xs[Xs < 2.5] = 0.0
# create a copy of Xs in sparse format
Xs_sp = sparse.coo_array(Xs)
Xs_sp = Xs_sp.tocsc()
# compute the proportion of non-zero entries in the data matrix
print(f"Matrix density : {(Xs_sp.nnz / Xs.size * 100):.3f}%")
alpha = 0.1
sparse_lasso = Lasso(alpha=alpha, fit_intercept=False, max_iter=10000)
dense_lasso = Lasso(alpha=alpha, fit_intercept=False, max_iter=10000)
t0 = time()
sparse_lasso.fit(Xs_sp, y)
print(f"Sparse Lasso done in {(time() - t0):.3f}s")
t0 = time()
dense_lasso.fit(Xs, y)
print(f"Dense Lasso done in {(time() - t0):.3f}s")
# compare the regression coefficients
coeff_diff = linalg.norm(sparse_lasso.coef_ - dense_lasso.coef_)
print(f"Distance between coefficients : {coeff_diff:.2e}")
Matrix density : 0.626%
Sparse Lasso done in 0.151s
Dense Lasso done in 0.985s
Distance between coefficients : 3.85e-13
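To see why the sparse format pays off at this density, the sketch below thresholds a standard normal matrix in the same way and compares the storage needed by the dense array and by its CSC representation. The matrix `X_demo`, its shape, and the byte-count variables are assumptions made for illustration; they are not taken from the example above.

```python
import numpy as np
from scipy import sparse

# Illustrative data: standard normal entries (shape chosen arbitrarily).
rng = np.random.default_rng(0)
X_demo = rng.standard_normal((200, 500))

# Same thresholding as above: values lower than 2.5 become zeros.
X_demo[X_demo < 2.5] = 0.0

# CSC stores only the non-zero values plus row indices and column pointers.
X_demo_csc = sparse.csc_matrix(X_demo)

# Fraction of entries that are non-zero.
density = X_demo_csc.nnz / X_demo.size

# Memory used by the dense array versus the three CSC arrays.
dense_bytes = X_demo.nbytes
sparse_bytes = (
    X_demo_csc.data.nbytes
    + X_demo_csc.indices.nbytes
    + X_demo_csc.indptr.nbytes
)

print(f"Matrix density : {density * 100:.3f}%")
print(f"Dense storage  : {dense_bytes} bytes")
print(f"Sparse storage : {sparse_bytes} bytes")
```

For standard normal entries, the probability of exceeding 2.5 is roughly 0.6%, which is in line with the density reported above. With so few non-zero entries, a solver that only visits stored values has far less work to do per pass over the features.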
Total running time of the script: (0 minutes 1.364 seconds)