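Lasso model selection: AIC-BIC / cross-validation#

This example focuses on model selection for Lasso models, which are linear models with an L1 penalty for regression problems.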
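For reference, the objective that scikit-learn documents for its Lasso estimator can be written as follows. Here $n$ is the number of samples, $w$ the coefficient vector, and $\alpha$ the regularization parameter that the rest of this example tries to select; the intercept is left out for brevity.

```latex
\min_{w} \; \frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_1
```

A larger $\alpha$ pushes more coefficients to exactly zero, which is why the choice of $\alpha$ also controls feature selection.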

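Several strategies can be used to select the value of the regularization parameter: cross-validation, or an information criterion, namely AIC or BIC.

In what follows, we discuss the different strategies in detail.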

# Authors: The scikit-learn developers
# SPDX-License-Identifier: BSD-3-Clause

Dataset#

In this example, we use the diabetes dataset.

from sklearn.datasets import load_diabetes

X, y = load_diabetes(return_X_y=True, as_frame=True)
X.head()

	age	sex	bmi	bp	s1	s2	s3	s4	s5	s6
0	0.038076	0.050680	0.061696	0.021872	-0.044223	-0.034821	-0.043401	-0.002592	0.019907	-0.017646
1	-0.001882	-0.044642	-0.051474	-0.026328	-0.008449	-0.019163	0.074412	-0.039493	-0.068332	-0.092204
2	0.085299	0.050680	0.044451	-0.005670	-0.045599	-0.034194	-0.032356	-0.002592	0.002861	-0.025930
3	-0.089063	-0.044642	-0.011595	-0.036656	0.012191	0.024991	-0.036038	0.034309	0.022688	-0.009362
4	0.005383	-0.044642	-0.036385	0.021872	0.003935	0.015596	0.008142	-0.002592	-0.031988	-0.046641

In addition, we add some random features to the original data to better illustrate the feature selection performed by the Lasso model.

import numpy as np
import pandas as pd

rng = np.random.RandomState(42)
n_random_features = 14
X_random = pd.DataFrame(
    rng.randn(X.shape[0], n_random_features),
    columns=[f"random_{i:02d}" for i in range(n_random_features)],
)
X = pd.concat([X, X_random], axis=1)
# Show only a subset of the columns
X[X.columns[::3]].head()

	age	bp	s3	s6	random_02	random_05	random_08	random_11
0	0.038076	0.021872	-0.043401	-0.017646	0.647689	-0.234137	-0.469474	-0.465730
1	-0.001882	-0.026328	0.074412	-0.092204	-1.012831	-1.412304	0.067528	0.110923
2	0.085299	-0.005670	-0.032356	-0.025930	-0.601707	-1.057711	0.208864	0.196861
3	-0.089063	-0.036656	-0.036038	-0.009362	-1.478522	1.057122	0.324084	0.611676
4	0.005383	0.021872	0.008142	-0.046641	0.331263	-0.185659	0.812526	1.003533

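Selecting Lasso via an information criterion#

LassoLarsIC provides a Lasso estimator that uses the Akaike information criterion (AIC) or the Bayes information criterion (BIC) to select the optimal value of the regularization parameter alpha.

Before fitting the model, we standardize the data with a StandardScaler. In addition, we measure the time needed to fit the model and tune the hyperparameter alpha, so that we can compare it with the cross-validation strategy.

We first fit a Lasso model with the AIC criterion.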

import time

from sklearn.linear_model import LassoLarsIC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

start_time = time.time()
lasso_lars_ic = make_pipeline(StandardScaler(), LassoLarsIC(criterion="aic")).fit(X, y)
fit_time = time.time() - start_time

We store the AIC metric for each value of alpha used during fit.

results = pd.DataFrame(
    {
        "alphas": lasso_lars_ic[-1].alphas_,
        "AIC criterion": lasso_lars_ic[-1].criterion_,
    }
).set_index("alphas")
alpha_aic = lasso_lars_ic[-1].alpha_

Now, we perform the same analysis using the BIC criterion.

lasso_lars_ic.set_params(lassolarsic__criterion="bic").fit(X, y)
results["BIC criterion"] = lasso_lars_ic[-1].criterion_
alpha_bic = lasso_lars_ic[-1].alpha_

We can check which value of alpha leads to the minimum AIC and BIC.

def highlight_min(x):
    x_min = x.min()
    return ["font-weight: bold" if v == x_min else "" for v in x]


results.style.apply(highlight_min)

	AIC criterion	BIC criterion
alphas
45.160030	5244.764779	5244.764779
42.300343	5208.250639	5212.341949
21.542052	4928.018900	4936.201520
15.034077	4869.678359	4881.952289
6.189631	4815.437362	4831.802601
5.329616	4810.423641	4830.880191
4.306012	4803.573491	4828.121351
4.124225	4804.126502	4832.765671
3.820705	4803.621645	4836.352124
3.750389	4805.012521	4841.834310
3.570655	4805.290075	4846.203174
3.550213	4807.075887	4852.080295
3.358295	4806.878051	4855.973770
3.259297	4807.706026	4860.893055
3.237703	4809.440409	4866.718747
2.850031	4805.989341	4867.358990
2.384338	4801.702266	4867.163224
2.296575	4802.594754	4872.147022
2.031555	4801.236720	4874.880298
1.618263	4798.484109	4876.218997
1.526599	4799.543841	4881.370039
0.586798	4794.238744	4880.156252
0.445978	4795.589715	4885.598533
0.259031	4796.966981	4891.067109
0.032179	4796.662409	4894.853846
0.019069	4794.652739	4888.752867
0.000000	4796.626286	4894.817724
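Because the bold highlighting is lost when the table is rendered as plain text, the selected values can also be read off programmatically. The following is a minimal sketch that copies three rows from the table above into a small DataFrame (the name `table` is illustrative) and uses `DataFrame.idxmin` to find the alpha that minimizes each criterion.

```python
import pandas as pd

# Three rows copied from the table above: the BIC minimum, the AIC minimum,
# and one neighbouring row for comparison.
table = pd.DataFrame(
    {
        "AIC criterion": [4803.573491, 4794.238744, 4794.652739],
        "BIC criterion": [4828.121351, 4880.156252, 4888.752867],
    },
    index=pd.Index([4.306012, 0.586798, 0.019069], name="alphas"),
)

# idxmin returns, for each column, the index label (alpha) of the smallest value.
selected = table.idxmin()
print(selected)
```

On the fitted pipeline, the same information is available directly through the `alpha_` attribute, as stored in `alpha_aic` and `alpha_bic` above.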

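Finally, we can plot the AIC and BIC values for the different alpha values. The vertical lines in the plot correspond to the alpha chosen for each criterion. The selected alpha corresponds to the minimum of the AIC or BIC criterion.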

ax = results.plot()
ax.vlines(
    alpha_aic,
    results["AIC criterion"].min(),
    results["AIC criterion"].max(),
    label="alpha: AIC estimate",
    linestyles="--",
    color="tab:blue",
)
ax.vlines(
    alpha_bic,
    results["BIC criterion"].min(),
    results["BIC criterion"].max(),
    label="alpha: BIC estimate",
    linestyles="--",
    color="tab:orange",
)
ax.set_xlabel(r"$\alpha$")
ax.set_ylabel("criterion")
ax.set_xscale("log")
ax.legend()
_ = ax.set_title(
    f"Information-criterion for model selection (training time {fit_time:.2f}s)"
)

Information-criterion for model selection (training time 0.01s)

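Model selection with an information criterion is very fast. It relies on computing the criterion on the in-sample set provided to fit. Both criteria estimate the generalization error of the model from the training set error and penalize this overly optimistic error. However, this penalty relies on a proper estimate of the degrees of freedom and of the noise variance. Both criteria are derived for large samples (asymptotic results) and assume that the model is correct, i.e. that the data are actually generated by this model.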
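To make the role of the penalty concrete, here is a minimal sketch of both criteria under the assumption of Gaussian noise with a given variance, following the form described in the scikit-learn user guide. The function name `information_criterion` and its arguments are illustrative; the degrees of freedom are taken to be the number of non-zero coefficients.

```python
import numpy as np


def information_criterion(y_true, y_pred, n_nonzero, noise_variance, criterion="aic"):
    """Sketch of AIC/BIC for a linear model with Gaussian noise (assumed form)."""
    n_samples = len(y_true)
    residual_sum_of_squares = np.sum((y_true - y_pred) ** 2)
    # Goodness-of-fit term: -2 * log-likelihood under Gaussian noise.
    fit_term = (
        n_samples * np.log(2 * np.pi * noise_variance)
        + residual_sum_of_squares / noise_variance
    )
    # Complexity penalty: 2 per degree of freedom for AIC, log(n) for BIC.
    factor = 2.0 if criterion == "aic" else np.log(n_samples)
    return fit_term + factor * n_nonzero


# Tiny hypothetical example with two non-zero coefficients.
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.5, 2.0, 2.5, 4.0])
aic = information_criterion(y_true, y_pred, n_nonzero=2, noise_variance=1.0)
bic = information_criterion(
    y_true, y_pred, n_nonzero=2, noise_variance=1.0, criterion="bic"
)
```

The two criteria differ only by `(log(n) - 2)` per degree of freedom. This is consistent with the table above: in the first row no coefficient is active and AIC equals BIC, while in the second row the gap is about 4.09, which matches `log(442) - 2` for the 442 samples of the diabetes dataset and a single active coefficient.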

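These models also tend to break down when the problem is badly conditioned (e.g. more features than samples). In that case, an estimate of the noise variance must be provided.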

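Selecting Lasso via cross-validation#

The Lasso estimator can be implemented with different solvers: coordinate descent and least angle regression. They differ in their execution speed and in their sources of numerical error.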
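To give an idea of what the coordinate descent solver does, here is a simplified NumPy sketch for the objective `(1 / (2 * n)) * ||y - Xw||^2 + alpha * ||w||_1` without an intercept. It is not scikit-learn's implementation: it runs a fixed number of sweeps with no convergence check, and the function names are illustrative.

```python
import numpy as np


def soft_threshold(value, threshold):
    """Shrink a scalar toward zero by `threshold`, clipping at zero."""
    return np.sign(value) * max(abs(value) - threshold, 0.0)


def lasso_coordinate_descent(X, y, alpha, n_iter=200):
    """Cyclic coordinate descent for the Lasso (sketch, no intercept)."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    residual = y.astype(float).copy()  # equals y - X @ w since w starts at zero
    col_sq_norm = (X**2).sum(axis=0) / n_samples
    for _ in range(n_iter):
        for j in range(n_features):
            if col_sq_norm[j] == 0.0:
                continue
            # Remove the contribution of feature j from the current fit.
            residual += X[:, j] * w[j]
            # Correlation of feature j with the partial residual.
            rho = X[:, j] @ residual / n_samples
            # Closed-form one-dimensional Lasso update.
            w[j] = soft_threshold(rho, alpha) / col_sq_norm[j]
            residual -= X[:, j] * w[j]
    return w


# Hypothetical data: only features 0 and 3 carry signal.
rng = np.random.RandomState(0)
X_demo = rng.randn(50, 5)
y_demo = X_demo @ np.array([3.0, 0.0, 0.0, -2.0, 0.0]) + 0.1 * rng.randn(50)

# Above alpha_max every coefficient is exactly zero.
alpha_max = np.max(np.abs(X_demo.T @ y_demo)) / len(y_demo)
w_sparse = lasso_coordinate_descent(X_demo, y_demo, alpha=0.1)
w_zero = lasso_coordinate_descent(X_demo, y_demo, alpha=1.01 * alpha_max)
```

Each call solves the problem for a single alpha, which is why a coordinate descent path is evaluated on a grid of alpha values.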

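In scikit-learn, two estimators with integrated cross-validation are available: LassoCV and LassoLarsCV, which solve the problem with coordinate descent and least angle regression, respectively.

In the remainder of this section, we present both approaches. For both algorithms, we use a 20-fold cross-validation strategy.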

Lasso via coordinate descent#

Let's start by tuning the hyperparameter with LassoCV.

from sklearn.linear_model import LassoCV

start_time = time.time()
model = make_pipeline(StandardScaler(), LassoCV(cv=20)).fit(X, y)
fit_time = time.time() - start_time

import matplotlib.pyplot as plt

ymin, ymax = 2300, 3800
lasso = model[-1]
plt.semilogx(lasso.alphas_, lasso.mse_path_, linestyle=":")
plt.plot(
    lasso.alphas_,
    lasso.mse_path_.mean(axis=-1),
    color="black",
    label="Average across the folds",
    linewidth=2,
)
plt.axvline(lasso.alpha_, linestyle="--", color="black", label="alpha: CV estimate")

plt.ylim(ymin, ymax)
plt.xlabel(r"$\alpha$")
plt.ylabel("Mean square error")
plt.legend()
_ = plt.title(
    f"Mean square error on each fold: coordinate descent (train time: {fit_time:.2f}s)"
)

Mean square error on each fold: coordinate descent (train time: 0.24s)
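The black curve in the plot above is `mse_path_.mean(axis=-1)`, and the alpha reported by the cross-validated estimator minimizes that fold-averaged error. The sketch below reproduces this selection rule on a small hypothetical array with the same layout as `mse_path_` (one row per alpha, one column per fold); the numbers are made up for illustration.

```python
import numpy as np

# Hypothetical grid of alphas and per-fold mean squared errors
# (rows: alphas, columns: 3 cross-validation folds).
alphas = np.array([10.0, 1.0, 0.1, 0.01])
mse_path = np.array(
    [
        [3.0, 3.2, 2.9],
        [2.0, 2.1, 2.2],
        [1.5, 1.9, 1.6],
        [1.8, 1.7, 2.0],
    ]
)

# Average over folds, then pick the alpha with the lowest average error.
mean_mse = mse_path.mean(axis=-1)
best_alpha = alphas[np.argmin(mean_mse)]

# The best alpha can differ from fold to fold.
per_fold_best = alphas[np.argmin(mse_path, axis=0)]
print(best_alpha, per_fold_best)
```

In this toy example the averaged curve selects 0.1, while one of the three folds taken alone would prefer 0.01.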

Lasso via least angle regression#

Next, let's tune the hyperparameter with LassoLarsCV.

from sklearn.linear_model import LassoLarsCV

start_time = time.time()
model = make_pipeline(StandardScaler(), LassoLarsCV(cv=20)).fit(X, y)
fit_time = time.time() - start_time

lasso = model[-1]
plt.semilogx(lasso.cv_alphas_, lasso.mse_path_, ":")
plt.semilogx(
    lasso.cv_alphas_,
    lasso.mse_path_.mean(axis=-1),
    color="black",
    label="Average across the folds",
    linewidth=2,
)
plt.axvline(lasso.alpha_, linestyle="--", color="black", label="alpha CV")

plt.ylim(ymin, ymax)
plt.xlabel(r"$\alpha$")
plt.ylabel("Mean square error")
plt.legend()
_ = plt.title(f"Mean square error on each fold: Lars (train time: {fit_time:.2f}s)")

Mean square error on each fold: Lars (train time: 0.07s)

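Summary of the cross-validation approach#

Both algorithms give roughly the same results.

Lars computes the solution path only at each kink in the path. As a result, it is very efficient when there are only a few kinks, which is the case when there are few features or samples. It is also able to compute the full path without setting any hyperparameter. In contrast, coordinate descent computes the path points on a pre-specified grid (here we use the default). It is therefore more efficient when the number of grid points is smaller than the number of kinks in the path. Such a strategy can be interesting if the number of features is really large and there are enough samples to be selected in each cross-validation fold. In terms of numerical errors, Lars accumulates more errors for heavily correlated variables, while the coordinate descent algorithm only samples the path on a grid.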
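The notion of kinks can be inspected with `sklearn.linear_model.lars_path`, which returns the alpha values at which the active set changes. The sketch below runs it on a small synthetic problem; the data and variable names are assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import lars_path

rng = np.random.RandomState(0)
X_small = rng.randn(60, 8)
y_small = 3.0 * X_small[:, 0] - 2.0 * X_small[:, 1] + 0.1 * rng.randn(60)

# method="lasso" follows the Lasso path; each returned alpha is a kink.
alphas_kinks, active, coefs = lars_path(X_small, y_small, method="lasso")
n_kinks = len(alphas_kinks)

# coefs has one column of coefficients per kink.
print(f"{n_kinks} kinks for {X_small.shape[1]} features")
```

With few features the path has only a handful of kinks, whereas a coordinate descent path evaluates every point of its alpha grid regardless of where the kinks are.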

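Note how the optimal value of alpha varies from fold to fold. This illustrates why nested cross-validation is a good strategy when evaluating the performance of a method whose parameter is chosen by cross-validation: the selected parameter may not be optimal when the final evaluation is done only on an unseen test set.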
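A minimal sketch of nested cross-validation is shown below: an inner LassoCV selects alpha on each training split, and an outer `cross_val_score` loop evaluates the whole selection procedure on data it has not seen. The synthetic data from `make_regression`, the fold counts, and the variable names are assumptions chosen to keep the example small.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression problem standing in for a real dataset.
X_demo, y_demo = make_regression(
    n_samples=200, n_features=20, n_informative=5, noise=10.0, random_state=0
)

# Inner loop: LassoCV picks alpha with its own 5-fold cross-validation.
inner_model = make_pipeline(StandardScaler(), LassoCV(cv=5))

# Outer loop: each outer fold refits the inner search on the training part
# and scores the resulting model on the held-out part.
outer_cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(
    inner_model, X_demo, y_demo, cv=outer_cv, scoring="neg_mean_squared_error"
)
print(f"Nested CV mean squared error: {-scores.mean():.1f}")
```

The outer scores estimate how well the complete procedure, including the choice of alpha, generalizes, rather than how well one particular alpha performs.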

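Conclusion#

In this tutorial, we presented two approaches for selecting the best hyperparameter alpha: one strategy finds the optimal value of alpha using only the training set and an information criterion, and the other is based on cross-validation.

In this example, both approaches work similarly. The in-sample hyperparameter selection is also notably efficient in terms of computation time. However, it can only be used when the number of samples is large enough compared to the number of features.

That is why hyperparameter optimization via cross-validation is a safe strategy: it works in different settings.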

Total running time of the script: (0 minutes 0.901 seconds)