注意

转到末尾下载完整的示例代码或通过 JupyterLite 或 Binder 在浏览器中运行此示例。

泊松回归和非正态损失#

本例演示了对来自 [1] 的法国机动车第三方责任索赔数据集使用对数线性泊松回归，并将其与使用常用最小平方误差拟合的线性模型以及使用泊松损失（和对数链接函数）拟合的非线性 GBRT 模型进行比较。

一些定义

保单 (policy) 是保险公司与个人（在本例中为车辆驾驶员，即投保人 (policyholder)）之间的合同。
索赔 (claim) 是投保人向保险公司提出的，要求赔偿保险承保的损失的请求。
敞口 (exposure) 是给定保单的保险覆盖期限，以年为单位。
索赔频率 (frequency) 是索赔次数除以敞口，通常以每年索赔次数衡量。

在此数据集中，每个样本对应一份保险保单。可用特征包括驾驶员年龄、车辆年龄、车辆功率等。

我们的目标是根据投保人群体的历史数据，预测新投保人发生车祸索赔的预期频率。

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

法国机动车第三方责任索赔数据集#

让我们从 OpenML 加载机动车索赔数据集：https://www.openml.org/d/41214

from sklearn.datasets import fetch_openml

df = fetch_openml(data_id=41214, as_frame=True).frame
df

	IDpol	ClaimNb	Exposure	Area	VehPower	VehAge	DrivAge	BonusMalus	VehBrand	VehGas	Density	Region
0	1.0	1	0.10000	D	5	0	55	50	B12	'Regular'	1217	R82
1	3.0	1	0.77000	D	5	0	55	50	B12	'Regular'	1217	R82
2	5.0	1	0.75000	B	6	2	52	50	B12	'Diesel'	54	R22
3	10.0	1	0.09000	B	7	0	46	50	B12	'Diesel'	76	R72
4	11.0	1	0.84000	B	7	0	46	50	B12	'Diesel'	76	R72
...	...	...	...	...	...	...	...	...	...	...	...	...
678008	6114326.0	0	0.00274	E	4	0	54	50	B12	'Regular'	3317	R93
678009	6114327.0	0	0.00274	E	4	0	41	95	B12	'Regular'	9850	R11
678010	6114328.0	0	0.00274	D	6	2	45	50	B12	'Diesel'	1323	R82
678011	6114329.0	0	0.00274	B	4	0	60	50	B12	'Regular'	95	R26
678012	6114330.0	0	0.00274	B	7	6	29	54	B12	'Diesel'	65	R72

678013 rows × 12 columns

索赔次数 (ClaimNb) 是一个正整数，可以建模为泊松分布。它被假定为在给定时间间隔 (Exposure，以年为单位) 内以恒定速率发生的离散事件数。

在这里，我们想通过一个（缩放的）泊松分布来建模条件于 X 的频率 y = ClaimNb / Exposure，并使用 Exposure 作为 sample_weight。

df["Frequency"] = df["ClaimNb"] / df["Exposure"]

print(
    "Average Frequency = {}".format(np.average(df["Frequency"], weights=df["Exposure"]))
)

print(
    "Fraction of exposure with zero claims = {0:.1%}".format(
        df.loc[df["ClaimNb"] == 0, "Exposure"].sum() / df["Exposure"].sum()
    )
)

fig, (ax0, ax1, ax2) = plt.subplots(ncols=3, figsize=(16, 4))
ax0.set_title("Number of claims")
_ = df["ClaimNb"].hist(bins=30, log=True, ax=ax0)
ax1.set_title("Exposure in years")
_ = df["Exposure"].hist(bins=30, log=True, ax=ax1)
ax2.set_title("Frequency (number of claims per year)")
_ = df["Frequency"].hist(bins=30, log=True, ax=ax2)

Number of claims, Exposure in years, Frequency (number of claims per year)

Average Frequency = 0.10070308464041305
Fraction of exposure with zero claims = 93.9%

剩余的列可用于预测索赔事件的频率。这些列非常异构，混合了不同尺度、分布可能非常不均匀的分类变量和数值变量。

为了使用这些预测变量拟合线性模型，因此有必要执行如下标准特征转换

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import (
    FunctionTransformer,
    KBinsDiscretizer,
    OneHotEncoder,
    StandardScaler,
)

log_scale_transformer = make_pipeline(
    FunctionTransformer(np.log, validate=False), StandardScaler()
)

linear_model_preprocessor = ColumnTransformer(
    [
        ("passthrough_numeric", "passthrough", ["BonusMalus"]),
        (
            "binned_numeric",
            KBinsDiscretizer(
                n_bins=10, quantile_method="averaged_inverted_cdf", random_state=0
            ),
            ["VehAge", "DrivAge"],
        ),
        ("log_scaled_numeric", log_scale_transformer, ["Density"]),
        (
            "onehot_categorical",
            OneHotEncoder(),
            ["VehBrand", "VehPower", "VehGas", "Region", "Area"],
        ),
    ],
    remainder="drop",
)

恒定预测基线#

值得注意的是，超过 93% 的投保人没有索赔。如果我们将此问题转换为二元分类任务，它将严重不平衡，即使是只预测均值的简单模型也能达到 93% 的准确率。

为了评估所用指标的相关性，我们将考虑一个“虚拟”估计器作为基线，它始终预测训练样本的平均频率。

from sklearn.dummy import DummyRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

df_train, df_test = train_test_split(df, test_size=0.33, random_state=0)

dummy = Pipeline(
    [
        ("preprocessor", linear_model_preprocessor),
        ("regressor", DummyRegressor(strategy="mean")),
    ]
).fit(df_train, df_train["Frequency"], regressor__sample_weight=df_train["Exposure"])

让我们使用 3 个不同的回归指标计算此恒定预测基线的性能

from sklearn.metrics import (
    mean_absolute_error,
    mean_poisson_deviance,
    mean_squared_error,
)


def score_estimator(estimator, df_test):
    """Score an estimator on the test set."""
    y_pred = estimator.predict(df_test)

    print(
        "MSE: %.3f"
        % mean_squared_error(
            df_test["Frequency"], y_pred, sample_weight=df_test["Exposure"]
        )
    )
    print(
        "MAE: %.3f"
        % mean_absolute_error(
            df_test["Frequency"], y_pred, sample_weight=df_test["Exposure"]
        )
    )

    # Ignore non-positive predictions, as they are invalid for
    # the Poisson deviance.
    mask = y_pred > 0
    if (~mask).any():
        n_masked, n_samples = (~mask).sum(), mask.shape[0]
        print(
            "WARNING: Estimator yields invalid, non-positive predictions "
            f" for {n_masked} samples out of {n_samples}. These predictions "
            "are ignored when computing the Poisson deviance."
        )

    print(
        "mean Poisson deviance: %.3f"
        % mean_poisson_deviance(
            df_test["Frequency"][mask],
            y_pred[mask],
            sample_weight=df_test["Exposure"][mask],
        )
    )


print("Constant mean frequency evaluation:")
score_estimator(dummy, df_test)

Constant mean frequency evaluation:
MSE: 0.564
MAE: 0.189
mean Poisson deviance: 0.625

（广义）线性模型#

我们首先使用 (l2 惩罚的) 最小二乘线性回归模型（更广为人知的是 Ridge 回归）来建模目标变量。我们使用较低的惩罚 alpha，因为我们预计这种线性模型在此类大型数据集上会欠拟合。

from sklearn.linear_model import Ridge

ridge_glm = Pipeline(
    [
        ("preprocessor", linear_model_preprocessor),
        ("regressor", Ridge(alpha=1e-6)),
    ]
).fit(df_train, df_train["Frequency"], regressor__sample_weight=df_train["Exposure"])

泊松离差不能在模型预测的非正值上计算。对于确实返回少量非正预测的模型（例如 Ridge），我们忽略相应的样本，这意味着获得的泊松离差是近似的。另一种方法是使用 TransformedTargetRegressor 元估计器将 y_pred 映射到严格正域。

print("Ridge evaluation:")
score_estimator(ridge_glm, df_test)

Ridge evaluation:
MSE: 0.560
MAE: 0.186
WARNING: Estimator yields invalid, non-positive predictions  for 595 samples out of 223745. These predictions are ignored when computing the Poisson deviance.
mean Poisson deviance: 0.597

接下来，我们在目标变量上拟合泊松回归器。我们将正则化强度 alpha 设置为大约 1e-6 除以样本数（即 1e-12），以模仿 Ridge 回归器，后者的 L2 惩罚项随样本数而异。

由于泊松回归器在内部建模的是预期目标值的对数，而不是直接建模预期值（对数 vs 恒等链接函数），因此 X 和 y 之间的关系不再是严格线性的。因此，泊松回归器被称为广义线性模型 (GLM)，而不是像 Ridge 回归那样是普通线性模型。

from sklearn.linear_model import PoissonRegressor

n_samples = df_train.shape[0]

poisson_glm = Pipeline(
    [
        ("preprocessor", linear_model_preprocessor),
        ("regressor", PoissonRegressor(alpha=1e-12, solver="newton-cholesky")),
    ]
)
poisson_glm.fit(
    df_train, df_train["Frequency"], regressor__sample_weight=df_train["Exposure"]
)

print("PoissonRegressor evaluation:")
score_estimator(poisson_glm, df_test)

PoissonRegressor evaluation:
MSE: 0.560
MAE: 0.186
mean Poisson deviance: 0.594

用于泊松回归的梯度提升回归树#

最后，我们将考虑一个非线性模型，即梯度提升回归树。基于树的模型不需要对分类数据进行 one-hot 编码：相反，我们可以使用 OrdinalEncoder 为每个类别标签编码一个任意整数。使用这种编码，树会将分类特征视为有序特征，这可能并不总是期望的行为。然而，对于足够深的树，这种影响是有限的，因为它们能够恢复特征的分类性质。OrdinalEncoder 相对于 OneHotEncoder 的主要优点是它会加快训练速度。

梯度提升还提供了使用泊松损失（带有隐式对数链接函数）而不是默认最小二乘损失来拟合树的可能性。在这里，我们只拟合带有泊松损失的树，以使本例保持简洁。

from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.preprocessing import OrdinalEncoder

tree_preprocessor = ColumnTransformer(
    [
        (
            "categorical",
            OrdinalEncoder(),
            ["VehBrand", "VehPower", "VehGas", "Region", "Area"],
        ),
        ("numeric", "passthrough", ["VehAge", "DrivAge", "BonusMalus", "Density"]),
    ],
    remainder="drop",
)
poisson_gbrt = Pipeline(
    [
        ("preprocessor", tree_preprocessor),
        (
            "regressor",
            HistGradientBoostingRegressor(loss="poisson", max_leaf_nodes=128),
        ),
    ]
)
poisson_gbrt.fit(
    df_train, df_train["Frequency"], regressor__sample_weight=df_train["Exposure"]
)

print("Poisson Gradient Boosted Trees evaluation:")
score_estimator(poisson_gbrt, df_test)

Poisson Gradient Boosted Trees evaluation:
MSE: 0.566
MAE: 0.184
mean Poisson deviance: 0.575

与上面的泊松 GLM 一样，梯度提升树模型最小化泊松离差。然而，由于更高的预测能力，它达到了更低的泊松离差值。

用单个训练/测试拆分评估模型容易受到随机波动的影响。如果计算资源允许，应验证交叉验证的性能指标是否会得出相似的结论。

这些模型之间的定性差异也可以通过比较观察到的目标值直方图与预测值直方图来可视化

fig, axes = plt.subplots(nrows=2, ncols=4, figsize=(16, 6), sharey=True)
fig.subplots_adjust(bottom=0.2)
n_bins = 20
for row_idx, label, df in zip(range(2), ["train", "test"], [df_train, df_test]):
    df["Frequency"].hist(bins=np.linspace(-1, 30, n_bins), ax=axes[row_idx, 0])

    axes[row_idx, 0].set_title("Data")
    axes[row_idx, 0].set_yscale("log")
    axes[row_idx, 0].set_xlabel("y (observed Frequency)")
    axes[row_idx, 0].set_ylim([1e1, 5e5])
    axes[row_idx, 0].set_ylabel(label + " samples")

    for idx, model in enumerate([ridge_glm, poisson_glm, poisson_gbrt]):
        y_pred = model.predict(df)

        pd.Series(y_pred).hist(
            bins=np.linspace(-1, 4, n_bins), ax=axes[row_idx, idx + 1]
        )
        axes[row_idx, idx + 1].set(
            title=model[-1].__class__.__name__,
            yscale="log",
            xlabel="y_pred (predicted expected Frequency)",
        )
plt.tight_layout()

Data, Ridge, PoissonRegressor, HistGradientBoostingRegressor, Data, Ridge, PoissonRegressor, HistGradientBoostingRegressor

实验数据呈现出 y 的长尾分布。在所有模型中，我们预测随机变量的预期频率，因此我们将必然拥有比该随机变量的观察实现值更少的极值。这解释了模型预测直方图的众数不一定对应于最小值。此外，Ridge 中使用的正态分布具有恒定方差，而 PoissonRegressor 和 HistGradientBoostingRegressor 中使用的泊松分布，其方差与预测的预期值成比例。

因此，在所考虑的估计器中，与对目标变量分布做出错误假设的 Ridge 模型相比，PoissonRegressor 和 HistGradientBoostingRegressor 先验地更适合建模非负数据的长尾分布。

HistGradientBoostingRegressor 估计器具有最大的灵活性，能够预测更高的预期值。

请注意，我们本可以对 HistGradientBoostingRegressor 模型使用最小二乘损失。这将错误地假设正态分布的响应变量，就像 Ridge 模型一样，并且也可能导致轻微的负预测。然而，由于树的灵活性以及大量的训练样本，梯度提升树仍然表现相对良好，尤其是优于 PoissonRegressor。

预测校准评估#

为了确保估计器对不同类型的投保人产生合理的预测，我们可以根据每个模型返回的 y_pred 对测试样本进行分箱。然后对于每个分箱，我们比较平均预测 y_pred 与平均观察到的目标

from sklearn.utils import gen_even_slices


def _mean_frequency_by_risk_group(y_true, y_pred, sample_weight=None, n_bins=100):
    """Compare predictions and observations for bins ordered by y_pred.

    We order the samples by ``y_pred`` and split it in bins.
    In each bin the observed mean is compared with the predicted mean.

    Parameters
    ----------
    y_true: array-like of shape (n_samples,)
        Ground truth (correct) target values.
    y_pred: array-like of shape (n_samples,)
        Estimated target values.
    sample_weight : array-like of shape (n_samples,)
        Sample weights.
    n_bins: int
        Number of bins to use.

    Returns
    -------
    bin_centers: ndarray of shape (n_bins,)
        bin centers
    y_true_bin: ndarray of shape (n_bins,)
        average y_pred for each bin
    y_pred_bin: ndarray of shape (n_bins,)
        average y_pred for each bin
    """
    idx_sort = np.argsort(y_pred)
    bin_centers = np.arange(0, 1, 1 / n_bins) + 0.5 / n_bins
    y_pred_bin = np.zeros(n_bins)
    y_true_bin = np.zeros(n_bins)

    for n, sl in enumerate(gen_even_slices(len(y_true), n_bins)):
        weights = sample_weight[idx_sort][sl]
        y_pred_bin[n] = np.average(y_pred[idx_sort][sl], weights=weights)
        y_true_bin[n] = np.average(y_true[idx_sort][sl], weights=weights)
    return bin_centers, y_true_bin, y_pred_bin


print(f"Actual number of claims: {df_test['ClaimNb'].sum()}")
fig, ax = plt.subplots(nrows=2, ncols=2, figsize=(12, 8))
plt.subplots_adjust(wspace=0.3)

for axi, model in zip(ax.ravel(), [ridge_glm, poisson_glm, poisson_gbrt, dummy]):
    y_pred = model.predict(df_test)
    y_true = df_test["Frequency"].values
    exposure = df_test["Exposure"].values
    q, y_true_seg, y_pred_seg = _mean_frequency_by_risk_group(
        y_true, y_pred, sample_weight=exposure, n_bins=10
    )

    # Name of the model after the estimator used in the last step of the
    # pipeline.
    print(f"Predicted number of claims by {model[-1]}: {np.sum(y_pred * exposure):.1f}")

    axi.plot(q, y_pred_seg, marker="x", linestyle="--", label="predictions")
    axi.plot(q, y_true_seg, marker="o", linestyle="--", label="observations")
    axi.set_xlim(0, 1.0)
    axi.set_ylim(0, 0.5)
    axi.set(
        title=model[-1],
        xlabel="Fraction of samples sorted by y_pred",
        ylabel="Mean Frequency (y_pred)",
    )
    axi.legend()
plt.tight_layout()

Ridge(alpha=1e-06), PoissonRegressor(alpha=1e-12, solver='newton-cholesky'), HistGradientBoostingRegressor(loss='poisson', max_leaf_nodes=128), DummyRegressor()

Actual number of claims: 11935
Predicted number of claims by Ridge(alpha=1e-06): 11933.4
Predicted number of claims by PoissonRegressor(alpha=1e-12, solver='newton-cholesky'): 11932.0
Predicted number of claims by HistGradientBoostingRegressor(loss='poisson', max_leaf_nodes=128): 12196.1
Predicted number of claims by DummyRegressor(): 11931.2

虚拟回归模型预测恒定频率。该模型并未将所有样本都归为同一等级，但它在全局上校准良好（用于估计整个人口的平均频率）。

Ridge 回归模型可以预测非常低的预期频率，这与数据不符。因此，它可能会严重低估某些投保人的风险。

PoissonRegressor 和 HistGradientBoostingRegressor 在预测目标和观察目标之间显示出更好的一致性，特别是对于低预测目标值。

所有预测的总和也证实了 Ridge 模型的校准问题：它低估了测试集中总索赔次数超过 3%，而其他三个模型可以近似恢复测试组合的总索赔次数。

排名能力评估#

对于某些业务应用，我们对模型对风险最高和最安全的投保人进行排名的能力感兴趣，而不管预测的绝对值如何。在这种情况下，模型评估将问题视为排名问题而不是回归问题。

为了从这个角度比较这 3 个模型，可以绘制索赔的累积比例与敞口的累积比例，其中测试样本按模型预测从最安全到风险最高排序。

该图称为洛伦兹曲线，可以用基尼指数来概括

from sklearn.metrics import auc


def lorenz_curve(y_true, y_pred, exposure):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    exposure = np.asarray(exposure)

    # order samples by increasing predicted risk:
    ranking = np.argsort(y_pred)
    ranked_frequencies = y_true[ranking]
    ranked_exposure = exposure[ranking]
    cumulated_claims = np.cumsum(ranked_frequencies * ranked_exposure)
    cumulated_claims /= cumulated_claims[-1]
    cumulated_exposure = np.cumsum(ranked_exposure)
    cumulated_exposure /= cumulated_exposure[-1]
    return cumulated_exposure, cumulated_claims


fig, ax = plt.subplots(figsize=(8, 8))

for model in [dummy, ridge_glm, poisson_glm, poisson_gbrt]:
    y_pred = model.predict(df_test)
    cum_exposure, cum_claims = lorenz_curve(
        df_test["Frequency"], y_pred, df_test["Exposure"]
    )
    gini = 1 - 2 * auc(cum_exposure, cum_claims)
    label = "{} (Gini: {:.2f})".format(model[-1], gini)
    ax.plot(cum_exposure, cum_claims, linestyle="-", label=label)

# Oracle model: y_pred == y_test
cum_exposure, cum_claims = lorenz_curve(
    df_test["Frequency"], df_test["Frequency"], df_test["Exposure"]
)
gini = 1 - 2 * auc(cum_exposure, cum_claims)
label = "Oracle (Gini: {:.2f})".format(gini)
ax.plot(cum_exposure, cum_claims, linestyle="-.", color="gray", label=label)

# Random Baseline
ax.plot([0, 1], [0, 1], linestyle="--", color="black", label="Random baseline")
ax.set(
    title="Lorenz curves by model",
    xlabel="Cumulative proportion of exposure (from safest to riskiest)",
    ylabel="Cumulative proportion of claims",
)
ax.legend(loc="upper left")

<matplotlib.legend.Legend object at 0x7fb485ae2910>

正如所料，虚拟回归器无法正确地对样本进行排名，因此在此图上表现最差。

基于树的模型在按风险对投保人进行排名方面明显更好，而两个线性模型的性能相似。

所有三个模型都明显优于随机猜测，但也远未做出完美预测。

最后一点是由于问题的性质所致：事故的发生主要受数据集中未捕获的偶然原因支配，并且确实可以被认为是纯粹随机的。

线性模型假设输入变量之间没有交互作用，这可能导致欠拟合。插入多项式特征提取器 (PolynomialFeatures) 确实增加了它们的判别能力，提高了 2 个基尼指数点。特别是它提高了模型识别前 5% 最危险配置文件的能力。

主要收获#

模型的性能可以通过它们产生良好校准的预测和良好排名的能力来评估。
模型的校准可以通过绘制按预测风险分箱的测试样本组的平均观察值与平均预测值来评估。
Ridge 回归模型的最小二乘损失（以及隐式使用的恒等链接函数）似乎导致此模型校准不良。特别是，它倾向于低估风险，甚至可以预测无效的负频率。
使用带有对数链接的泊松损失可以纠正这些问题，并产生校准良好的线性模型。
基尼指数反映了模型对预测进行排名的能力，而不管其绝对值如何，因此只评估其排名能力。
尽管校准有所改进，但两个线性模型的排名能力相当，并且远低于梯度提升回归树的排名能力。
作为评估指标计算的泊松离差反映了模型的校准和排名能力。它还对响应变量的预期值和方差之间的理想关系做出了线性假设。为了简洁起见，我们没有检查这个假设是否成立。
传统的回归指标（例如均方误差和平均绝对误差）在具有许多零的计数数据上很难有意义地解释。

plt.show()

脚本总运行时间： (0 分钟 17.753 秒)