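TargetEncoder

class sklearn.preprocessing.TargetEncoder(categories='auto', target_type='auto', smooth='auto', cv=5, shuffle=True, random_state=None)

Target Encoder for regression and classification targets.

Each category is encoded based on a shrunk estimate of the average target values for observations belonging to the category. The encoding scheme mixes the global target mean with the target mean conditioned on the value of the category (see [MIC]).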
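The shrinkage described above can be sketched in plain Python. This is a minimal illustration, not the library's implementation: it assumes a continuous target, a fixed smoothing value, and the common blend `(category_sum + smooth * global_mean) / (category_count + smooth)`. The helper name `shrunk_target_means` is hypothetical. The data matches the toy data used in the Examples section.

```python
from collections import defaultdict


def shrunk_target_means(categories, targets, smooth):
    """Blend each category's target mean with the global target mean.

    Assumed form: (sum_in_category + smooth * global_mean)
                  / (count_in_category + smooth)
    """
    global_mean = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for cat, target in zip(categories, targets):
        sums[cat] += target
        counts[cat] += 1
    return {
        cat: (sums[cat] + smooth * global_mean) / (counts[cat] + smooth)
        for cat in sums
    }


cats = ["dog"] * 20 + ["cat"] * 30 + ["snake"] * 38
y = [90.3] * 5 + [80.1] * 15 + [20.4] * 5 + [20.1] * 25 + [21.2] * 8 + [49] * 30

# A small smooth value keeps each encoding close to its category mean.
low = shrunk_target_means(cats, y, smooth=1.0)
# A large smooth value pulls every encoding toward the global mean.
high = shrunk_target_means(cats, y, smooth=5000.0)

print(low)
print(high)
```

With a large `smooth`, all three encodings collapse toward the global mean, while a small `smooth` leaves them near the per-category means.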

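When the target type is "multiclass", encodings are based on the conditional probability estimate for each class. The target is first binarized using the "one-vs-all" scheme via LabelBinarizer, then the average target value for each class and each category is used for encoding, resulting in n_features * n_classes encoded output features.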

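TargetEncoder considers missing values, such as np.nan or None, as another category and encodes them like any other category. Categories that are not seen during fit are encoded with the target mean, i.e. target_mean_.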

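For a demonstration of the importance of the TargetEncoder internal cross fitting, see "Target Encoder's Internal Cross fitting". For a comparison of different encoders, refer to "Comparing Target Encoder with Other Encoders". Read more in the User Guide.

Note

fit(X, y).transform(X) does not equal fit_transform(X, y) because a cross fitting scheme is used in fit_transform for encoding. See the User Guide for details.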

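Added in version 1.3.

Parameters:

categories : "auto" or list of shape (n_features,) of array-like, default="auto"

Categories (unique values) per feature:

"auto": Determine categories automatically from the training data.
list: categories[i] holds the categories expected in the i-th column. The passed categories should not mix strings and numeric values within a single feature, and should be sorted in case of numeric values.

The used categories are stored in the categories_ fitted attribute.

target_type : {"auto", "continuous", "binary", "multiclass"}, default="auto"

Type of target.

"auto": Type of target is inferred with type_of_target.
"continuous": Continuous target
"binary": Binary target
"multiclass": Multiclass target

Note

The type of target inferred with "auto" may not be the desired target type used for modeling. For example, if the target consists of integers between 0 and 100, then type_of_target will infer the target as "multiclass". In this case, setting target_type="continuous" will specify the target as a regression problem. The target_type_ attribute gives the target type used by the encoder.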

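Changed in version 1.4: Added the option "multiclass".

smooth : "auto" or float, default="auto"

The amount of mixing of the target mean conditioned on the value of the category with the global target mean. A larger smooth value will put more weight on the global target mean. If "auto", then smooth is set to an empirical Bayes estimate.

cv : int, default=5

Determines the number of folds in the cross fitting strategy used in fit_transform. For classification targets, StratifiedKFold is used and for continuous targets, KFold is used.

shuffle : bool, default=True

Whether to shuffle the data in fit_transform before splitting into folds. Note that the samples within each split will not be shuffled.

random_state : int, RandomState instance or None, default=None

When shuffle is True, random_state affects the ordering of the indices, which controls the randomness of each fold. Otherwise, this parameter has no effect. Pass an int for reproducible output across multiple function calls. See Glossary.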

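Attributes:

encodings_ : list of shape (n_features,) or (n_features * n_classes) of ndarray: Encodings learnt on all of X. For feature i, encodings_[i] are the encodings matching the categories listed in categories_[i]. When target_type_ is "multiclass", the encoding for feature i and class j is stored in encodings_[j + (i * len(classes_))]. For example, for 2 features (f) and 3 classes (c), encodings are ordered: f0_c0, f0_c1, f0_c2, f1_c0, f1_c1, f1_c2.
categories_ : list of shape (n_features,) of ndarray: The categories of each input feature determined during fitting or specified in categories (in order of the features in X and corresponding with the output of transform).
target_type_ : str: Type of target.
target_mean_ : float: The overall mean of the target. This value is only used in transform to encode categories.
n_features_in_ : int: Number of features seen during fit.
feature_names_in_ : ndarray of shape (n_features_in_,): Names of features seen during fit. Defined only when X has feature names that are all strings.
classes_ : ndarray or None: If target_type_ is "binary" or "multiclass", holds the label for each class, otherwise None.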
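The index arithmetic for encodings_ in the multiclass case can be spelled out in a few lines. The helper name `encoding_index` is hypothetical and only restates the formula j + (i * len(classes_)) given above.

```python
def encoding_index(feature_idx, class_idx, n_classes):
    """Position in encodings_ for a given feature and class."""
    return class_idx + feature_idx * n_classes


n_features, n_classes = 2, 3
layout = [None] * (n_features * n_classes)
for i in range(n_features):
    for j in range(n_classes):
        layout[encoding_index(i, j, n_classes)] = f"f{i}_c{j}"

print(layout)
```

The classes of one feature are stored contiguously before the next feature starts.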

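See also

OrdinalEncoder: Performs an ordinal (integer) encoding of the categorical features. Contrary to TargetEncoder, this encoding is not supervised. Treating the resulting encoding as a numerical feature therefore leads to arbitrarily ordered values and typically to lower predictive performance when used as preprocessing for a classifier or regressor.
OneHotEncoder: Performs a one-hot encoding of categorical features. This unsupervised encoding is better suited for low cardinality categorical variables as it generates one new feature per unique category.

References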

[MIC]

Micci-Barreca, Daniele. “A preprocessing scheme for high-cardinality categorical attributes in classification and prediction problems” SIGKDD Explor. Newsl. 3, 1 (July 2001), 27–32.

Examples

With smooth="auto", the smoothing parameter is set to an empirical Bayes estimate.

>>> import numpy as np
>>> from sklearn.preprocessing import TargetEncoder
>>> X = np.array([["dog"] * 20 + ["cat"] * 30 + ["snake"] * 38], dtype=object).T
>>> y = [90.3] * 5 + [80.1] * 15 + [20.4] * 5 + [20.1] * 25 + [21.2] * 8 + [49] * 30
>>> enc_auto = TargetEncoder(smooth="auto")
>>> X_trans = enc_auto.fit_transform(X, y)

>>> # A high `smooth` parameter puts more weight on global mean on the categorical
>>> # encodings:
>>> enc_high_smooth = TargetEncoder(smooth=5000.0).fit(X, y)
>>> enc_high_smooth.target_mean_
np.float64(44...)
>>> enc_high_smooth.encodings_
[array([44..., 44..., 44...])]

>>> # On the other hand, a low `smooth` parameter puts more weight on target
>>> # conditioned on the value of the categorical:
>>> enc_low_smooth = TargetEncoder(smooth=1.0).fit(X, y)
>>> enc_low_smooth.encodings_
[array([20..., 80..., 43...])]

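fit(X, y)

Fit the TargetEncoder to X and y.

Parameters:

X : array-like of shape (n_samples, n_features): The data to determine the categories of each feature.
y : array-like of shape (n_samples,): The target data used to encode the categories.

Returns:

self : object: Fitted encoder.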

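fit_transform(X, y)

Fit TargetEncoder and transform X with the target encoding.

Note

fit(X, y).transform(X) does not equal fit_transform(X, y) because a cross fitting scheme is used in fit_transform for encoding. See the User Guide for details.

Parameters:

X : array-like of shape (n_samples, n_features): The data to determine the categories of each feature.
y : array-like of shape (n_samples,): The target data used to encode the categories.

Returns:

X_trans : ndarray of shape (n_samples, n_features) or (n_samples, (n_features * n_classes)): Transformed input.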

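get_feature_names_out(input_features=None)

Get output feature names for transformation.

Parameters:

input_features : array-like of str or None, default=None: Not used, present here for API consistency by convention.

Returns:

feature_names_out : ndarray of str objects: Transformed feature names. feature_names_in_ is used unless it is not defined, in which case the following input feature names are generated: ["x0", "x1", ..., "x(n_features_in_ - 1)"]. When target_type_ is "multiclass", the names are of the format "<feature_name>_<class_name>".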

get_metadata_routing()

Get metadata routing of this object.

Please check the User Guide on how the routing mechanism works.

Returns:

routing : MetadataRequest: A MetadataRequest encapsulating routing information.

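get_params(deep=True)

Get parameters for this estimator.

Parameters:

deep : bool, default=True: If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:

params : dict: Parameter names mapped to their values.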

property infrequent_categories_: Infrequent categories for each feature.

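set_output(*, transform=None)

Set output container.

See "Introducing the set_output API" for an example on how to use the API.

Parameters:

transform : {"default", "pandas", "polars"}, default=None

Configure output of transform and fit_transform.

"default": Default output format of a transformer
"pandas": DataFrame output
"polars": Polars output
None: Transform configuration is unchanged

Added in version 1.4: The "polars" option was added.

Returns:

self : estimator instance: Estimator instance.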

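set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it is possible to update each component of a nested object.

Parameters:

**params : dict: Estimator parameters.

Returns:

self : estimator instance: Estimator instance.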

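transform(X)

Transform X with the target encoding.

Note

fit(X, y).transform(X) does not equal fit_transform(X, y) because a cross fitting scheme is used in fit_transform for encoding. See the User Guide for details.

Parameters:

X : array-like of shape (n_samples, n_features): The data to encode; the categories of each feature are matched against those determined during fit.

Returns:

X_trans : ndarray of shape (n_samples, n_features) or (n_samples, (n_features * n_classes)): Transformed input.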

Gallery examples

Release Highlights for scikit-learn 1.3

Comparing Target Encoder with Other Encoders

Target Encoder's Internal Cross fitting