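Feature importances with a forest of trees#

This example shows how to use a forest of trees to evaluate the importance of features on an artificial classification task. The blue bars are the feature importances of the forest, and the error bars show the variability between trees.

As expected, the plot suggests that 3 features are informative, while the remaining ones are not.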

# Authors: The scikit-learn developers
# SPDX-License-Identifier: BSD-3-Clause

import matplotlib.pyplot as plt

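Data generation and model fitting#

We generate a synthetic dataset with only 3 informative features. We deliberately do not shuffle the dataset, so that the informative features correspond to the first three columns of X. In addition, we split the dataset into training and testing subsets.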

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(
    n_samples=1000,
    n_features=10,
    n_informative=3,
    n_redundant=0,
    n_repeated=0,
    n_classes=2,
    random_state=0,
    shuffle=False,
)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)
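
As a quick sanity check on the split described above, the following sketch regenerates the same data and inspects the resulting subsets. It relies on the default `test_size` of `train_test_split` (a 75/25 split when neither size is given) and on `stratify=y` keeping the class balance nearly identical in both subsets. The printed summary is illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Same settings as in the example: informative features are columns 0, 1, 2.
X, y = make_classification(
    n_samples=1000,
    n_features=10,
    n_informative=3,
    n_redundant=0,
    n_repeated=0,
    n_classes=2,
    random_state=0,
    shuffle=False,
)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

# Subset sizes and the fraction of class 1 in each subset.
print(X_train.shape, X_test.shape)
print(f"class-1 rate: train={y_train.mean():.3f}, test={y_test.mean():.3f}")
```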

We fit a random forest classifier in order to compute the feature importances.

from sklearn.ensemble import RandomForestClassifier

feature_names = [f"feature {i}" for i in range(X.shape[1])]
forest = RandomForestClassifier(random_state=0)
forest.fit(X_train, y_train)

RandomForestClassifier(random_state=0)


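Feature importance based on mean decrease in impurity#

Feature importances are provided by the fitted attribute feature_importances_. For each feature, the impurity decrease is accumulated within each tree; the forest reports the mean of these values across trees, and their standard deviation across trees is used for the error bars.

Warning

Impurity-based feature importances can be misleading for high-cardinality features (features with many unique values). See the permutation feature importance section below for an alternative.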
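
The following sketch illustrates the bias mentioned in the warning. It is not part of the original example: it appends two hypothetical pure-noise columns to a small synthetic dataset, one continuous (many unique values) and one binary (two unique values), and prints the impurity-based importances. The continuous noise column typically receives a noticeably larger share than the binary one, even though neither carries any signal; the exact values depend on the seeds.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)

# Three informative features only (assumed setup for this illustration).
X, y = make_classification(
    n_samples=500,
    n_features=3,
    n_informative=3,
    n_redundant=0,
    n_repeated=0,
    random_state=0,
    shuffle=False,
)

# Two noise columns that are independent of y.
noise_continuous = rng.normal(size=(X.shape[0], 1))      # high cardinality
noise_binary = rng.randint(0, 2, size=(X.shape[0], 1))   # low cardinality
X_aug = np.hstack([X, noise_continuous, noise_binary])

names = [
    "informative 0",
    "informative 1",
    "informative 2",
    "noise (continuous)",
    "noise (binary)",
]
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_aug, y)

# Impurity-based importances are computed on the training data only.
for name, importance in zip(names, forest.feature_importances_):
    print(f"{name:>20}: {importance:.3f}")
```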

import time

import numpy as np

start_time = time.time()
importances = forest.feature_importances_
std = np.std([tree.feature_importances_ for tree in forest.estimators_], axis=0)
elapsed_time = time.time() - start_time

print(f"Elapsed time to compute the importances: {elapsed_time:.3f} seconds")

Elapsed time to compute the importances: 0.014 seconds
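
To make the relationship between the forest and its trees concrete, the sketch below checks that the forest-level `feature_importances_` equals the average of the per-tree importances. It assumes every tree contains at least one split, which holds for data like this. The smaller `n_estimators` is chosen only to keep the check fast.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(
    n_samples=1000,
    n_features=10,
    n_informative=3,
    n_redundant=0,
    n_repeated=0,
    n_classes=2,
    random_state=0,
    shuffle=False,
)
forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(X, y)

# One row per tree, one column per feature; each row sums to 1.
per_tree = np.array([tree.feature_importances_ for tree in forest.estimators_])

mean_over_trees = per_tree.mean(axis=0)   # what the bar heights show
std_over_trees = per_tree.std(axis=0)     # what the error bars show

print(np.allclose(forest.feature_importances_, mean_over_trees))
print(std_over_trees.round(3))
```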

Let's plot the impurity-based importances.

import pandas as pd

forest_importances = pd.Series(importances, index=feature_names)

fig, ax = plt.subplots()
forest_importances.plot.bar(yerr=std, ax=ax)
ax.set_title("Feature importances using MDI")
ax.set_ylabel("Mean decrease in impurity")
fig.tight_layout()

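We observe that, as expected, the first three features are found to be important.

Feature importance based on feature permutation#

Permutation feature importance overcomes the limitations of impurity-based feature importance: it is not biased toward high-cardinality features, and it can be computed on a held-out test set.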

from sklearn.inspection import permutation_importance

start_time = time.time()
result = permutation_importance(
    forest, X_test, y_test, n_repeats=10, random_state=42, n_jobs=2
)
elapsed_time = time.time() - start_time
print(f"Elapsed time to compute the importances: {elapsed_time:.3f} seconds")

forest_importances = pd.Series(result.importances_mean, index=feature_names)

Elapsed time to compute the importances: 0.906 seconds

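Computing the full permutation importance is more costly. Each feature is shuffled n_repeats times (10 here), and the model predicts on the shuffled data so that the resulting drop in performance can be measured. See Permutation feature importance for more details. We can now plot the importance ranking.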

fig, ax = plt.subplots()
forest_importances.plot.bar(yerr=result.importances_std, ax=ax)
ax.set_title("Feature importances using permutation on full model")
ax.set_ylabel("Mean accuracy decrease")
fig.tight_layout()
plt.show()

[Figure: Feature importances using permutation on full model]
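
The shuffling procedure can also be sketched by hand. The helper `manual_permutation_importance` below is a hypothetical illustration, not scikit-learn's implementation: it shuffles one column at a time on the test set and records how much the accuracy returned by `model.score` drops relative to the unshuffled baseline.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split


def manual_permutation_importance(model, X, y, n_repeats=5, seed=0):
    """Return the mean and std of the accuracy drop per feature (illustrative helper)."""
    rng = np.random.default_rng(seed)
    baseline = model.score(X, y)
    drops = np.zeros((X.shape[1], n_repeats))
    for j in range(X.shape[1]):
        for r in range(n_repeats):
            X_perm = X.copy()
            # Break the link between feature j and the target.
            X_perm[:, j] = rng.permutation(X_perm[:, j])
            drops[j, r] = baseline - model.score(X_perm, y)
    return drops.mean(axis=1), drops.std(axis=1)


X, y = make_classification(
    n_samples=1000,
    n_features=10,
    n_informative=3,
    n_redundant=0,
    n_repeated=0,
    n_classes=2,
    random_state=0,
    shuffle=False,
)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)

mean_drop, std_drop = manual_permutation_importance(forest, X_test, y_test)
for j, (m, s) in enumerate(zip(mean_drop, std_drop)):
    print(f"feature {j}: {m:.3f} +/- {s:.3f}")
```

Because the model is never refitted, the cost is one scoring pass per feature and repeat, which is why this approach is slower than reading `feature_importances_` from the fitted forest.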

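Both methods identify the same features as the most important, although their relative importances differ. As the plots show, MDI is less likely than permutation importance to fully omit a feature.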
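
For a numeric view of this comparison, the two sets of importances can be placed side by side. The sketch below is an illustration rather than part of the original example: it uses a smaller forest and fewer repeats to stay fast, and the `comparison` table name is arbitrary. Note that MDI values are non-negative and sum to 1, whereas permutation importances are accuracy drops and can be close to zero or slightly negative for uninformative features.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(
    n_samples=1000,
    n_features=10,
    n_informative=3,
    n_redundant=0,
    n_repeated=0,
    n_classes=2,
    random_state=0,
    shuffle=False,
)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)

# Permutation importance is evaluated on the held-out test set.
perm = permutation_importance(forest, X_test, y_test, n_repeats=5, random_state=42)

comparison = pd.DataFrame(
    {"MDI": forest.feature_importances_, "permutation": perm.importances_mean},
    index=[f"feature {i}" for i in range(X.shape[1])],
)
print(comparison.sort_values("MDI", ascending=False).round(3))
```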

Total running time of the script: (0 minutes 1.351 seconds)

