Feature importances with a forest of trees

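This example shows how to use a forest of trees to evaluate the importance of features on an artificial classification task. The blue bars show the feature importances of the forest, and the error bars show the variability between trees.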

As expected, the plot suggests that 3 features are informative, while the remaining ones are not.

# Authors: The scikit-learn developers
# SPDX-License-Identifier: BSD-3-Clause

import matplotlib.pyplot as plt

Data generation and model fitting

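We generate a synthetic dataset with only 3 informative features. We explicitly do not shuffle the dataset, so that the informative features correspond to the first three columns of X. We also split the dataset into training and testing subsets.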

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(
    n_samples=1000,
    n_features=10,
    n_informative=3,
    n_redundant=0,
    n_repeated=0,
    n_classes=2,
    random_state=0,
    shuffle=False,
)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)
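
As a quick sanity check of the split, the sketch below regenerates the same data and prints the subset sizes and class balance. With no test_size given, train_test_split holds out 25% of the samples, and stratify=y keeps the class proportions nearly identical in both subsets. The printed summaries are an illustrative addition, not part of the original example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Same generator settings as in the example above.
X, y = make_classification(
    n_samples=1000,
    n_features=10,
    n_informative=3,
    n_redundant=0,
    n_repeated=0,
    n_classes=2,
    random_state=0,
    shuffle=False,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)

# Default split: 75% train, 25% test.
print("train shape:", X_train.shape)
print("test shape:", X_test.shape)

# Stratification keeps the fraction of class 1 close in both subsets.
print("class-1 fraction (train):", y_train.mean())
print("class-1 fraction (test):", y_test.mean())
```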

A random forest classifier is fitted to compute the feature importances.

from sklearn.ensemble import RandomForestClassifier

feature_names = [f"feature {i}" for i in range(X.shape[1])]
forest = RandomForestClassifier(random_state=0)
forest.fit(X_train, y_train)

RandomForestClassifier(random_state=0)


Feature importance based on mean decrease in impurity

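Feature importances are provided by the fitted attribute feature_importances_. Within each tree, a feature's importance is the accumulated impurity decrease over the splits that use it. The forest reports the mean of these values across trees, and we compute the standard deviation across trees to show the variability.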

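Warning

Impurity-based feature importances can be misleading for high-cardinality features (many unique values). See Permutation feature importance below as an alternative.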
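
To make the warning concrete, here is a minimal sketch on hypothetical data (not the dataset used in this example). It builds one noisy informative feature and two pure-noise features that differ only in cardinality. All variable names and the 20% label-flip rate are assumptions chosen for illustration. On data like this, the continuous noise feature typically receives a much larger impurity-based importance than the binary one, even though neither carries any signal.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n_samples = 1000

y = rng.integers(0, 2, size=n_samples)

# Hypothetical informative feature: the label with about 20% of entries flipped.
flip = rng.random(n_samples) < 0.2
x_informative = np.where(flip, 1 - y, y)

# Two pure-noise features that differ only in their number of unique values.
x_noise_continuous = rng.normal(size=n_samples)  # high cardinality
x_noise_binary = rng.integers(0, 2, size=n_samples)  # low cardinality

X = np.column_stack([x_informative, x_noise_continuous, x_noise_binary])
forest = RandomForestClassifier(random_state=0).fit(X, y)

names = ["informative (binary)", "noise (continuous)", "noise (binary)"]
for name, importance in zip(names, forest.feature_importances_):
    print(f"{name:>22}: {importance:.3f}")
```

Fully grown trees can keep splitting on a continuous feature until every leaf is pure, which is why it accumulates impurity decrease even when it is pure noise.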

import time

import numpy as np

start_time = time.time()
importances = forest.feature_importances_
std = np.std([tree.feature_importances_ for tree in forest.estimators_], axis=0)
elapsed_time = time.time() - start_time

print(f"Elapsed time to compute the importances: {elapsed_time:.3f} seconds")

Elapsed time to compute the importances: 0.006 seconds
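
The relationship between the forest-level attribute and the per-tree values can be checked directly. The sketch below uses a smaller, hypothetical forest (25 trees, 6 features) to keep it fast, and assumes every tree contains at least one split. It is an illustration of how the values relate, not a description of the library's internal code path.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(
    n_samples=300,
    n_features=6,
    n_informative=3,
    n_redundant=0,
    random_state=0,
    shuffle=False,
)
forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(X, y)

# One row per tree, one column per feature.
per_tree = np.array([tree.feature_importances_ for tree in forest.estimators_])

# Average over trees, then renormalize so the importances sum to 1.
mean_over_trees = per_tree.mean(axis=0)
mean_over_trees /= mean_over_trees.sum()

matches = np.allclose(forest.feature_importances_, mean_over_trees)
print("forest importances match the per-tree mean:", matches)
print("per-tree standard deviation:", per_tree.std(axis=0).round(3))
```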

Let's plot the impurity-based importances.

import pandas as pd

forest_importances = pd.Series(importances, index=feature_names)

fig, ax = plt.subplots()
forest_importances.plot.bar(yerr=std, ax=ax)
ax.set_title("Feature importances using MDI")
ax.set_ylabel("Mean decrease in impurity")
fig.tight_layout()

We observe that, as expected, the first three features are found to be important.

Feature importance based on feature permutation

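Permutation feature importance overcomes limitations of the impurity-based feature importance: it has no bias toward high-cardinality features and can be computed on a left-out test set.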

from sklearn.inspection import permutation_importance

start_time = time.time()
result = permutation_importance(
    forest, X_test, y_test, n_repeats=10, random_state=42, n_jobs=2
)
elapsed_time = time.time() - start_time
print(f"Elapsed time to compute the importances: {elapsed_time:.3f} seconds")

forest_importances = pd.Series(result.importances_mean, index=feature_names)

Elapsed time to compute the importances: 0.413 seconds
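
The object returned by permutation_importance holds the raw per-repeat scores as well as their summary statistics. The sketch below inspects them on a smaller, hypothetical setup (6 features, 5 repeats) chosen only to keep the run short.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(
    n_samples=300,
    n_features=6,
    n_informative=3,
    n_redundant=0,
    random_state=0,
    shuffle=False,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)
forest = RandomForestClassifier(n_estimators=25, random_state=0)
forest.fit(X_train, y_train)

result = permutation_importance(
    forest, X_test, y_test, n_repeats=5, random_state=0
)

# Raw scores: one row per feature, one column per repeat.
print("importances shape:", result.importances.shape)

# The summary attributes are the row-wise mean and standard deviation.
mean_ok = np.allclose(result.importances.mean(axis=1), result.importances_mean)
std_ok = np.allclose(result.importances.std(axis=1), result.importances_std)
print("mean matches:", mean_ok, "| std matches:", std_ok)
```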

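Computing the full permutation importance is more costly. Each feature is shuffled n_repeats times (10 here), and the already-fitted model is re-scored on the shuffled data; the model is not refitted. The importance of a feature is the resulting drop in score. See Permutation feature importance for more details. We can now plot the importance ranking.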
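
The shuffle-and-re-score procedure can be sketched by hand for a single column. The helper manual_permutation_importance below is a hypothetical name introduced for illustration; it is a simplified stand-in for the idea, not the implementation used by sklearn.inspection.permutation_importance, and its numbers will not match the library's exactly because the random shuffles differ.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(
    n_samples=1000,
    n_features=10,
    n_informative=3,
    n_redundant=0,
    n_repeated=0,
    n_classes=2,
    random_state=0,
    shuffle=False,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)


def manual_permutation_importance(model, X, y, column, n_repeats=10, seed=0):
    """Mean and std of the accuracy drop when one column is shuffled.

    Hypothetical helper: the model is only re-scored, never refitted.
    """
    rng = np.random.default_rng(seed)
    baseline = model.score(X, y)
    drops = []
    for _ in range(n_repeats):
        X_permuted = X.copy()  # leave the caller's array untouched
        X_permuted[:, column] = rng.permutation(X_permuted[:, column])
        drops.append(baseline - model.score(X_permuted, y))
    return float(np.mean(drops)), float(np.std(drops))


# Compare an informative column (0) with a noise column (9).
drop_first, std_first = manual_permutation_importance(model, X_test, y_test, 0)
drop_last, std_last = manual_permutation_importance(model, X_test, y_test, 9)
print(f"feature 0: {drop_first:.3f} +/- {std_first:.3f}")
print(f"feature 9: {drop_last:.3f} +/- {std_last:.3f}")
```

Because each repeat only calls model.score on a modified copy of the test set, the cost grows with n_repeats and the number of features, but no training is repeated.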

fig, ax = plt.subplots()
forest_importances.plot.bar(yerr=result.importances_std, ax=ax)
ax.set_title("Feature importances using permutation on full model")
ax.set_ylabel("Mean accuracy decrease")
fig.tight_layout()
plt.show()

Figure: Feature importances using permutation on full model

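Both methods detect the same features as most important, although their relative importances differ. As the plots show, MDI is less likely than permutation importance to fully omit a feature.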

Total running time of the script: (0 minutes 0.899 seconds)

