基于树森林的特征重要性#

此示例演示了如何使用树森林来评估特征在人工分类任务中的重要性。蓝色条表示森林的特征重要性，误差条表示树间差异。

正如预期的那样，该图表明3个特征是有信息的，而其余特征则不是。

# Authors: The scikit-learn developers
# SPDX-License-Identifier: BSD-3-Clause

import matplotlib.pyplot as plt

数据生成和模型拟合#

我们生成一个只有3个信息特征的合成数据集。我们将明确不打乱数据集，以确保信息特征将对应于X的前三列。此外，我们将数据集拆分为训练集和测试集。

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(
    n_samples=1000,
    n_features=10,
    n_informative=3,
    n_redundant=0,
    n_repeated=0,
    n_classes=2,
    random_state=0,
    shuffle=False,
)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

将拟合随机森林分类器以计算特征重要性。

from sklearn.ensemble import RandomForestClassifier

feature_names = [f"feature {i}" for i in range(X.shape[1])]
forest = RandomForestClassifier(random_state=0)
forest.fit(X_train, y_train)

RandomForestClassifier(random_state=0)

在Jupyter环境中，请重新运行此单元格以显示HTML表示或信任notebook。
在GitHub上，HTML表示无法呈现，请尝试使用nbviewer.org加载此页面。

基于均值减少不纯度的特征重要性#

特征重要性由拟合属性feature_importances_提供，它计算为每棵树内不纯度减少累积的均值和标准差。

警告

对于**高基数**特征（许多唯一值），基于不纯度的特征重要性可能会产生误导。请参见下面的排列特征重要性作为替代方法。

import time

import numpy as np

start_time = time.time()
importances = forest.feature_importances_
std = np.std([tree.feature_importances_ for tree in forest.estimators_], axis=0)
elapsed_time = time.time() - start_time

print(f"Elapsed time to compute the importances: {elapsed_time:.3f} seconds")

Elapsed time to compute the importances: 0.009 seconds

让我们绘制基于不纯度的重要性。

import pandas as pd

forest_importances = pd.Series(importances, index=feature_names)

fig, ax = plt.subplots()
forest_importances.plot.bar(yerr=std, ax=ax)
ax.set_title("Feature importances using MDI")
ax.set_ylabel("Mean decrease in impurity")
fig.tight_layout()

我们观察到，正如预期的那样，前三个特征被认为很重要。

基于特征排列的特征重要性#

排列特征重要性克服了基于不纯度的特征重要性的局限性：它们对高基数特征没有偏差，并且可以在留出的测试集上计算。

from sklearn.inspection import permutation_importance

start_time = time.time()
result = permutation_importance(
    forest, X_test, y_test, n_repeats=10, random_state=42, n_jobs=2
)
elapsed_time = time.time() - start_time
print(f"Elapsed time to compute the importances: {elapsed_time:.3f} seconds")

forest_importances = pd.Series(result.importances_mean, index=feature_names)

Elapsed time to compute the importances: 0.726 seconds

完整排列重要性的计算成本更高。特征被随机打乱n次，并重新拟合模型以估计其重要性。有关更多详细信息，请参见排列特征重要性。现在我们可以绘制重要性排名。

fig, ax = plt.subplots()
forest_importances.plot.bar(yerr=result.importances_std, ax=ax)
ax.set_title("Feature importances using permutation on full model")
ax.set_ylabel("Mean accuracy decrease")
fig.tight_layout()
plt.show()

Feature importances using permutation on full model

两种方法都检测到相同的特征最重要。尽管相对重要性有所不同。从图中可以看出，与排列重要性相比，MDI不太可能完全忽略一个特征。

**脚本总运行时间：**（0 分钟 1.355 秒）