IsolationForest 示例#

使用 IsolationForest 进行异常检测的示例。

Isolation Forest 是一个由“Isolation Trees”（隔离树）组成的集成模型，它通过递归随机划分来“隔离”观测值，这可以用树结构表示。对于离群值（outliers），隔离样本所需的划分次数较少；对于正常值（inliers），则需要较多的划分次数。

在本示例中，我们演示了两种可视化在玩具数据集上训练的 Isolation Forest 决策边界的方法。

# Authors: The scikit-learn developers
# SPDX-License-Identifier: BSD-3-Clause

数据生成#

我们通过随机采样 numpy.random.randn 返回的标准正态分布来生成两个簇（每个簇包含 n_samples）。其中一个簇是球形的，另一个簇略微变形。

为了与 IsolationForest 的表示法保持一致，正常值（即高斯簇）被赋予地面真值标签 1，而离群值（使用 numpy.random.uniform 创建）被赋予标签 -1。

import numpy as np

from sklearn.model_selection import train_test_split

n_samples, n_outliers = 120, 40
rng = np.random.RandomState(0)
covariance = np.array([[0.5, -0.1], [0.7, 0.4]])
cluster_1 = 0.4 * rng.randn(n_samples, 2) @ covariance + np.array([2, 2])  # general
cluster_2 = 0.3 * rng.randn(n_samples, 2) + np.array([-2, -2])  # spherical
outliers = rng.uniform(low=-4, high=4, size=(n_outliers, 2))

X = np.concatenate([cluster_1, cluster_2, outliers])
y = np.concatenate(
    [np.ones((2 * n_samples), dtype=int), -np.ones((n_outliers), dtype=int)]
)

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

我们可以可视化生成的簇

import matplotlib.pyplot as plt

scatter = plt.scatter(X[:, 0], X[:, 1], c=y, s=20, edgecolor="k")
handles, labels = scatter.legend_elements()
plt.axis("square")
plt.legend(handles=handles, labels=["outliers", "inliers"], title="true class")
plt.title("Gaussian inliers with \nuniformly distributed outliers")
plt.show()

Gaussian inliers with uniformly distributed outliers

模型训练#

from sklearn.ensemble import IsolationForest

clf = IsolationForest(max_samples=100, random_state=0)
clf.fit(X_train)

IsolationForest(max_samples=100, random_state=0)

在 Jupyter 环境中，请重新运行此单元格以显示 HTML 表示形式或信任 notebook。
在 GitHub 上，HTML 表示形式无法渲染，请尝试使用 nbviewer.org 加载此页面。

绘制离散决策边界#

我们使用类 DecisionBoundaryDisplay 来可视化离散决策边界。背景颜色表示该给定区域中的样本是否被预测为离群值。散点图显示真实的标签。

import matplotlib.pyplot as plt

from sklearn.inspection import DecisionBoundaryDisplay

disp = DecisionBoundaryDisplay.from_estimator(
    clf,
    X,
    response_method="predict",
    alpha=0.5,
)
disp.ax_.scatter(X[:, 0], X[:, 1], c=y, s=20, edgecolor="k")
disp.ax_.set_title("Binary decision boundary \nof IsolationForest")
plt.axis("square")
plt.legend(handles=handles, labels=["outliers", "inliers"], title="true class")
plt.show()

Binary decision boundary of IsolationForest

绘制路径长度决策边界#

通过设置 response_method="decision_function"，DecisionBoundaryDisplay 的背景表示观测值的正常性度量。这个分数由在随机树森林中平均的路径长度给出，路径长度本身由隔离给定样本所需的叶子深度（或等效于分裂次数）给出。

当随机树森林集体为隔离某些特定样本产生短路径长度时，它们极有可能是异常值，正常性度量接近 0。同样，长路径对应于接近 1 的值，并且更可能是正常值。

disp = DecisionBoundaryDisplay.from_estimator(
    clf,
    X,
    response_method="decision_function",
    alpha=0.5,
)
disp.ax_.scatter(X[:, 0], X[:, 1], c=y, s=20, edgecolor="k")
disp.ax_.set_title("Path length decision boundary \nof IsolationForest")
plt.axis("square")
plt.legend(handles=handles, labels=["outliers", "inliers"], title="true class")
plt.colorbar(disp.ax_.collections[1])
plt.show()

Path length decision boundary of IsolationForest

脚本总运行时间： (0 分钟 0.413 秒)

	n_estimators n_estimators: int, default=100 集成中的基础估计器数量。	100
	max_samples max_samples: "auto", int or float, default="auto" 从 X 中抽取用于训练每个基础估计器的样本数量。 - 如果为 int，则抽取 `max_samples` 个样本。 - 如果为 float，则抽取 `max_samples * X.shape[0]` 个样本。 - 如果为 "auto"，则 `max_samples=min(256, n_samples)`。如果 max_samples 大于提供的样本数量，所有样本将用于所有树（不进行采样）。	100
	contamination contamination: 'auto' or float, default='auto' 数据集的污染程度，即数据集中离群值的比例。在拟合时用于定义样本分数的阈值。 - 如果为 'auto'，则按照原始论文中的方式确定阈值。 - 如果为 float，则污染程度应在范围 (0, 0.5] 内。 .. versionchanged:: 0.22 ``contamination`` 的默认值从 0.1 更改为 ``'auto'``。	'auto'
	max_features max_features: int or float, default=1.0 从 X 中抽取用于训练每个基础估计器的特征数量。 - 如果为 int，则抽取 `max_features` 个特征。 - 如果为 float，则抽取 `max(1, int(max_features * n_features_in_))` 个特征。注意：使用小于 1.0 的浮点数或小于特征数量的整数将启用特征子采样，并导致运行时间延长。	1.0
	bootstrap bootstrap: bool, default=False 如果为 True，则使用有放回抽样从训练数据中随机子集拟合单个树。如果为 False，则执行无放回抽样。	False
	n_jobs n_jobs: int, default=None 为 :meth:`fit` 并行运行的作业数。``None`` 表示 1，除非在 :obj:`joblib.parallel_backend` 上下文中。``-1`` 表示使用所有处理器。有关详细信息，请参阅 :term:`Glossary `。	None
	random_state random_state: int, RandomState instance or None, default=None 控制特征选择和分裂值的伪随机性，以及森林中每棵树的每个分支步骤。传入 int 可在多次函数调用中获得可重现的结果。请参阅 :term:`Glossary `。	0
	verbose verbose: int, default=0 控制树构建过程的详细程度。	0
	warm_start warm_start: bool, default=False 设置为 ``True`` 时，重用上一次调用 fit 的解决方案，并向集成中添加更多估计器；否则，拟合一个全新的森林。请参阅 :term:`Glossary `。 .. versionadded:: 0.21	False