The Johnson-Lindenstrauss bound for embedding with random projections#

The Johnson-Lindenstrauss lemma states that any high-dimensional dataset can be randomly projected into a lower-dimensional Euclidean space while controlling the distortion in the pairwise distances.

# Authors: The scikit-learn developers
# SPDX-License-Identifier: BSD-3-Clause

import sys
from time import time

import matplotlib.pyplot as plt
import numpy as np

from sklearn.datasets import fetch_20newsgroups_vectorized, load_digits
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.random_projection import (
    SparseRandomProjection,
    johnson_lindenstrauss_min_dim,
)

Theoretical bounds#

The distortion introduced by a random projection p is controlled by the fact that p defines an eps-embedding with high probability, as defined by:

\[(1 - eps) \|u - v\|^2 < \|p(u) - p(v)\|^2 < (1 + eps) \|u - v\|^2\]

where u and v are any rows taken from a dataset of shape (n_samples, n_features) and p is a projection by a random Gaussian N(0, 1) matrix of shape (n_components, n_features) (or a sparse Achlioptas matrix).
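As an illustration of this definition, the following minimal NumPy sketch (not part of the original example; the vectors, sizes, and seed are arbitrary choices) draws a Gaussian projection matrix scaled by 1/sqrt(n_components), so that squared distances are preserved in expectation, and checks the eps-embedding inequality for one pair of vectors:

import numpy as np

rng = np.random.RandomState(42)
n_features, n_components, eps = 10_000, 5_000, 0.1

# one arbitrary pair of high-dimensional vectors
u, v = rng.randn(2, n_features)

# Gaussian N(0, 1) projection matrix, scaled so that squared
# distances are preserved in expectation
p = rng.randn(n_components, n_features) / np.sqrt(n_components)

original = np.sum((u - v) ** 2)
projected = np.sum((p @ u - p @ v) ** 2)

# the inequality holds with high probability at this n_components
assert (1 - eps) * original < projected < (1 + eps) * original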

The minimum number of components required to guarantee the eps-embedding is given by:

\[n\_components \geq \frac{4 \log(n\_samples)}{eps^2 / 2 - eps^3 / 3}\]
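As a quick sanity check (a sketch with arbitrary example values), this bound can be evaluated directly and compared against the johnson_lindenstrauss_min_dim helper used throughout this example:

import numpy as np
from sklearn.random_projection import johnson_lindenstrauss_min_dim

n_samples, eps = 1e6, 0.5
# closed-form bound from the formula above, truncated to an integer
print(int(4 * np.log(n_samples) / (eps**2 / 2 - eps**3 / 3)))  # 663
print(johnson_lindenstrauss_min_dim(n_samples, eps=0.5))  # 663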

The first plot shows that with an increasing number of samples n_samples, the minimal number of dimensions n_components needed to guarantee an eps-embedding grows logarithmically.

# range of admissible distortions
eps_range = np.linspace(0.1, 0.99, 5)
colors = plt.cm.Blues(np.linspace(0.3, 1.0, len(eps_range)))

# range of number of samples (observation) to embed
n_samples_range = np.logspace(1, 9, 9)

plt.figure()
for eps, color in zip(eps_range, colors):
    min_n_components = johnson_lindenstrauss_min_dim(n_samples_range, eps=eps)
    plt.loglog(n_samples_range, min_n_components, color=color)

plt.legend([f"eps = {eps:0.1f}" for eps in eps_range], loc="lower right")
plt.xlabel("Number of observations to eps-embed")
plt.ylabel("Minimum number of dimensions")
plt.title("Johnson-Lindenstrauss bounds:\nn_samples vs n_components")
plt.show()
[Figure: Johnson-Lindenstrauss bounds, n_samples vs n_components]

The second plot shows that, for a given number of samples n_samples, increasing the admissible distortion eps drastically reduces the minimal number of dimensions n_components.

# range of admissible distortions
eps_range = np.linspace(0.01, 0.99, 100)

# range of number of samples (observation) to embed
n_samples_range = np.logspace(2, 6, 5)
colors = plt.cm.Blues(np.linspace(0.3, 1.0, len(n_samples_range)))

plt.figure()
for n_samples, color in zip(n_samples_range, colors):
    min_n_components = johnson_lindenstrauss_min_dim(n_samples, eps=eps_range)
    plt.semilogy(eps_range, min_n_components, color=color)

plt.legend([f"n_samples = {n}" for n in n_samples_range], loc="upper right")
plt.xlabel("Distortion eps")
plt.ylabel("Minimum number of dimensions")
plt.title("Johnson-Lindenstrauss bounds:\nn_components vs eps")
plt.show()
[Figure: Johnson-Lindenstrauss bounds, n_components vs eps]

Empirical validation#

We validate the above bounds on the 20 newsgroups text documents (TF-IDF word frequencies) dataset or on the digits dataset:

  • For the 20 newsgroups dataset, a total of 300 documents with 100k features are projected using a sparse random matrix to smaller Euclidean spaces, with various values for the target number of dimensions n_components.

  • For the digits dataset, the 8x8 gray-level pixel data of 300 handwritten digit pictures is randomly projected to spaces of various larger numbers of dimensions n_components.

The default dataset is the 20 newsgroups dataset. To run the example on the digits dataset, pass the --use-digits-dataset command line argument to this script.

if "--use-digits-dataset" in sys.argv:
    data = load_digits().data[:300]
else:
    data = fetch_20newsgroups_vectorized().data[:300]

For each value of n_components we plot:

  • the 2D distribution of sample pairs with pairwise distances in the original and projected spaces on the x- and y-axis respectively.

  • a 1D histogram of the ratios of those distances (projected / original).

n_samples, n_features = data.shape
print(
    f"Embedding {n_samples} samples with dim {n_features} using various "
    "random projections"
)

n_components_range = np.array([300, 1_000, 10_000])
dists = euclidean_distances(data, squared=True).ravel()

# select only non-identical samples pairs
nonzero = dists != 0
dists = dists[nonzero]

for n_components in n_components_range:
    t0 = time()
    rp = SparseRandomProjection(n_components=n_components)
    projected_data = rp.fit_transform(data)
    print(
        f"Projected {n_samples} samples from {n_features} to {n_components} in "
        f"{time() - t0:0.3f}s"
    )
    if hasattr(rp, "components_"):
        n_bytes = rp.components_.data.nbytes
        n_bytes += rp.components_.indices.nbytes
        print(f"Random matrix with size: {n_bytes / 1e6:0.3f} MB")

    projected_dists = euclidean_distances(projected_data, squared=True).ravel()[nonzero]

    plt.figure()
    min_dist = min(projected_dists.min(), dists.min())
    max_dist = max(projected_dists.max(), dists.max())
    plt.hexbin(
        dists,
        projected_dists,
        gridsize=100,
        cmap=plt.cm.PuBu,
        extent=[min_dist, max_dist, min_dist, max_dist],
    )
    plt.xlabel("Pairwise squared distances in original space")
    plt.ylabel("Pairwise squared distances in projected space")
    plt.title("Pairwise distances distribution for n_components=%d" % n_components)
    cb = plt.colorbar()
    cb.set_label("Sample pairs counts")

    rates = projected_dists / dists
    print(f"Mean distances rate: {np.mean(rates):.2f} ({np.std(rates):.2f})")

    plt.figure()
    plt.hist(rates, bins=50, range=(0.0, 2.0), edgecolor="k", density=True)
    plt.xlabel("Squared distances rate: projected / original")
    plt.ylabel("Distribution of samples pairs")
    plt.title("Histogram of pairwise distance rates for n_components=%d" % n_components)

    # TODO: compute the expected value of eps and add them to the previous plot
    # as vertical lines / region

plt.show()
[Figures: pairwise distances distribution and histogram of pairwise distance rates, for n_components = 300, 1000, and 10000]
Embedding 300 samples with dim 130107 using various random projections
Projected 300 samples from 130107 to 300 in 0.218s
Random matrix with size: 1.301 MB
Mean distances rate: 1.02 (0.17)
Projected 300 samples from 130107 to 1000 in 0.729s
Random matrix with size: 4.324 MB
Mean distances rate: 1.01 (0.09)
Projected 300 samples from 130107 to 10000 in 7.167s
Random matrix with size: 43.239 MB
Mean distances rate: 1.01 (0.03)

We can see that for low values of n_components the distribution is wide, with many distorted pairs and a skewed shape (a hard limit of zero on the left side, since ratios of distances are always positive), while for larger values of n_components the distortion is controlled and the distances are well preserved by the random projection.
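Relatedly to the TODO in the code above, the empirical distortion can be summarized in a single number. The helper below is a hypothetical sketch (not part of the original script): it estimates eps as the largest relative deviation of the squared-distance ratios from 1 for a given n_components:

import numpy as np
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.random_projection import SparseRandomProjection

def empirical_eps(data, n_components, random_state=42):
    """Largest relative deviation of squared-distance ratios from 1."""
    projected = SparseRandomProjection(
        n_components=n_components, random_state=random_state
    ).fit_transform(data)
    original_dists = euclidean_distances(data, squared=True).ravel()
    projected_dists = euclidean_distances(projected, squared=True).ravel()
    nonzero = original_dists != 0
    rates = projected_dists[nonzero] / original_dists[nonzero]
    return np.abs(rates - 1).max()

# e.g. empirical_eps(data, 10_000) should be markedly smaller than
# empirical_eps(data, 300) on the 20 newsgroups subset loaded above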

Remarks#

According to the JL lemma, projecting 300 samples without too much distortion will require at least several thousands of dimensions, irrespective of the number of features of the original dataset.

Hence, using random projections on the digits dataset, which has only 64 features in the input space, does not make sense: it would not allow dimensionality reduction in this case.

On the 20 newsgroups dataset, on the other hand, the dimensionality can be decreased from 130,107 down to 10,000 while reasonably preserving pairwise distances, as reported in the output above.
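These remarks can be checked with the johnson_lindenstrauss_min_dim helper; as a rough illustration (the printed integers are approximate and may vary slightly with the library's rounding), the bound for 300 samples is already in the thousands for small eps and never involves n_features:

from sklearn.random_projection import johnson_lindenstrauss_min_dim

# the bound depends only on n_samples and eps, never on n_features
print(johnson_lindenstrauss_min_dim(n_samples=300, eps=0.1))  # roughly 4888
print(johnson_lindenstrauss_min_dim(n_samples=300, eps=0.5))  # roughly 273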

Total running time of the script: (0 minutes 10.618 seconds)

Related examples

Manifold learning on handwritten digits: Locally Linear Embedding, Isomap…

Demo of HDBSCAN clustering algorithm

Ledoit-Wolf vs OAS estimation

Multi-dimensional scaling

Gallery generated by Sphinx-Gallery