如何优化速度#

以下提供一些实用指南，帮助您为 scikit-learn 项目编写高效的代码。

注意

虽然分析代码以**检查性能假设**始终有用，但强烈建议您**查阅文献**，以确保所实现的算法是当前任务的最新技术，然后再进行代价高昂的实现优化。

很多次，投入在优化复杂实现细节上的数小时努力，都会因为随后发现简单的**算法技巧**或使用更适合问题的另一种算法而变得无关紧要。

Python、Cython 或 C/C++？#

一般来说，scikit-learn 项目强调源代码的**可读性**，以便项目用户可以轻松深入研究源代码，从而了解算法在其数据上的行为，同时也便于维护（由开发人员维护）。

因此，在实现新算法时，建议**首先使用 NumPy 和 SciPy 以 Python 实现它**，注意避免使用这些库的矢量化习惯用法进行循环代码。在实践中，这意味着尝试**用等效的 NumPy 数组方法替换任何嵌套的 for 循环**。目标是避免 CPU 在 Python 解释器中浪费时间，而不是计算数字以拟合您的统计模型。通常最好考虑 NumPy 和 SciPy 的性能技巧： https://scipy.github.io/old-wiki/pages/PerformanceTips

但是，有时算法无法用简单的矢量化 NumPy 代码有效地表达。在这种情况下，推荐的策略如下：

**分析** Python 实现以找到主要瓶颈，并将其隔离在**专用模块级函数**中。此函数将被重新实现为编译的扩展模块。
如果存在维护良好的 BSD 或 MIT **C/C++** 实现的相同算法，并且该算法不太大，您可以为其编写一个 **Cython 包装器**，并将库的源代码副本包含在 scikit-learn 源代码树中：此策略用于类 svm.LinearSVC、svm.SVC 和 linear_model.LogisticRegression（liblinear 和 libsvm 的包装器）。
否则，使用 **Cython** 直接编写 Python 函数的优化版本。例如，此策略用于 linear_model.ElasticNet 和 linear_model.SGDClassifier 类。
**将函数的 Python 版本移动到测试中**，并使用它来检查编译扩展的结果与黄金标准（易于调试的 Python 版本）是否一致。
优化代码后（不是通过分析可以发现的简单瓶颈），检查是否可以进行适用于**多处理**的**粗粒度并行化**，方法是使用 joblib.Parallel 类。

分析 Python 代码#

为了分析 Python 代码，我们建议编写一个脚本加载并准备您的数据，然后使用 IPython 集成分析器来交互式地探索代码的相关部分。

假设我们要分析 scikit-learn 的非负矩阵分解模块。让我们设置一个新的 IPython 会话并加载数字数据集，就像在识别手写数字示例中一样。

In [1]: from sklearn.decomposition import NMF

In [2]: from sklearn.datasets import load_digits

In [3]: X, _ = load_digits(return_X_y=True)

在开始分析会话并进行尝试性优化迭代之前，务必测量我们要优化的函数的总执行时间（没有任何分析器开销），并将其保存到某个位置以供以后参考。

In [4]: %timeit NMF(n_components=16, tol=1e-2).fit(X)
1 loops, best of 3: 1.7 s per loop

要使用 %prun 魔术命令查看整体性能分析：

In [5]: %prun -l nmf.py NMF(n_components=16, tol=1e-2).fit(X)
         14496 function calls in 1.682 CPU seconds

   Ordered by: internal time
   List reduced from 90 to 9 due to restriction <'nmf.py'>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       36    0.609    0.017    1.499    0.042 nmf.py:151(_nls_subproblem)
     1263    0.157    0.000    0.157    0.000 nmf.py:18(_pos)
        1    0.053    0.053    1.681    1.681 nmf.py:352(fit_transform)
      673    0.008    0.000    0.057    0.000 nmf.py:28(norm)
        1    0.006    0.006    0.047    0.047 nmf.py:42(_initialize_nmf)
       36    0.001    0.000    0.010    0.000 nmf.py:36(_sparseness)
       30    0.001    0.000    0.001    0.000 nmf.py:23(_neg)
        1    0.000    0.000    0.000    0.000 nmf.py:337(__init__)
        1    0.000    0.000    1.681    1.681 nmf.py:461(fit)

tottime 列是最有趣的：它给出执行给定函数代码的总时间，忽略在执行子函数中花费的时间。真实总时间（本地代码 + 子函数调用）由 cumtime 列给出。

注意使用了 -l nmf.py，它将输出限制为包含“nmf.py”字符串的行。这对于快速查看 nmf Python 模块本身的热点很有用，忽略其他任何内容。

以下是相同命令在没有 -l nmf.py 过滤器时的输出开头：

In [5] %prun NMF(n_components=16, tol=1e-2).fit(X)
         16159 function calls in 1.840 CPU seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     2833    0.653    0.000    0.653    0.000 {numpy.core._dotblas.dot}
       46    0.651    0.014    1.636    0.036 nmf.py:151(_nls_subproblem)
     1397    0.171    0.000    0.171    0.000 nmf.py:18(_pos)
     2780    0.167    0.000    0.167    0.000 {method 'sum' of 'numpy.ndarray' objects}
        1    0.064    0.064    1.840    1.840 nmf.py:352(fit_transform)
     1542    0.043    0.000    0.043    0.000 {method 'flatten' of 'numpy.ndarray' objects}
      337    0.019    0.000    0.019    0.000 {method 'all' of 'numpy.ndarray' objects}
     2734    0.011    0.000    0.181    0.000 fromnumeric.py:1185(sum)
        2    0.010    0.005    0.010    0.005 {numpy.linalg.lapack_lite.dgesdd}
      748    0.009    0.000    0.065    0.000 nmf.py:28(norm)
...

上述结果表明，执行主要由点积运算（委托给 blas）支配。因此，通过用 Cython 或 C/C++ 重写此代码可能不会获得巨大的收益：在这种情况下，在 1.7 秒的总执行时间中，几乎 0.7 秒都花在我们可以认为是最佳的编译代码上。通过重写其余的 Python 代码，并假设我们可以在此部分获得 1000% 的提升（鉴于 Python 循环的浅显性，这是极不可能的），我们整体上不会获得超过 2.4 倍的加速。

因此，在这个具体的例子中，只有通过**算法改进**才能实现重大的改进（例如，尝试找到既代价高昂又无用的操作，从而避免计算它们，而不是试图优化它们的实现）。

然而，检查_nls_subproblem函数内部发生了什么仍然很有趣，如果我们只考虑Python代码，它是热点：它占据了模块累积时间的约100%。为了更好地理解这个特定函数的概要信息，让我们安装line_profiler并将其连接到IPython。

pip install line_profiler

在IPython 0.13+版本下，首先创建一个配置配置文件

ipython profile create

然后在~/.ipython/profile_default/ipython_config.py中注册line_profiler扩展。

c.TerminalIPythonApp.extensions.append('line_profiler')
c.InteractiveShellApp.extensions.append('line_profiler')

这将在IPython终端应用程序和其他前端（如qtconsole和notebook）中注册%lprun魔法命令。

现在重启IPython，让我们使用这个新工具。

In [1]: from sklearn.datasets import load_digits

In [2]: from sklearn.decomposition import NMF
  ... : from sklearn.decomposition._nmf import _nls_subproblem

In [3]: X, _ = load_digits(return_X_y=True)

In [4]: %lprun -f _nls_subproblem NMF(n_components=16, tol=1e-2).fit(X)
Timer unit: 1e-06 s

File: sklearn/decomposition/nmf.py
Function: _nls_subproblem at line 137
Total time: 1.73153 s

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   137                                           def _nls_subproblem(V, W, H_init, tol, max_iter):
   138                                               """Non-negative least square solver
   ...
   170                                               """
   171        48         5863    122.1      0.3      if (H_init < 0).any():
   172                                                   raise ValueError("Negative values in H_init passed to NLS solver.")
   173
   174        48          139      2.9      0.0      H = H_init
   175        48       112141   2336.3      5.8      WtV = np.dot(W.T, V)
   176        48        16144    336.3      0.8      WtW = np.dot(W.T, W)
   177
   178                                               # values justified in the paper
   179        48          144      3.0      0.0      alpha = 1
   180        48          113      2.4      0.0      beta = 0.1
   181       638         1880      2.9      0.1      for n_iter in range(1, max_iter + 1):
   182       638       195133    305.9     10.2          grad = np.dot(WtW, H) - WtV
   183       638       495761    777.1     25.9          proj_gradient = norm(grad[np.logical_or(grad < 0, H > 0)])
   184       638         2449      3.8      0.1          if proj_gradient < tol:
   185        48          130      2.7      0.0              break
   186
   187      1474         4474      3.0      0.2          for inner_iter in range(1, 20):
   188      1474        83833     56.9      4.4              Hn = H - alpha * grad
   189                                                       # Hn = np.where(Hn > 0, Hn, 0)
   190      1474       194239    131.8     10.1              Hn = _pos(Hn)
   191      1474        48858     33.1      2.5              d = Hn - H
   192      1474       150407    102.0      7.8              gradd = np.sum(grad * d)
   193      1474       515390    349.7     26.9              dQd = np.sum(np.dot(WtW, d) * d)
   ...

通过查看% Time列的顶部值，很容易找到最昂贵的表达式，这些表达式值得额外关注。

内存使用情况分析#

借助memory_profiler，您可以详细分析任何Python代码的内存使用情况。首先，安装最新版本

pip install -U memory_profiler

然后，以类似于line_profiler的方式设置魔法命令。

在IPython 0.11+版本下，首先创建一个配置配置文件

ipython profile create

然后在~/.ipython/profile_default/ipython_config.py中与line profiler一起注册扩展。

c.TerminalIPythonApp.extensions.append('memory_profiler')
c.InteractiveShellApp.extensions.append('memory_profiler')

这将在IPython终端应用程序和其他前端（如qtconsole和notebook）中注册%memit和%mprun魔法命令。

%mprun用于逐行检查程序中关键函数的内存使用情况。它与上一节中讨论的%lprun非常相似。例如，来自memory_profiler examples目录

In [1] from example import my_func

In [2] %mprun -f my_func my_func()
Filename: example.py

Line #    Mem usage  Increment   Line Contents
==============================================
     3                           @profile
     4      5.97 MB    0.00 MB   def my_func():
     5     13.61 MB    7.64 MB       a = [1] * (10 ** 6)
     6    166.20 MB  152.59 MB       b = [2] * (2 * 10 ** 7)
     7     13.61 MB -152.59 MB       del b
     8     13.61 MB    0.00 MB       return a

另一个有用的魔法命令是memory_profiler定义的%memit，它类似于%timeit。它可以按如下方式使用

In [1]: import numpy as np

In [2]: %memit np.zeros(1e7)
maximum of 3: 76.402344 MB per loop

有关更多详细信息，请使用%memit?和%mprun?查看魔法命令的文档字符串。

使用Cython#

如果Python代码的分析表明，Python解释器的开销比实际数值计算的成本大一个数量级或更多（例如，for循环遍历向量分量、嵌套求值条件表达式、标量算术……），则可能需要将代码的热点部分作为独立函数提取到.pyx文件中，添加静态类型声明，然后使用Cython生成适合编译为Python扩展模块的C程序。

Cython的文档包含开发此类模块的教程和参考指南。有关为scikit-learn开发Cython的更多信息，请参见Cython最佳实践、约定和知识。

分析编译后的扩展#

当使用编译后的扩展（用C/C++编写并带有包装器，或直接作为Cython扩展）时，默认的Python分析器是无用的：我们需要一个专用工具来检查编译后的扩展本身内部发生了什么。

使用yep和gperftools#

无需特殊的编译选项即可轻松进行分析，使用yep

使用调试器，gdb#

使用gdb进行调试很有帮助。为此，必须使用带有调试支持（调试符号和适当的优化）的Python解释器。要创建一个新的conda环境（在构建/安装后可能需要停用并重新激活），其中包含一个源代码构建的CPython解释器
```
git clone https://github.com/python/cpython.git
conda create -n debug-scikit-dev
conda activate debug-scikit-dev
cd cpython
mkdir debug
cd debug
../configure --prefix=$CONDA_PREFIX --with-pydebug
make EXTRA_CFLAGS='-DPy_DEBUG' -j<num_cores>
make install
```

使用gprof#

为了分析编译后的Python扩展，可以在使用gcc -pg重新编译项目并在debian/ubuntu上使用python-dbg版本的解释器后使用gprof：但是，这种方法也需要使用-pg重新编译numpy和scipy，这实现起来相当复杂。

幸运的是，存在两个不需要重新编译所有内容的替代分析器。

使用valgrind/callgrind/kcachegrind#

kcachegrind#

yep可用于创建分析报告。kcachegrind提供了一个图形化环境来可视化此报告。

# Run yep to profile some python script
python -m yep -c my_file.py

# open my_file.py.callgrin with kcachegrind
kcachegrind my_file.py.prof

注意

yep可以使用参数--lines或-l来编译“逐行”的分析报告。

使用`joblib.Parallel`进行多核并行化#

参见joblib文档

一个简单的算法技巧：温启动#

参见warm_start的词汇表条目