fetch_rcv1

sklearn.datasets.fetch_rcv1(*, data_home=None, subset='all', download_if_missing=True, random_state=None, shuffle=False, return_X_y=False, n_retries=3, delay=1.0)

Load the RCV1 multilabel dataset (classification).

Download it if necessary.

Version: RCV1-v2, vectors, full sets, topics multilabels.

Classes	103
Samples total	804414
Dimensionality	47236
Features	real, between 0 and 1

Read more in the User Guide.

Added in version 0.17.

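Parameters:

data_home : str or path-like, default=None: Specify another download and cache folder for the datasets. By default, all scikit-learn data is stored in '~/scikit_learn_data' subfolders.
subset : {'train', 'test', 'all'}, default='all': Select the dataset to load: 'train' for the training set (23149 samples), 'test' for the test set (781265 samples), 'all' for both, with the training samples first if shuffle is False. This follows the official LYRL2004 chronological split.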
download_if_missing : bool, default=True: If False, raise an OSError if the data is not locally available instead of trying to download the data from the source site.
random_state : int, RandomState instance or None, default=None: Determines random number generation for dataset shuffling. Pass an int for reproducible output across multiple function calls. See Glossary.
shuffle : bool, default=False: Whether to shuffle the dataset.
return_X_y : bool, default=False: If True, returns (dataset.data, dataset.target) instead of a Bunch object. See below for more information about the dataset.data and dataset.target objects.

Added in version 0.20.
n_retries : int, default=3: Number of retries when HTTP errors are encountered.

Added in version 1.5.
delay : float, default=1.0: Number of seconds between retries.

Added in version 1.5.
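
The interplay of n_retries and delay can be pictured with a small retry loop. The following is a minimal sketch under stated assumptions, not scikit-learn's internal code: `fetch_with_retries` and `flaky_download` are hypothetical names invented for illustration, and the sketch assumes download failures surface as `OSError` (the base class of `urllib.error.HTTPError`).

```python
import time


def fetch_with_retries(fetch, n_retries=3, delay=1.0, sleep=time.sleep):
    """Call ``fetch()``, retrying up to ``n_retries`` times on OSError.

    Hypothetical helper for illustration only; not part of scikit-learn.
    """
    attempt = 0
    while True:
        try:
            return fetch()
        except OSError:
            if attempt >= n_retries:
                raise  # retries exhausted: surface the last error
            attempt += 1
            sleep(delay)  # wait `delay` seconds before the next attempt


# Stand-in download that fails twice, then succeeds (assumption for the demo).
calls = {"count": 0}


def flaky_download():
    calls["count"] += 1
    if calls["count"] < 3:
        raise OSError("simulated HTTP error")
    return "payload"


# delay=0.0 keeps the demo instantaneous; the documented default is 1.0.
result = fetch_with_retries(flaky_download, n_retries=3, delay=0.0)
```

Under this reading, n_retries=3 allows up to four attempts in total: the initial one plus three retries.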

Returns:

dataset : Bunch

Dictionary-like object. Returned only if return_X_y is False. dataset has the following attributes:

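data : sparse matrix of shape (804414, 47236), dtype=np.float64
The array has 0.16% of non-zero values. Will be of CSR format.
target : sparse matrix of shape (804414, 103), dtype=np.uint8
Each sample has a value of 1 in its categories, and 0 in others. The array has 3.15% of non-zero values. Will be of CSR format.
sample_id : ndarray of shape (804414,), dtype=np.uint32
Identification number of each sample, as ordered in dataset.data.
target_names : ndarray of shape (103,), dtype=object
Names of each target (RCV1 topics), as ordered in dataset.target.
DESCR : str
Description of the RCV1 dataset.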
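
Because target is a CSR matrix whose columns follow the order of target_names, the topics of one sample can be read from the CSR row pointers and column indices. The sketch below uses plain Python lists as toy stand-ins rather than real RCV1 data; `topics_for_sample` is a hypothetical helper, and the four names and three rows are chosen only for illustration.

```python
# Toy stand-ins (NOT real RCV1 data): 3 samples, 4 topics.
target_names = ["CCAT", "ECAT", "GCAT", "MCAT"]
# CSR layout: row i owns the column indices indices[indptr[i]:indptr[i + 1]].
indptr = [0, 2, 3, 5]
indices = [0, 3, 2, 1, 2]


def topics_for_sample(i, indptr, indices, names):
    """Return the topic names whose column is non-zero in row ``i``."""
    start, stop = indptr[i], indptr[i + 1]
    return [names[j] for j in indices[start:stop]]


labels = [
    topics_for_sample(i, indptr, indices, target_names)
    for i in range(len(indptr) - 1)
]
```

With the real dataset, `dataset.target.indptr` and `dataset.target.indices` (attributes of a SciPy CSR matrix) would take the place of the two toy lists.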

(data, target) : tuple

A tuple consisting of dataset.data and dataset.target, as described above. Returned only if return_X_y is True.

Added in version 0.20.
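
The shapes and non-zero percentages listed above give a rough idea of how much is actually stored. The following is a back-of-the-envelope sketch: the percentages are rounded, so the counts are approximate, and the 4-byte index width is an assumption (CSR index arrays are often int32, but this is not stated here). `approx_nnz` and `approx_csr_bytes` are hypothetical helpers.

```python
n_samples, n_features, n_classes = 804414, 47236, 103


def approx_nnz(shape, density_percent):
    """Approximate number of stored non-zero entries from a rounded density."""
    rows, cols = shape
    return round(rows * cols * density_percent / 100)


def approx_csr_bytes(nnz, n_rows, value_bytes, index_bytes=4):
    """Rough CSR footprint: one value and one column index per non-zero,
    plus n_rows + 1 row pointers. index_bytes=4 is an assumption."""
    return nnz * (value_bytes + index_bytes) + (n_rows + 1) * index_bytes


data_nnz = approx_nnz((n_samples, n_features), 0.16)
target_nnz = approx_nnz((n_samples, n_classes), 3.15)

# float64 values take 8 bytes each, uint8 values take 1 byte each.
data_bytes = approx_csr_bytes(data_nnz, n_samples, value_bytes=8)
target_bytes = approx_csr_bytes(target_nnz, n_samples, value_bytes=1)
```

Under these assumptions data holds on the order of 60 million stored values, which is why it is returned as a sparse matrix rather than a dense array.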

Examples

>>> from sklearn.datasets import fetch_rcv1
>>> rcv1 = fetch_rcv1()
>>> rcv1.data.shape
(804414, 47236)
>>> rcv1.target.shape
(804414, 103)
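
Since the training samples come first when subset='all' and shuffle is False, the LYRL2004 split can be recovered by row position. The sketch below only builds the slices and checks them against a stand-in range of row numbers, so it runs without downloading anything; the names `train_rows` and `test_rows` are illustrative.

```python
# Sample counts documented for the 'train' and 'test' subsets.
N_TRAIN = 23149
N_TEST = 781265

# Row slices into the unshuffled 'all' subset (training samples first).
train_rows = slice(0, N_TRAIN)
test_rows = slice(N_TRAIN, N_TRAIN + N_TEST)

# Stand-in for the row numbers of the full dataset (not the data itself).
row_ids = range(N_TRAIN + N_TEST)
n_train_rows = len(row_ids[train_rows])
n_test_rows = len(row_ids[test_rows])

# With the real dataset loaded as in the example above, and assuming
# subset='all' and shuffle=False, the same slices would apply:
#   X_train, y_train = rcv1.data[train_rows], rcv1.target[train_rows]
#   X_test, y_test = rcv1.data[test_rows], rcv1.target[test_rows]
```

Loading with subset='train' or subset='test' is the more direct route when only one part is needed.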
