fetch_20newsgroups_vectorized#

sklearn.datasets.fetch_20newsgroups_vectorized(*, subset='train', remove=(), data_home=None, download_if_missing=True, return_X_y=False, normalize=True, as_frame=False, n_retries=3, delay=1.0)[source]#

Load and vectorize the 20 newsgroups dataset (classification).

Download it if necessary.

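This is a convenience function; the transformation is done using the default settings for CountVectorizer. For more advanced usage (stopword filtering, n-gram extraction, etc.), combine fetch_20newsgroups with a custom CountVectorizer, HashingVectorizer, TfidfTransformer or TfidfVectorizer.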

The resulting counts are normalized using sklearn.preprocessing.normalize unless normalize is set to False.
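The effect of that normalization can be illustrated on a tiny in-memory corpus. This is a minimal sketch, assuming the default L2 norm of `sklearn.preprocessing.normalize`; the toy documents are made up and no dataset is downloaded.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import normalize

# Made-up documents standing in for newsgroup posts.
docs = [
    "the space shuttle launched today",
    "the hockey game went to overtime",
    "graphics cards render the game",
]

# Raw term counts with CountVectorizer defaults, cast to float for scaling.
counts = CountVectorizer().fit_transform(docs).astype(np.float64)

# Scale each row (document) to unit L2 norm.
X = normalize(counts)

# Every row should now have Euclidean length 1.
row_norms = np.linalg.norm(X.toarray(), axis=1)
```

Passing `normalize=False` to the loader would correspond to stopping at `counts` in this sketch.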

Classes	20
Samples total	18846
Dimensionality	130107
Features	real

Read more in the User Guide.

Parameters:

subset : {'train', 'test', 'all'}, default='train'

Select the dataset to load: 'train' for the training set, 'test' for the test set, 'all' for both, with shuffled ordering.

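remove : tuple, default=()

May contain any subset of ('headers', 'footers', 'quotes'). Each of these is a kind of text that will be detected and removed from the newsgroup posts, preventing classifiers from overfitting on metadata.

'headers' removes newsgroup headers, 'footers' removes blocks at the ends of posts that look like signatures, and 'quotes' removes lines that appear to be quoting another post.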
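To make the three categories concrete, here is a simplified illustration of that kind of stripping on a made-up post. It is not scikit-learn's implementation: the functions `strip_header`, `strip_quotes` and `strip_footer` and the `QUOTE_LINE` pattern are hypothetical, and the library's heuristics may differ.

```python
import re

# Hypothetical pattern: quoted lines and "someone wrote:" attribution lines.
QUOTE_LINE = re.compile(r"^(>|\|)|wrote:$|writes:$")


def strip_header(text):
    """Drop everything up to the first blank line (the header block)."""
    _header, _sep, body = text.partition("\n\n")
    return body


def strip_quotes(text):
    """Drop lines that look like they quote another post."""
    kept = [line for line in text.split("\n") if not QUOTE_LINE.search(line)]
    return "\n".join(kept)


def strip_footer(text):
    """Drop a trailing block that follows a blank or dashes-only line."""
    lines = text.strip().split("\n")
    for index in range(len(lines) - 1, 0, -1):
        if lines[index].strip().strip("-") == "":
            return "\n".join(lines[:index])
    return text


post = (
    "From: alice@example.com\n"
    "Subject: Re: orbit question\n"
    "\n"
    "bob@example.com wrote:\n"
    "> How high is low Earth orbit?\n"
    "Roughly 160 to 2000 km.\n"
    "\n"
    "--\n"
    "Alice\n"
)

cleaned = strip_footer(strip_quotes(strip_header(post))).strip()
```

Only the line the author actually wrote survives, which is the text a classifier should learn from.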

data_home : str or path-like, default=None

Specify a download and cache folder for the datasets. If None, all scikit-learn data is stored in '~/scikit_learn_data' subfolders.

download_if_missing : bool, default=True

If False, raise an OSError if the data is not locally available instead of trying to download the data from the source site.

return_X_y : bool, default=False

If True, returns (data.data, data.target) instead of a Bunch object.

Added in version 0.20.

normalize : bool, default=True

If True, normalizes each document's feature vector to unit norm using sklearn.preprocessing.normalize.

Added in version 0.22.

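as_frame : bool, default=False

If True, the data is a pandas DataFrame including columns with appropriate dtypes (numeric, string, or categorical). The target is a pandas DataFrame or Series depending on the number of target_columns.

Added in version 0.24.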

n_retries : int, default=3

Number of retries when HTTP errors are encountered.

Added in version 1.5.

delay : float, default=1.0

Number of seconds between retries.

Added in version 1.5.
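The interplay of `n_retries` and `delay` can be pictured with a generic retry loop. This is a sketch under the assumption that retries are counted after the first failed attempt; `call_with_retries` and `flaky_download` are hypothetical names, and the loader's actual download code may behave differently.

```python
import time


def call_with_retries(action, n_retries=3, delay=1.0):
    """Call action(); on OSError, sleep `delay` seconds and retry up to n_retries times."""
    while True:
        try:
            return action()
        except OSError:
            if n_retries == 0:
                raise  # retries exhausted: surface the last error
            n_retries -= 1
            time.sleep(delay)


attempts = []


def flaky_download():
    """Simulated download that fails twice before succeeding."""
    attempts.append(1)
    if len(attempts) < 3:
        raise OSError("simulated HTTP error")
    return "ok"


# delay=0.0 keeps the demonstration instant.
result = call_with_retries(flaky_download, n_retries=3, delay=0.0)
```

Under this reading, `n_retries=3` allows up to four attempts in total, with a pause of `delay` seconds before each retry.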

Returns:

bunch : Bunch

Dictionary-like object, with the following attributes.

data: {sparse matrix, dataframe} of shape (n_samples, n_features): The input data matrix. If as_frame is True, data is a pandas DataFrame with sparse columns.
target: {ndarray, series} of shape (n_samples,): The target labels. If as_frame is True, target is a pandas Series.
target_names: list of shape (n_classes,): The names of target classes.
DESCR: str: The full description of the dataset.
frame: dataframe of shape (n_samples, n_features + 1): Only present when as_frame=True. Pandas DataFrame with data and target.

Added in version 0.24.

(data, target) : tuple if return_X_y is True

data and target are in the format defined in the Bunch description above.

Added in version 0.20.

Examples

>>> from sklearn.datasets import fetch_20newsgroups_vectorized
>>> newsgroups_vectorized = fetch_20newsgroups_vectorized(subset='test')
>>> newsgroups_vectorized.data.shape
(7532, 130107)
>>> newsgroups_vectorized.target.shape
(7532,)

Gallery examples#

Model Complexity Influence

Multiclass sparse logistic regression on 20newgroups

The Johnson-Lindenstrauss bound for embedding with random projections
