7. Dataset loading utilities

2024-03-06 16:13| 来源: 网络整理| 查看: 265

7. Dataset loading utilities¶

The sklearn.datasets package embeds some small toy datasets as introduced in the Getting Started section.

This package also features helpers to fetch larger datasets commonly used by the machine learning community to benchmark algorithms on data that comes from the ‘real world’.

To evaluate the impact of the scale of the dataset (n_samples and n_features) while controlling the statistical properties of the data (typically the correlation and informativeness of the features), it is also possible to generate synthetic data.

General dataset API. There are three main kinds of dataset interfaces that can be used to get datasets depending on the desired type of dataset.

The dataset loaders. They can be used to load small standard datasets, described in the Toy datasets section.

The dataset fetchers. They can be used to download and load larger datasets, described in the Real world datasets section.

Both loaders and fetchers functions return a Bunch object holding at least two items: an array of shape n_samples * n_features with key data (except for 20newsgroups) and a numpy array of length n_samples, containing the target values, with key target.

The Bunch object is a dictionary that exposes its keys as attributes. For more information about Bunch object, see Bunch.

It’s also possible for almost all of these function to constrain the output to be a tuple containing only the data and the target, by setting the return_X_y parameter to True.

The datasets also contain a full description in their DESCR attribute and some contain feature_names and target_names. See the dataset descriptions below for details.

The dataset generation functions. They can be used to generate controlled synthetic datasets, described in the Generated datasets section.

These functions return a tuple (X, y) consisting of a n_samples * n_features numpy array X and an array of length n_samples containing the targets y.

In addition, there are also miscellaneous tools to load datasets of other formats or from other locations, described in the Loading other datasets section.

7.1. Toy datasets 7.1.1. Iris plants dataset 7.1.2. Diabetes dataset 7.1.3. Optical recognition of handwritten digits dataset 7.1.4. Linnerrud dataset 7.1.5. Wine recognition dataset 7.1.6. Breast cancer wisconsin (diagnostic) dataset 7.2. Real world datasets 7.2.1. The Olivetti faces dataset 7.2.2. The 20 newsgroups text dataset 7.2.3. The Labeled Faces in the Wild face recognition dataset 7.2.4. Forest covertypes 7.2.5. RCV1 dataset 7.2.6. Kddcup 99 dataset 7.2.7. California Housing dataset 7.2.8. Species distribution dataset 7.3. Generated datasets 7.3.1. Generators for classification and clustering 7.3.2. Generators for regression 7.3.3. Generators for manifold learning 7.3.4. Generators for decomposition 7.4. Loading other datasets 7.4.1. Sample images 7.4.2. Datasets in svmlight / libsvm format 7.4.3. Downloading datasets from the openml.org repository 7.4.4. Loading from external datasets

【本文地址】

公司简介

联系我们