A Brief Guide to the sklearn.datasets.make_classification Function
郝伟 2021/08/08

Introduction

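sklearn.datasets.make_classification generates a random n-class classification problem. It first creates clusters of points normally distributed (std=1) about the vertices of an n_informative-dimensional hypercube with sides of length 2*class_sep, and assigns an equal number of clusters to each class. It then introduces interdependence between these features and adds various types of further noise to the data.

Without shuffling, X horizontally stacks the features in the following order: the primary n_informative features, followed by n_redundant linear combinations of the informative features, followed by n_repeated duplicates drawn randomly with replacement from the informative and redundant features. The remaining features are filled with random noise. Thus, without shuffling, all useful features are contained in the columns X[:, :n_informative + n_redundant + n_repeated].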
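
The column layout with shuffle=False can be checked directly. The following is a minimal sketch, assuming scikit-learn and NumPy are installed. The sizes (3 informative, 2 redundant, 1 repeated, 10 features in total) and all variable names are illustrative choices, not taken from the text above. It also relies on the default shift=0.0 and scale=1.0, so the columns are not altered after generation.

```python
import numpy as np
from sklearn.datasets import make_classification

# Illustrative sizes (assumptions for this sketch).
n_informative, n_redundant, n_repeated = 3, 2, 1

# shuffle=False keeps the columns in generation order:
# informative | redundant | repeated | noise
X, y = make_classification(
    n_samples=200,
    n_features=10,
    n_informative=n_informative,
    n_redundant=n_redundant,
    n_repeated=n_repeated,
    shuffle=False,
    random_state=0,
)

n_useful = n_informative + n_redundant + n_repeated
informative = X[:, :n_informative]
redundant = X[:, n_informative:n_informative + n_redundant]
repeated = X[:, n_informative + n_redundant:n_useful]

# Redundant columns should be linear combinations of the informative ones:
# fit the mixing matrix by least squares and check the reconstruction.
B, *_ = np.linalg.lstsq(informative, redundant, rcond=None)
redundant_is_linear = np.allclose(informative @ B, redundant)

# The repeated column should be an exact copy of one informative or redundant column.
repeated_is_copy = any(
    np.array_equal(repeated[:, 0], X[:, j])
    for j in range(n_informative + n_redundant)
)

print(X.shape, y.shape)
print(redundant_is_linear, repeated_is_copy)
```

With the default shuffle=True the same columns exist, but their positions are permuted, so slicing by position as above would no longer pick out the useful features.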

Interface

Parameters

The function takes many parameters. The full list is as follows:

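n_samples, type=int, optional (default=100)
The number of samples.
n_features, type=int, optional (default=20)
The total number of features. These comprise n_informative informative features, n_redundant redundant features, n_repeated repeated features, and n_features-n_informative-n_redundant-n_repeated useless features drawn at random.
n_informative, type=int, optional (default=2)
The number of informative features. Each class is composed of several Gaussian clusters, each located around a vertex of a hypercube in a subspace of dimension n_informative. For each cluster, the informative features are drawn independently from N(0, 1) and then randomly linearly combined within the cluster to add covariance. The clusters are then placed on the vertices of the hypercube.
n_redundant, type=int, optional (default=2)
The number of redundant features. These are generated as random linear combinations of the informative features.
n_repeated, type=int, optional (default=0)
The number of duplicated features, drawn randomly from the informative and redundant features.
n_classes, type=int, optional (default=2)
The number of classes (or labels) of the classification problem.
n_clusters_per_class, type=int, optional (default=2)
The number of clusters per class.
weights, type=list of floats or None (default=None)
The proportions of samples assigned to each class. If None, the classes are balanced. Note: if len(weights) == n_classes - 1, the weight of the last class is inferred automatically. If the weights sum to more than 1, more than n_samples samples may be returned.
flip_y, type=float, optional (default=0.01)
The fraction of samples whose class label is reassigned at random. Larger values introduce noise in the labels and make the classification task harder.
class_sep, type=float, optional (default=1.0)
The factor multiplying the hypercube size. Larger values spread out the clusters/classes and make the classification task easier.
hypercube, type=boolean, optional (default=True)
If True, the clusters are placed on the vertices of a hypercube. If False, the clusters are placed on the vertices of a random polytope.
shift, type=float, array of shape [n_features] or None, optional (default=0.0)
Shift the features by the specified value. If None, the features are shifted by a random value drawn in [-class_sep, class_sep].
scale, type=float, array of shape [n_features] or None, optional (default=1.0)
Multiply the features by the specified value. If None, the features are scaled by a random value drawn in [1, 100]. Note that scaling happens after shifting.
shuffle, type=boolean, optional (default=True)
Shuffle the samples and the features.
random_state, type=int, RandomState instance or None (default=None)
Determines the random number generation used to create the dataset; see the scikit-learn documentation for details. The default None is usually fine (the global numpy.random state is used). If a value is given (0 and 42 are popular choices), the generated data are identical on every call, which is useful for testing.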
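
Two of the parameters above, random_state and weights, are easy to check by experiment. The following is a minimal sketch, assuming scikit-learn and NumPy are installed. The sample size, the seed 42, and the 0.9 weight are arbitrary illustrative values.

```python
import numpy as np
from sklearn.datasets import make_classification

# Same random_state -> identical data on every call.
X1, y1 = make_classification(n_samples=1000, random_state=42)
X2, y2 = make_classification(n_samples=1000, random_state=42)
same_data = np.array_equal(X1, X2) and np.array_equal(y1, y2)

# weights=[0.9] gives only the first class weight; the second (about 0.1)
# is inferred because len(weights) == n_classes - 1.
# flip_y=0.0 disables label noise so the class counts follow the weights closely.
X3, y3 = make_classification(
    n_samples=1000,
    weights=[0.9],
    flip_y=0.0,
    random_state=42,
)
class_counts = np.bincount(y3)

print(same_data)
print(class_counts)  # roughly a 90% / 10% split
```

The split is only approximately 900 to 100, because the per-cluster sample counts are rounded internally. With the default flip_y=0.01 the proportions drift slightly further from the requested weights.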

Practical Tips

Although there are many parameters, in practice only the following are used regularly:

n_samples
n_features
n_informative
n_redundant
n_repeated
n_classes
n_clusters_per_class
random_state
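
These parameters are not independent of each other. The following is a minimal sketch of a typical call, assuming scikit-learn and NumPy are installed; the specific sizes are illustrative. It spells out the two constraints that make_classification enforces: the informative, redundant and repeated features must fit within n_features, and n_classes * n_clusters_per_class must not exceed 2**n_informative, since each cluster needs its own hypercube vertex.

```python
import numpy as np
from sklearn.datasets import make_classification

# Illustrative sizes (assumptions for this sketch).
n_features, n_informative, n_redundant, n_repeated = 12, 4, 2, 1
n_classes, n_clusters_per_class = 3, 2

# Feature counts must fit inside n_features; whatever is left becomes noise.
assert n_informative + n_redundant + n_repeated <= n_features
# Each cluster needs its own vertex, and there are 2**n_informative vertices.
assert n_classes * n_clusters_per_class <= 2 ** n_informative

X, y = make_classification(
    n_samples=500,
    n_features=n_features,
    n_informative=n_informative,
    n_redundant=n_redundant,
    n_repeated=n_repeated,
    n_classes=n_classes,
    n_clusters_per_class=n_clusters_per_class,
    random_state=0,
)
labels = np.unique(y).tolist()
print(X.shape)  # (500, 12)
print(labels)

# Violating the cluster constraint (3 * 2 = 6 clusters > 2**2 = 4 vertices)
# is rejected with a ValueError.
raised = False
try:
    make_classification(n_informative=2, n_classes=3, n_clusters_per_class=2)
except ValueError as err:
    raised = True
    print("rejected:", err)
```

This is why the plots with only one or two informative features usually set n_clusters_per_class=1: there are too few vertices to hold more clusters.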

Example

The example code below draws six scatter plots in a single figure.

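# For reference, two similar functions are included as well:
# make_blobs: a simplified variant that generates isotropic Gaussian blobs.
# make_gaussian_quantiles: divides a single Gaussian cluster into classes by quantile.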

import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.datasets import make_blobs
from sklearn.datasets import make_gaussian_quantiles

fontSize='small'
plt.figure(figsize=(8, 8))
plt.subplots_adjust(bottom=.05, top=.9, left=.05, right=.95)

# Plot 1: one informative feature, one cluster per class
plt.subplot(321)
plt.title("One informative feature, one cluster per class", fontsize=fontSize)
xs1, ys1 = make_classification(n_features=2, n_redundant=0, n_informative=1, n_clusters_per_class=1, shuffle=False)
plt.scatter(xs1[:, 0], xs1[:, 1], marker='o', c=ys1,  s=25, edgecolor='k')
 

# Plot 2: two informative features, one cluster per class
plt.subplot(322)
plt.title("Two informative features, one cluster per class", fontsize=fontSize)
xs1, ys1 = make_classification(n_features=2, n_redundant=0, n_informative=2, n_clusters_per_class=1)
plt.scatter(xs1[:, 0], xs1[:, 1], marker='o', c=ys1, s=25, edgecolor='k')

# Plot 3: two informative features, two clusters per class
plt.subplot(323)
plt.title("Two informative features, two clusters per class", fontsize=fontSize)
xs2, ys2 = make_classification(n_features=2, n_redundant=0, n_informative=2)
plt.scatter(xs2[:, 0], xs2[:, 1], marker='o', c=ys2, s=25, edgecolor='k')

# Plot 4: multi-class, two informative features, one cluster per class
plt.subplot(324)
plt.title("Multi-class, two informative features, one cluster",fontsize=fontSize)
xs1, ys1 = make_classification(n_features=2, n_redundant=0, n_informative=2, n_clusters_per_class=1, n_classes=3)
plt.scatter(xs1[:, 0], xs1[:, 1], marker='o', c=ys1,  s=25, edgecolor='k')

# Plot 5: three blobs
plt.subplot(325)
plt.title("Three blobs", fontsize=fontSize)
xs1, ys1 = make_blobs(n_features=2, centers=3)
plt.scatter(xs1[:, 0], xs1[:, 1], marker='o', c=ys1,  s=25, edgecolor='k')

# Plot 6: a Gaussian divided into three quantiles
plt.subplot(326)
plt.title("Gaussian divided into three quantiles", fontsize=fontSize)
xs1, ys1 = make_gaussian_quantiles(n_features=2, n_classes=3)
plt.scatter(xs1[:, 0], xs1[:, 1], marker='o', c=ys1,  s=25, edgecolor='k')

plt.show()
