CountVectorizer Explained

Hao Wei, 2021/06/07

[TOC]

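1. Introduction

The sklearn library has a very commonly used class, CountVectorizer, which turns text into vectors of word counts. The name can be rendered as "count vector" or "term-frequency vector": it counts how many times each word occurs in each text and produces a feature matrix. Specifically, it first learns the words in the training texts (ignoring their order), treats each distinct word as its own feature column, and collects these words into a vocabulary. For this reason the method is also called Bag of Words. This article introduces how it works and how to use it.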
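To make the "ignoring order" part of the bag-of-words idea concrete, here is a minimal standard-library sketch. It is not how CountVectorizer is implemented; the helper name bag_of_words and the two sample sentences are assumptions made up for illustration, and the text is split on whitespace only.

```python
from collections import Counter

def bag_of_words(text):
    """Hypothetical helper: count each whitespace-separated word, ignoring order."""
    return Counter(text.lower().split())

# Two sentences with the same words in a different order give the same bag.
bag_a = bag_of_words("the cat chased the dog")
bag_b = bag_of_words("the dog chased the cat")

print(bag_a["the"])    # 2
print(bag_a == bag_b)  # True
```

Because only the counts are kept, the two sentences become indistinguishable, which is exactly the information a bag-of-words model gives up.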

The full constructor signature of CountVectorizer is shown below:

class sklearn.feature_extraction.text.CountVectorizer(
    input='content',
    encoding='utf-8',
    decode_error='strict',
    strip_accents=None,
    lowercase=True,
    preprocessor=None,
    tokenizer=None,
    stop_words=None,
    token_pattern=r'(?u)\b\w\w+\b',
    ngram_range=(1, 1),
    analyzer='word',
    max_df=1.0,
    min_df=1,
    max_features=None,
    vocabulary=None,
    binary=False,
    dtype=<class 'numpy.int64'>)
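Of these parameters, token_pattern and lowercase have the most visible effect on the examples in this article. The sketch below applies the default pattern with Python's re module to show which tokens survive. The helper name tokenize is an assumption for illustration, and this only approximates the preprocessing that CountVectorizer performs.

```python
import re

# The default token_pattern from the signature, written as a raw string:
# a token is a run of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def tokenize(text, lowercase=True):
    """Hypothetical helper: optionally lowercase the text, then extract tokens."""
    if lowercase:
        text = text.lower()
    return TOKEN_PATTERN.findall(text)

tokens = tokenize("What is that? It is a pen.")
print(tokens)  # ['what', 'is', 'that', 'it', 'is', 'pen']
```

Note that punctuation disappears and the single-character word "a" is dropped, because the pattern requires at least two word characters.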

2. Starting with an example

Let us first look at an example to see what it actually does.

from sklearn.feature_extraction.text import CountVectorizer

texts = ["This is a book.",
         "That is a pen.",
         "What is that? It is a pen.",
         "这是一本书。"]

# Create the bag-of-words vectorizer
cv = CountVectorizer()

# Fit on texts
cv.fit(texts)

# Print all feature names: ['book', 'is', 'it', 'pen', 'that', 'this', 'what', '这是一本书']
print(cv.get_feature_names())  # in scikit-learn >= 1.0, use cv.get_feature_names_out()

# Print the vocabulary dictionary (key order may differ)
# {'this': 5, 'is': 1, 'book': 0, 'what': 6, 'it': 2, 'pen': 3, 'that': 4, '这是一本书': 7}
print(cv.vocabulary_)

# Use the fitted cv to convert texts into count vectors
text_matrix = cv.transform(texts)

# Print the sparse matrix text_matrix
print(text_matrix)

# Print the dense two-dimensional matrix
print(text_matrix.toarray())

# cv.fit_transform(texts) is equivalent to cv.fit(texts) followed by cv.transform(texts)
# cv_fit = cv.fit_transform(texts)

The output is as follows:

['book', 'is', 'it', 'pen', 'that', 'this', 'what', '这是一本书']
{'this': 5, 'is': 1, 'book': 0, 'what': 6, 'it': 2, 'pen': 3, 'that': 4, '这是一本书': 7}
  (0, 0)        1
  (0, 1)        1
  (0, 5)        1
  (1, 1)        1
  (1, 3)        1
  (1, 4)        1
  (2, 1)        2
  (2, 2)        1
  (2, 3)        1
  (2, 4)        1
  (2, 6)        1
  (3, 7)        1
[[1 1 0 0 0 1 0 0]
 [0 1 0 1 1 0 0 0]
 [0 2 1 1 1 0 1 0]
 [0 0 0 0 0 0 0 1]]

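3. Code analysis

The goal of this analysis is to understand exactly how CountVectorizer works.

3.1. Initialization

First, line 9 of the example creates an instance of CountVectorizer named cv. This is no different from instantiating any other class in object-oriented code.

3.2. Fitting

Line 12 performs the fit. Fitting scans all of the training texts, splits them into words, and records every distinct word. Once it finishes, cv holds the feature information of the training texts: the list of all words and the vocabulary dictionary.

The list of all words can be viewed with cv.get_feature_names(), see line 15 (scikit-learn 1.0 deprecated this method in favor of cv.get_feature_names_out(), and 1.2 removed it);
The vocabulary dictionary can be viewed with cv.vocabulary_, see line 19.

In other words, fitting learns the features of the training texts, so once cv.fit(texts) completes, cv already contains this feature information.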
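The fitting step can be approximated in a few lines of plain Python. This is only a sketch under stated assumptions: fit_vocabulary is a hypothetical stand-in rather than scikit-learn's code, it uses the default token_pattern and lowercasing, and it assigns column indices in sorted order, which reproduces the indices shown in this article.

```python
import re

TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def fit_vocabulary(texts):
    """Hypothetical stand-in for fit: map each distinct token to a column index."""
    tokens = set()
    for text in texts:
        tokens.update(TOKEN_PATTERN.findall(text.lower()))
    # Sorting the tokens gives every word a stable column index.
    return {token: index for index, token in enumerate(sorted(tokens))}

texts = ["This is a book.",
         "That is a pen.",
         "What is that? It is a pen.",
         "这是一本书。"]

vocabulary = fit_vocabulary(texts)
print(vocabulary)
```

The resulting dictionary plays the role of cv.vocabulary_, and its sorted keys play the role of the feature-name list.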

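3.3. Transformation

The purpose of fitting cv is to convert texts into count vectors, which happens on line 22. The result is a two-dimensional matrix with one row vector per text. Note that when the corpus is large, this matrix is almost always very sparse (think about why). Storing it as a dense two-dimensional array would take $O(n \times m)$ memory for $n$ texts and $m$ vocabulary words, so the result is instead represented as a sparse matrix (SciPy's compressed sparse row format), which stores only the nonzero entries. Printed, it looks like this: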

  (0, 0)        1
  (0, 1)        1
  (0, 5)        1
  (1, 1)        1
  (1, 3)        1
  (1, 4)        1
  (2, 1)        2
  (2, 2)        1
  (2, 3)        1
  (2, 4)        1
  (2, 6)        1
  (3, 7)        1

After line 28 calls text_matrix.toarray(), we get the dense two-dimensional matrix:

[[1 1 0 0 0 1 0 0]
 [0 1 0 1 1 0 0 0]
 [0 2 1 1 1 0 1 0]
 [0 0 0 0 0 0 0 1]]
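The relationship between the two printouts can be sketched without any library. The dictionary below copies the (row, column) entries from the sparse listing, and to_dense is a hypothetical helper that does conceptually what toarray() does. It is an illustration only, not SciPy's implementation.

```python
# (row, column) -> count, copied from the sparse output shown earlier.
entries = {
    (0, 0): 1, (0, 1): 1, (0, 5): 1,
    (1, 1): 1, (1, 3): 1, (1, 4): 1,
    (2, 1): 2, (2, 2): 1, (2, 3): 1, (2, 4): 1, (2, 6): 1,
    (3, 7): 1,
}

def to_dense(entries, n_rows, n_cols):
    """Hypothetical helper: expand sparse entries into a list of lists."""
    dense = [[0] * n_cols for _ in range(n_rows)]
    for (row, col), value in entries.items():
        dense[row][col] = value
    return dense

dense = to_dense(entries, n_rows=4, n_cols=8)
for row in dense:
    print(row)

# Only the nonzero cells need to be stored in the sparse form.
print(len(entries), "stored values instead of", 4 * 8)
```

Even in this tiny example only 12 of the 32 cells are nonzero, and the gap widens quickly as the vocabulary grows.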

Compare this with the vocabulary dictionary:

{
  'book':      0, 
  'is':        1, 
  'it':        2, 
  'pen':       3, 
  'that':      4, 
  'this':      5, 
  'what':      6, 
  '这是一本书': 7
}

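Each word has a unique index, and there are 8 words in total. Each text can therefore be represented by a vector of length 8:

[0 0 0 0 0 0 0 0]

Each position corresponds, in order, to one of the words below.

['book', 'is', 'it', 'pen', 'that', 'this', 'what', '这是一本书']

For example, in the first sentence, This is a book., the word This sets position 5 to 1, is sets position 1 to 1, and book sets position 0 to 1 (a is not in the vocabulary), so the resulting vector is [1 1 0 0 0 1 0 0].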

The remaining sentences work the same way, which gives

[[1 1 0 0 0 1 0 0]  # This5 is1 a book0.
 [0 1 0 1 1 0 0 0]  # That4 is1 a pen3
 [0 2 1 1 1 0 1 0]  # What6 is1 that4? It2 is1 a pen3.
 [0 0 0 0 0 0 0 1]] # 这是一本书7

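The number after each word above is the position whose value is incremented by 1. Note also that the matrix contains a 2: in the sentence What is that? It is a pen., the word is occurs twice.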

4. Transforming other data

A fitted cv can also convert other texts into vectors, as in the following example.

print('transform demo'.center(60, "*"))
s = ['What is a pencil?']
m1 = cv.transform(s)
print(m1.toarray())

The output is:

***********************transform demo***********************
[[0 1 0 0 0 0 1 0]]

The one thing to note is that words absent from cv's vocabulary are ignored and do not appear in the result, such as pencil.
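The way unknown words are skipped can be sketched in plain Python. The vocabulary below is the one learned from the four training sentences, and transform_one is a hypothetical stand-in for calling transform on a single text, not scikit-learn's implementation.

```python
import re

TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

# Vocabulary learned from the four training sentences in this article.
vocabulary = {"book": 0, "is": 1, "it": 2, "pen": 3,
              "that": 4, "this": 5, "what": 6, "这是一本书": 7}

def transform_one(text, vocabulary):
    """Hypothetical stand-in for transform on a single text."""
    vector = [0] * len(vocabulary)
    for token in TOKEN_PATTERN.findall(text.lower()):
        index = vocabulary.get(token)
        if index is not None:  # out-of-vocabulary words are skipped
            vector[index] += 1
    return vector

vector = transform_one("What is a pencil?", vocabulary)
print(vector)  # [0, 1, 0, 0, 0, 0, 1, 0]
```

Because the vector length is fixed by the vocabulary learned during fitting, a new word such as pencil has no column to land in and simply leaves no trace.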

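5. Other conclusions

Punctuation is treated as a separator, and with the default token_pattern single-character tokens such as a are dropped, which is why a is missing from the vocabulary;
If a word occurs several times in one text, the occurrences are added up, as with is in the third sentence;
CountVectorizer does not perform Chinese word segmentation, so 这是一本书 is kept as a single token;
Letter case is ignored by default; pass lowercase=False to make it case-sensitive.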
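The last two points can be checked directly against the default token pattern. The sketch below uses only Python's re module. The character-level split is a crude assumption for illustration and is not real word segmentation; in practice a dedicated Chinese segmenter would typically be supplied through the tokenizer parameter listed in the constructor.

```python
import re

TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

# Chinese text has no spaces between words, so the whole sentence is one token.
chinese_tokens = TOKEN_PATTERN.findall("这是一本书。")
print(chinese_tokens)  # ['这是一本书']

# Crude illustrative alternative: one token per character, punctuation removed.
char_tokens = [ch for ch in "这是一本书。" if ch.isalnum()]
print(char_tokens)  # ['这', '是', '一', '本', '书']

# Without lowercasing, "This" and "this" are different tokens.
cased = TOKEN_PATTERN.findall("This is this")
lowered = TOKEN_PATTERN.findall("This is this".lower())
print(len(set(cased)), len(set(lowered)))  # 3 2
```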

As an aside, because the bag-of-words approach ignores word order, another popular word-vector method also came into use: one-hot encoding.

