【NLP】【CS224N】2.4 word2vec

NLTK

自带语料库，词性分类库
自带分类，分词
社区支持
简单版wrapper

文本处理流程

预处理preprocess - 分词tokenize - … - 特征工程 make feature - ML labels/target

分词 tokenize

nltk.word_tokenize(sentence)

jieba
jieba.cut(sentence, cut_all=True) 全模式
jieba.cut(sentence, cut_all=False) 精确模式
jieba.cut(sentence)默认是精确模式
jieba.cut_for_search(sentence) 搜索引擎模式

stemming
lemma
POS tag
stopword 停止词
from nltk.corpus import stopwords
filtered_words = [word for word in word_list if word not in stopwords.words(‘english’)]

⼀条typical的⽂本预处理流⽔线

Raw_text - tokenize (- POS tag )- lemma/stemming - stopword - word_list

NLTK在NLP上的经典应⽤

情感分析
文本相似度
文本分类

情感分析

最简单的 sentiment dictionary
like 1
good 2
bad -2
terrible -3
类似于关键词打分机制
⽐如： AFINN-111
http://www2.imm.dtu.dk/pubdb/views/publication_details.php?id=6010

sentiment_dictionary = {}  
for line in open('data/AFINN-111.txt')
word, score = line.split('\t')
sentiment_dictionary[word] = int(score)
# 把这个打分表记录在⼀个Dict上以后
# 跑⼀遍整个句⼦，把对应的值相加
total_score = sum(sentiment_dictionary.get(word, 0) for word in words)
# 有值就是Dict中的值，没有就是0
# 于是你就得到了⼀个 sentiment score

from nltk.classify import NaiveBayesClassifier
# 随⼿造点训练集
s1 = 'this is a good book'
s2 = 'this is a awesome book'
s3 = 'this is a bad book'
s4 = 'this is a terrible book'
def preprocess(s):
# Func: 句⼦处理
# 这⾥简单的⽤了split(), 把句⼦中每个单词分开
# 显然 还有更多的processing method可以⽤
    return {word: True for word in s.lower().split()}
# return⻓这样:
# {'this': True, 'is':True, 'a':True, 'good':True, 'book':True}
# 其中, 前⼀个叫fname, 对应每个出现的⽂本单词;
# 后⼀个叫fval, 指的是每个⽂本单词对应的值。
# 这⾥我们⽤最简单的True,来表示,这个词『出现在当前的句⼦中』的意义。
# 当然啦, 我们以后可以升级这个⽅程, 让它带有更加⽜逼的fval, ⽐如 word2vec
# 把训练集给做成标准形式
training_data = [[preprocess(s1), 'pos'],
[preprocess(s2), 'pos'],
[preprocess(s3), 'neg'],
[preprocess(s4), 'neg']]
# 喂给model吃
model = NaiveBayesClassifier.train(training_data)
# 打出结果
print(model.classify(preprocess('this is a good book')))

文本相似度

Frequency 频率统计

import nltk
from nltk import FreqDist
# 做个词库先
corpus = 'this is my sentence ' \
'this is my life ' \
'this is the day'
# 随便tokenize⼀下
# 显然, 正如上⽂提到,
# 这⾥可以根据需要做任何的preprocessing:
# stopwords, lemma, stemming, etc.
tokens = nltk.word_tokenize(corpus)
print(tokens)
# 得到token好的word list
# ['this', 'is', 'my', 'sentence',
# 'this', 'is', 'my', 'life', 'this',
# 'is', 'the', 'day']
# 借⽤NLTK的FreqDist统计⼀下⽂字出现的频率
fdist = FreqDist(tokens)
# 它就类似于⼀个Dict
# 带上某个单词, 可以看到它在整个⽂章中出现的次数
print(fdist['is'])
# 3
# 好, 此刻, 我们可以把最常⽤的50个单词拿出来
standard_freq_vector = fdist.most_common(50)
size = len(standard_freq_vector)
print(standard_freq_vector)
# [('is', 3), ('this', 3), ('my', 2),
# ('the', 1), ('day', 1), ('sentence', 1),
# ('life', 1)
# Func: 按照出现频率⼤⼩, 记录下每⼀个单词的位置
def position_lookup(v):
res = {}
counter = 0
for word in v:
res[word[0]] = counter
counter += 1
return res
# 把标准的单词位置记录下来
standard_position_dict = position_lookup(standard_freq_vector)
print(standard_position_dict)
# 得到⼀个位置对照表
# {'is': 0, 'the': 3, 'day': 4, 'this': 1,
# 'sentence': 5, 'my': 2, 'life': 6}
# 这时, 如果我们有个新句⼦:
sentence = 'this is cool'
# 先新建⼀个跟我们的标准vector同样⼤⼩的向量
freq_vector = [0] * size
# 简单的Preprocessing
tokens = nltk.word_tokenize(sentence)
# 对于这个新句⼦⾥的每⼀个单词
for word in tokens:
try:
# 如果在我们的词库⾥出现过
# 那么就在"标准位置"上+1
freq_vector[standard_position_dict[word]] += 1
except KeyError:
# 如果是个新词
# 就pass掉
continue
print(freq_vector)
# [1, 1, 0, 0, 0, 0, 0]
# 第⼀个位置代表 is, 出现了⼀次
# 第⼆个位置代表 this, 出现了⼀次
# 后⾯都⽊有

文本分类

TF-IDF

from nltk.text import TextCollection
# ⾸先, 把所有的⽂档放到TextCollection类中。
# 这个类会⾃动帮你断句, 做统计, 做计算
corpus = TextCollection(['this is sentence one',
'this is sentence two',
'this is sentence three'])
# 直接就能算出tfidf
# (term: ⼀句话中的某个term, text: 这句话)
print(corpus.tf_idf('this', 'this is sentence four'))
# 0.444342
# 同理, 怎么得到⼀个标准⼤⼩的vector来表示所有的句⼦?
# 对于每个新句⼦
new_sentence = 'this is sentence five'
# 遍历⼀遍所有的vocabulary中的词:
for word in standard_vocab:
print(corpus.tf_idf(word, new_sentence))
# 我们会得到⼀个巨⻓(=所有vocab⻓度)的向量

可能的ML模型:

SVM
LR
RF
MLP
LSTM
RNN
….

自动摘要
指代消解
机器翻译
词性标注
分词
主题识别
文本分类
…

NLP处理方法

传统：基于规则
现代：基于统计机器学习
HMM, CRF, SVM, LDA, CNN …
规则隐含在模型参数里

word encoding 词编码需要保证词的相似性

向量空间子结构
Vking - Vqueen + Vwomen = Vman
Vparis - Vfrance + Vgerman = Vberlin
最终目标：词向量表示作为机器学习、特别是深度学习的输入和表示空间

离散表示：one-hot bag of words bigram和N-gram

one-hot

语料库 jon liskes to watch moving. mary likes too. 词典 {“john”:1,”like”:2,…} one-hot表示 [0,1,0,0,0,0,0,0,0,0,0]

文档的向量表示可以直接将各词的词向量表示加和

john likes to watch movies. mary likes too. [1,2,1,1,1,0,0,0,1,1]

词权重

TF-IDF(term frequency-inverse document frequency) 常用于信息检索
binary weighting 常用于短文本相似性 bernoulli naive bayes

bigram N-gram

优点：考虑词的顺序
缺点：词表的膨胀

语言模型

离散表示的问题

无法衡量词向量之间的关系
词表的维度随着语料库增长膨胀
n-gram词序列随语料库膨胀更快
数据稀疏问题

分布式表达 distributed representation

车 = 颜色 * 型号 * 车型

用一个词附近的词来表示该词。

you shall know a word by the company it keeps. –J.R.Firth 1957:11

共现矩阵(cocurrence matrix)

word document的共现矩阵的主要作用是发现主题(topic)，用于主题模型，如LSA(Latent Aemantic Analysis)

将共现矩阵行(列)作为词向量存在的问题：

向量维数随着词典大小线性增长 –降维 SVD降维
存储整个词典的空间消耗非常大
一些模型如文本分类模型会面临稀疏性问题
模型会欠稳定

SVD降维的问题

计算量随语料库和词典增长膨胀太快，对X(n,n)维的矩阵，计算量O(n^3)。而对于大型语料库n~400k，语料库大小1~60B token。
难以为词典中新加入的词分配词向量
与其他深度学习模型框架差异大

NNLM(neural network language model) A Neural Probabilistic Language Model

直接从语言模型出发，将模型最优化过程转化为求磁向量表示的过程。

目标函数： $L(\theta) = \sum_{t}{logP(W_t|W_{t-n+1}, …W_{t-1})}
使用了非对称的前向窗口函数，窗长度为n-1
滑动窗口遍历整个语料库求和，计算量正比于语料库大小
概率P满足归一化条件，这样不同位置t处的概率才能相加，即

$\sum_{w-{vocabulary}}{P(W|W_{t-n+1}, …,W_{t-1})} = 1

CBOW & Skip-gram

word2vec存在的问题

对于每个local context windows 单独训练，没有利用包含在global co-currence 矩阵中的统计信息。 GloVe
对于多一次无法很好的处理

词嵌入可视化

词向
比较级和最高级

词嵌入效果评估

词类比任务
词相似度任务

GloVe 和 Word2Vec

GloVe随着迭代次数增加，精度提升
Word2Vec未使用类似迭代次数的Epoch，用Negative Samples模拟

总结

离散表示
- One-hot representation, Bag of Words Unigram 语言模型
- N-gram词向量表示和语言模型
- co-currence矩阵的行(列)向量作为词向量
分布式连续表示
- co-currence矩阵的SVD降维的低维词向量表示
- Word2Vec:Continue Bag of Words Model
- Word2Vec:Skip-gram Model

https://code.google.com/archive/p/word2vec/ https://github.com/dav/word2vec

git clone https://github.com/dav/word2vec
cd word2vec/src
Make
./demo-word.sh
./demo-phrases.sh

python: gensim