【NLP】【DeepMind】1 Word Vector and Lexical Semantics
Natural language text = sequences of discrete symbols (e.g. words).
Naive representation: one-hot vectors in $R^{|vocabulary|}$ (very large).
Classical IR: document and query vectors are superpositions of word vectors.
Similarly for word classification problems (e.g. Naive Bayes topic models).
Issues: sparse, orthogonal representations; semantically weak.
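To make the sparsity and orthogonality issue concrete, here is a minimal sketch (a hypothetical three-word vocabulary, not from the lecture): every pair of distinct one-hot vectors has inner product 0, so the representation carries no notion of semantic similarity, and IR-style matching only fires on exact word overlap.

```python
import numpy as np

# Hypothetical toy vocabulary; real vocabularies have 10^5 to 10^6 entries.
vocab = ["cat", "dog", "economics"]
index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """One-hot vector in R^|vocabulary|: all zeros except a 1 at the word's index."""
    v = np.zeros(len(vocab))
    v[index[word]] = 1.0
    return v

# Orthogonality: any two distinct words are equally (and maximally) dissimilar.
print(one_hot("cat") @ one_hot("dog"))        # 0.0
print(one_hot("cat") @ one_hot("economics"))  # 0.0

# Classical IR: a document vector is the superposition (sum) of its word vectors.
doc = one_hot("cat") + one_hot("dog")
query = one_hot("dog")
print(doc @ query)  # 1.0 -- the match fires only on exact word overlap
```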
We want richer representations that express semantic similarity.
Idea: produce dense vector representations based on the context/use of words.
Three main approaches:
- count-based
- predictive
- task-based
Count-based methods
- Define a basis vocabulary C of context words.
- Define a word window size w.
- Count the basis vocabulary words occurring within w words to the left or right of each instance of a target word in the corpus. (The window is meant to keep words unrelated to the target's meaning out of the counts.)
- Form a vector representation of the target word from these counts.
- Use the inner product or cosine as the similarity kernel (e.g. cosine).
Cosine has the advantage that it is a norm-invariant metric.
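A minimal sketch of the count-based recipe above, under illustrative assumptions (a toy corpus, the whole vocabulary used as the basis vocabulary C, window size w = 2, cosine as the similarity kernel):

```python
import numpy as np
from collections import Counter

# Hypothetical toy corpus; in practice this would be a large tokenized corpus.
corpus = "the cat sat on the mat the dog sat on the rug".split()
w = 2                                  # word window size
C = sorted(set(corpus))                # basis vocabulary of context words
cidx = {c: i for i, c in enumerate(C)}

def count_vector(target):
    """Count basis words occurring within w words left/right of each instance of target."""
    counts = Counter()
    for i, tok in enumerate(corpus):
        if tok == target:
            window = corpus[max(0, i - w):i] + corpus[i + 1:i + 1 + w]
            counts.update(window)
    vec = np.zeros(len(C))
    for word, n in counts.items():
        vec[cidx[word]] = n
    return vec

def cosine(u, v):
    """Norm-invariant similarity kernel."""
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

cat, dog, mat = count_vector("cat"), count_vector("dog"), count_vector("mat")
print(cosine(cat, dog))  # higher: "cat" and "dog" occur in similar contexts
print(cosine(cat, mat))  # lower
```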
Issues: not all features (context words) are equally informative.
Many normalization methods from statistics address this: TF-IDF, PMI, etc.
Some remove the need for norm-invariant similarity metrics.
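As an illustration of one such reweighting (a sketch of positive PMI, not necessarily the lecture's exact formulation): frequent but uninformative context words end up with PMI near or below zero and are clipped away.

```python
import numpy as np

def ppmi(counts, eps=1e-12):
    """Positive pointwise mutual information reweighting of a (targets x contexts) count matrix.

    PMI(t, c) = log P(t, c) / (P(t) P(c)); negative values are clipped to 0 (PPMI).
    """
    total = counts.sum()
    p_tc = counts / total
    p_t = counts.sum(axis=1, keepdims=True) / total
    p_c = counts.sum(axis=0, keepdims=True) / total
    pmi = np.log((p_tc + eps) / (p_t * p_c + eps))
    return np.maximum(pmi, 0.0)

# Hypothetical 3-target x 4-context count matrix; the last context column is a
# very frequent word (think "the") and gets down-weighted by PPMI.
counts = np.array([[10., 2., 0., 30.],
                   [ 8., 1., 1., 25.],
                   [ 0., 0., 9., 28.]])
print(ppmi(counts))
```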
Neural embedding models
Learning count-based vectors produces an embedding matrix in $R^{|vocabulary| \times d}$: one $d$-dimensional row per vocabulary word (for count-based vectors, $d = |C|$).
- raw text = sequences of word symbols
- symbols = unique one-hot vectors
- representation = the result of embedding a symbol with E
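A minimal sketch of that notation, with a hypothetical toy vocabulary and an assumed embedding dimension d: embedding a symbol with E is just a matrix product with its one-hot vector, i.e. a row lookup.

```python
import numpy as np

vocab = ["cat", "dog", "mat"]   # hypothetical toy vocabulary
d = 4                           # assumed embedding dimension (often 50-300 in practice)
rng = np.random.default_rng(0)
E = rng.normal(size=(len(vocab), d))   # embedding matrix in R^{|vocabulary| x d}

one_hot = np.zeros(len(vocab))
one_hot[vocab.index("dog")] = 1.0

# Embedding the symbol with E: the matrix product selects the word's row.
representation = one_hot @ E
assert np.allclose(representation, E[vocab.index("dog")])
```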
One generic idea behind embedding learning (a minimal sketch of the whole recipe follows below):
- Collect instances of a target word t.
- For each instance, collect its context words c.
- Define some score function score(t, c; θ, E) with an upper bound on its output.
- Define a loss based on that score.
It is easy to design a useless scorer (e.g. one that ignores the input and always outputs the upper bound).
Ideally, the scorer:
- embeds t with E (obviously): the score function should actually use the learned embedding of the target word to produce its representation.
- produces a score which is a function of how well t is accounted for by c, so the score itself is informative.
- requires the word to account for the context (or the reverse) better than another word in the same place would, reflecting how the word is actually used in natural language.
- produces a loss which is differentiable w.r.t. θ and E.
- Estimate E (and θ) by minimizing the loss.
- Use the estimated E as your embedding matrix.
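The sketch promised above, with an illustrative scorer of my own choosing (not the lecture's exact model): the scorer embeds t with E, compares it to the mean of the context-side parameters through a sigmoid so the score is bounded above by 1, and the loss asks the true word to account for its context better than a randomly drawn word in the same place; everything is differentiable w.r.t. θ (here the context-side matrix) and E.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 100, 16                        # hypothetical vocabulary size and embedding dimension
E = 0.1 * rng.normal(size=(V, d))     # embedding matrix for target words (the part we keep)
theta = 0.1 * rng.normal(size=(V, d)) # context-side parameters (discarded after training)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def score(t, context):
    """Bounded score in (0, 1): how well target word t is accounted for by its context."""
    c = theta[context].mean(axis=0)
    return sigmoid(E[t] @ c)

def step(t, context, lr=0.1):
    """One SGD step on the loss -log score(t, c) - log(1 - score(t_neg, c))."""
    c = theta[context].mean(axis=0)
    t_neg = rng.integers(V)                 # a random word in the same place (negative sample)
    g_pos = sigmoid(E[t] @ c) - 1.0         # gradient of the loss w.r.t. E[t] @ c
    g_neg = sigmoid(E[t_neg] @ c)           # gradient of the loss w.r.t. E[t_neg] @ c
    grad_c = g_pos * E[t] + g_neg * E[t_neg]
    E[t] -= lr * g_pos * c
    E[t_neg] -= lr * g_neg * c
    theta[context] -= lr * grad_c / len(context)

# Hypothetical (target, context) training instances.
for t, context in [(3, [1, 2, 4, 5]), (7, [1, 2, 4, 5])] * 200:
    step(t, context)
print(score(3, [1, 2, 4, 5]))  # moves toward 1 as the model learns these toy pairs
```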
paper: Natural Language Processing (Almost) from Scratch
C&W model
Scores a whole window of text at once: the model is trained with a pairwise ranking (hinge) loss so that a genuine window scores higher than the same window with its centre word replaced by a random word; the word embeddings are a by-product of this objective.
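A minimal sketch of that ranking idea under toy assumptions (a single linear scoring layer instead of the paper's deeper network, made-up dimensions):

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, win = 50, 8, 5                   # hypothetical vocab size, embedding dim, window length
E = 0.1 * rng.normal(size=(V, d))      # embedding matrix
w = 0.1 * rng.normal(size=(win * d,))  # linear scoring weights (the paper uses a deeper network)

def score(window):
    """Score a window of word ids from the concatenation of their embeddings."""
    return w @ E[window].reshape(-1)

def cw_step(window, lr=0.05):
    """Hinge loss: the true window should score at least 1 above a corrupted window."""
    corrupted = list(window)
    corrupted[win // 2] = rng.integers(V)      # replace the centre word with a random word
    loss = max(0.0, 1.0 - score(window) + score(corrupted))
    if loss > 0:
        x_pos = E[window].reshape(-1)
        x_neg = E[corrupted].reshape(-1)
        grad = lr * w.reshape(win, d)          # gradient w.r.t. each position's embedding
        w[:] -= lr * (x_neg - x_pos)           # gradient of the hinge w.r.t. w
        for i in range(win):
            E[window[i]] += grad[i]
            E[corrupted[i]] -= grad[i]
    return loss

print(cw_step([4, 10, 3, 7, 2]))
```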
paper: Mikolov et al. 2013
CBoW model
The continuous bag-of-words model predicts the target word from the sum (bag) of its context word embeddings, ignoring word order within the window; a softmax over the vocabulary is trained to give high probability to the observed target word.
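A minimal sketch of CBoW under toy assumptions (a full softmax instead of the hierarchical softmax or negative sampling used in the paper, made-up dimensions):

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 50, 8                              # hypothetical vocab size and embedding dimension
E_in = 0.1 * rng.normal(size=(V, d))      # input (context) embeddings -- the matrix we keep
E_out = 0.1 * rng.normal(size=(V, d))     # output (target-side) embeddings

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def cbow_step(context, target, lr=0.1):
    """One step of CBoW: predict the target word from the average of its context embeddings."""
    h = E_in[context].mean(axis=0)         # continuous bag of words: order in the window is ignored
    p = softmax(E_out @ h)                 # distribution over the whole vocabulary
    loss = -np.log(p[target])
    g = p.copy()
    g[target] -= 1.0                       # gradient of the cross-entropy w.r.t. the logits
    grad_h = E_out.T @ g                   # gradient w.r.t. the averaged context vector
    E_out[:] -= lr * np.outer(g, h)
    E_in[context] -= lr * grad_h / len(context)
    return loss

context, target = [4, 10, 7, 2], 3
for _ in range(100):
    loss = cbow_step(context, target)
print(loss)   # decreases as the model learns this single toy pair
```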
Til next time,
gentlesnow
