反向传播与梯度计算

Posted on 2018-07-15 | In 公开课

矩阵计算基础

标量对向量的导数：标量对向量的每一维求导，结果是一个向量
向量对向量的导数：向量的每一维对向量的每一维求导，结果是一个矩阵
标量对矩阵的导数：标量对矩阵的每一个元素求导，结果是一个矩阵

【结论】求导结果总是和求导对象大小相同

常见的导数运算：

激活函数

定义 $h$ 是 $z$ 的函数 $h=f(z)$, $h$ 和$z$ 都是n维向量。求

$\frac{\partial h}{\partial z}$

注意 $\circ$ 是元素积，也叫哈达马相乘，顺便补充一下矩阵各种乘积

反向传播

以常见的三层NN为例来计算BP过程。

定义前向传播中每一步的函数：

$h_1 = \sigma(xW_1+b_1)$ $\hat{y} = softmax(h_1W_2+b_2)$ $J=CE(\hat{y}, y)$

从最后一层开始，每一层各自求导，再利用链式法则计算：
- 先对 $x$ 求导
- 再对 $W_2$, $b_2$, $W_1$, $b_1$ 求导

依存分析

Posted on 2018-07-15 | In 公开课

背景

什么是 dependency parsing

通过分析语言单位内成分之间的依存关系揭示其句法结构。即识别句子中的“主谓宾”、“定状补”这些语法成分，并分析各成分之间关系。
句法分析(symentic parsing)= 句法结构分析（syntactic structure parsing）+ 依存关系分析(dependency parsing)。
成分结构分析（constituent structure parsing）以获取整个句子的句法结构或者完全短语结构为目的的句法分析，又称短语结构分析（phrase structure parsing）；
依存分析（dependency parsing）以获取局部成分为目的的句法分析。

目前的句法分析已经从句法结构分析转向依存句法分析，一是因为通用数据集Treebank（Universal Dependencies treebanks）的发展，虽然该数据集的标注较为复杂，但是其标注结果可以用作多种任务（命名体识别或词性标注）且作为不同任务的评估数据，因而得到越来越多的应用，二是句法结构分析的语法集是由固定的语法集组成，较为固定和呆板；三是依存句法分析树标注简单且parser准确率高。

有哪些应用

主要是句法分析的应用，如机器翻译、信息检索与抽取、问答系统、语音识别。
在项目中使用的：NER（挖掘新词）

如何描述语法

有两种主流观点，其中一种是短语结构文法，英文术语是：Constituency = phrase structure grammar = context-free grammars (CFGs)。
这种短语语法用固定数量的rule分解句子为短语和单词、分解短语为更短的短语或单词……
从 none phrase 开始，后面可能会有 adjective none、prepositiontial phrase（可能又依赖于其他名词短语），

另一种是依存结构，用单词之间的依存关系来表达语法。如果一个单词修饰另一个单词，则称该单词依赖于另一个单词

细节

人们画依存句法树的弧的方式不同，这门课是head指向dependent（即箭头指向的词语是依赖者，箭头尾部的词语是被依赖者）。

每个句子都有一个虚根，代表句子之外的开始，这样句子中的每个单词都有自己的依存对象了。

defination

可用的特征

双词汇亲和（Bilexical affinities），比如discussion与issues。
词语间距，因为一般相邻的词语才具有依存关系
中间词语，如果中间词语是动词或标点，则两边的词语不太可能有依存
词语配价，一个词语最多有几个依赖者。

约束

ROOT只能被一个词依赖。
无环。

方法

Dynamic programming

估计是找出以某head结尾的字串对应的最可能的句法树。

Graph algorithms

最小生成树。

Constraint Satisfaction

估计是在某个图上逐步删除不符合要求的边，直到成为一棵树。

“Transition-based parsing” or “deterministic dependency parsing”

greedy choice of attachments guided by ml clf.

主流方法，基于贪心决策动作拼装句法树。
defination

step 1: build up structrue
step 2: train a classifier

已经有了 a tree bank with parses of sentences, 就根据 tree bank 来确定 which sequence of operations would give the correct parse of a sentence （transition的分类问题）

step 3: evaluation

evl

Unlabeled attachment score(UAS)= head

Labeled attachment score(LAS) = head and label

简单实现

import sys
import cPickle
import os
import time
import tensorflow as tf
import numpy as np
from model import Model
from utils.parser_utils import minibatches, load_and_preprocess_data

def xavier_weight_init():
    """Returns function that creates random tensor.

    The specified function will take in a shape (tuple or 1-d array) and
    returns a random tensor of the specified shape drawn from the
    Xavier initialization distribution.

    Hint: You might find tf.random_uniform useful.
    """
    def _xavier_initializer(shape, **kwargs):
        """Defines an initializer for the Xavier distribution.
        Specifically, the output should be sampled uniformly from [-epsilon, epsilon] where
            epsilon = sqrt(6) / <sum of the sizes of shape's dimensions>
        e.g., if shape = (2, 3), epsilon = sqrt(6 / (2 + 3))

        This function will be used as a variable initializer.

        Args:
            shape: Tuple or 1-d array that species the dimensions of the requested tensor.
        Returns:
            out: tf.Tensor of specified shape sampled from the Xavier distribution.
        """
        epsilon = np.sqrt(6 / np.sum(shape))
        out_t = tf.Variable(tf.random_uniform(shape=shape, minval=-epsilon, maxval=epsilon))
        return out_t
    return _xavier_initializer

class Config(object):
    """Holds model hyperparams and data information.

    The config class is used to store various hyperparameters and dataset
    information parameters. Model objects are passed a Config() object at
    instantiation. They can then call self.config.<hyperparameter_name> to
    get the hyperparameter settings.
    """
    n_features = 36
    n_classes = 3
    dropout = 0.5  # (p_drop in the handout)
    embed_size = 50
    hidden_size = 200
    batch_size = 1024
    n_epochs = 10
    lr = 0.0005

class ParserModel(Model):
    """
    Implements a feedforward neural network with an embedding layer and single hidden layer.
    This network will predict which transition should be applied to a given partial parse
    configuration.
    """

    def add_placeholders(self):
        """Generates placeholder variables to represent the input tensors"""
        self.input_placeholder = tf.placeholder(tf.int32, [None, self.config.n_features])
        self.labels_placeholder = tf.placeholder(tf.float32, [None, self.config.n_classes])
        self.droupout_placeholder = tf.placeholder(tf.float32)
        
    def create_feed_dict(self, inputs_batch, labels_batch=None, dropout=0):
        """Creates the feed_dict for the dependency parser.

        A feed_dict takes the form of:

        feed_dict = {
                <placeholder>: <tensor of values to be passed for placeholder>,
                ....
        }


        Hint: The keys for the feed_dict should be a subset of the placeholder tensors created in add_placeholders.
        Hint: When an argument is None, don't add it to the feed_dict.

        Args:
            inputs_batch: A batch of input data.
            labels_batch: A batch of label data.
            dropout: The dropout rate.
        Returns:
            feed_dict: The feed dictionary mapping from placeholders to values.
        """
        feed_dict = {self.input_placeholder: inputs_batch, \
                    self.droupout_placeholder: dropout}
        if labels_batch.any():
            feed_dict[self.labels_placeholder] = labels_batch
        return feed_dict
    
    def add_embedding(self):
        """Adds an embedding layer that maps from input tokens (integers) to vectors and then
        concatenates those vectors:
            - Creates a tf.Variable and initializes it with self.pretrained_embeddings.
            - Uses the input_placeholder to index into the embeddings tensor, resulting in a
              tensor of shape (None, n_features, embedding_size).
            - Concatenates the embeddings by reshaping the embeddings tensor to shape
              (None, n_features * embedding_size).

        Hint: You might find tf.nn.embedding_lookup useful.
        Hint: You can use tf.reshape to concatenate the vectors. 

        Returns:
            embeddings: tf.Tensor of shape (None, n_features*embed_size)
        """
        embedding = tf.Variable(self.pretrained_embeddings)
        embeddings = tf.nn.embedding_lookup(embedding, self.input_placeholder)
        embeddings = tf.reshape(embeddings, [-1, self.config.n_features * self.config.embed_size])
        return embeddings    
    
    def add_prediction_op(self):
        """Adds the 1-hidden-layer NN:
            h = Relu(xW + b1)
            h_drop = Dropout(h, dropout_rate)
            pred = h_dropU + b2

        Note that we are not applying a softmax to pred. The softmax will instead be done in
        the add_loss_op function, which improves efficiency because we can use
        tf.nn.softmax_cross_entropy_with_logits

        Use the initializer from q2_initialization.py to initialize W and U (you can initialize b1
        and b2 with zeros)

        Hint: Note that tf.nn.dropout takes the keep probability (1 - p_drop) as an argument.
              Therefore the keep probability should be set to the value of
              (1 - self.dropout_placeholder)

        Returns:
            pred: tf.Tensor of shape (batch_size, n_classes)
        """
        
        x = self.add_embedding()
        xavier = xavier_weight_init()
        with tf.variable_scope("transformation"):
            b1 = tf.Variable(tf.random_uniform([self.config.hidden_size]))
            b2 = tf.Variable(tf.random_uniform([self.config.n_classes]))
            
            W = xavier([self.config.n_features * self.config.embed_size, self.config.hidden_size])
            U = xavier([self.config.hidden_size, self.config.n_classes])
            
            z1 = tf.matmul(x, W) + b1
            h = tf.nn.relu(z1)
            h_dropout = tf.nn.dropout(h, keep_prob=(1-self.droupout_placeholder))
            pred = tf.matmul(h_dropout, U) + b2
        return pred
    
    def add_loss_op(self, pred):
        """Adds Ops for the loss function to the computational graph.
        In this case we are using cross entropy loss.
        The loss should be averaged over all examples in the current minibatch.

        Hint: You can use tf.nn.softmax_cross_entropy_with_logits to simplify your
                    implementation. You might find tf.reduce_mean useful.
        Args:
            pred: A tensor of shape (batch_size, n_classes) containing the output of the neural
                  network before the softmax layer.
        Returns:
            loss: A 0-d tensor (scalar)
        """
        loss = tf.nn.softmax_cross_entropy_with_logits(logits=pred, labels=self.labels_placeholder)
        loss = tf.reduce_mean(loss)
        return loss
    
    def add_training_op(self, loss):
        adam_optim = tf.train.AdamOptimizer(self.config.lr)
        train_op = adam_optim.minimize(loss)
        return train_op
    
    def train_on_batch(self, sess, inputs_batch, labels_batch):
        feed = self.create_feed_dict(inputs_batch, labels_batch=labels_batch,
                                     dropout=self.config.dropout)
        _, loss = sess.run([self.train_op, self.loss], feed_dict=feed)
        return loss
    
    def run_epoch(self, sess, parser, train_examples, dev_set):
        n_minibatches = 1 + len(train_examples) / self.config.batch_size
        prog = tf.keras.utils.Progbar(target=n_minibatches)
        for i, (train_x, train_y) in enumerate(minibatches(train_examples, self.config.batch_size)):
            loss = self.train_on_batch(sess, train_x, train_y)
            prog.update(i + 1, [("train loss", loss)])

        print "Evaluating on dev set",
        dev_UAS, _ = parser.parse(dev_set)
        print "- dev UAS: {:.2f}".format(dev_UAS * 100.0)
        return dev_UAS
    
    def fit(self, sess, saver, parser, train_examples, dev_set):
        best_dev_UAS = 0
        for epoch in range(self.config.n_epochs):
            print "Epoch {:} out of {:}".format(epoch + 1, self.config.n_epochs)
            dev_UAS = self.run_epoch(sess, parser, train_examples, dev_set)
            if dev_UAS > best_dev_UAS:
                best_dev_UAS = dev_UAS
                if saver:
                    print "New best dev UAS! Saving model in ./data/weights/parser.weights"
                    saver.save(sess, './data/weights/parser.weights')
            print
            
    def __init__(self, config, pretrained_embeddings):
        self.pretrained_embeddings = pretrained_embeddings
        self.config = config
        self.build()   

def main(debug=True):
    print 80 * "="
    print "INITIALIZING"
    print 80 * "="
    config = Config()
    parser, embeddings, train_examples, dev_set, test_set = load_and_preprocess_data(debug)
    if not os.path.exists('./data/weights/'):
        os.makedirs('./data/weights/')

    with tf.Graph().as_default() as graph:
        print "Building model...",
        start = time.time()
        model = ParserModel(config, embeddings)
        parser.model = model
        init_op = tf.global_variables_initializer()
        saver = None if debug else tf.train.Saver()
        print "took {:.2f} seconds\n".format(time.time() - start)
    graph.finalize()

    with tf.Session(graph=graph) as session:
        parser.session = session
        session.run(init_op)

        print 80 * "="
        print "TRAINING"
        print 80 * "="
        model.fit(session, saver, parser, train_examples, dev_set)
        
        if not debug:
            print 80 * "="
            print "TESTING"
            print 80 * "="
            print "Restoring the best model weights found on the dev set"
            saver.restore(session, './data/weights/parser.weights')
            print "Final evaluation on test set",
            UAS, dependencies = parser.parse(test_set)
            print "- test UAS: {:.2f}".format(UAS * 100.0)
            print "Writing predictions"
            with open('q2_test.predicted.pkl', 'w') as f:
                cPickle.dump(dependencies, f, -1)
            print "Done!"
class PartialParse(object):
    def __init__(self, sentence):
        """Initializes this partial parse.

        Your code should initialize the following fields:
            self.stack: The current stack represented as a list with the top of the stack as the
                        last element of the list.
            self.buffer: The current buffer represented as a list with the first item on the
                         buffer as the first item of the list
            self.dependencies: The list of dependencies produced so far. Represented as a list of
                    tuples where each tuple is of the form (head, dependent).
                    Order for this list doesn't matter.

        The root token should be represented with the string "ROOT"

        Args:
            sentence: The sentence to be parsed as a list of words.
                      Your code should not modify the sentence.
        """
        # The sentence being parsed is kept for bookkeeping purposes. Do not use it in your code.
        self.sentence = sentence

        self.stack = ['ROOT']
        self.buffer = sentence[:]
        self.dependencies = []
        
    def parse_step(self, transition):
        """Performs a single parse step by applying the given transition to this partial parse

        Args:
            transition: A string that equals "S", "LA", or "RA" representing the shift, left-arc,
                        and right-arc transitions. You can assume the provided transition is a legal
                        transition.
        """
        if transition == 'S':
            self.stack.append(self.buffer[0])
            self.buffer.pop(0)
        elif transition == 'LA':
            self.dependencies.append((self.stack[-1], self.stack[-2]))
            self.stack.pop(-2)
        else:
            self.dependencies.append((self.stack[-2], self.stack[-1]))
            self.stack.pop(-1)
            
    def parse(self, transitions):
        """Applies the provided transitions to this PartialParse

        Args:
            transitions: The list of transitions in the order they should be applied
        Returns:
            dependencies: The list of dependencies produced when parsing the sentence. Represented
                          as a list of tuples where each tuple is of the form (head, dependent)
        """
        for transition in transitions:
            self.parse_step(transition)
        return self.dependencies

def test_step(name, transition, stack, buf, deps,
              ex_stack, ex_buf, ex_deps):
    """Tests that a single parse step returns the expected output"""
    pp = PartialParse([])
    pp.stack, pp.buffer, pp.dependencies = stack, buf, deps

    pp.parse_step(transition)
    stack, buf, deps = (tuple(pp.stack), tuple(pp.buffer), tuple(sorted(pp.dependencies)))
    assert stack == ex_stack, \
        "{:} test resulted in stack {:}, expected {:}".format(name, stack, ex_stack)
    assert buf == ex_buf, \
        "{:} test resulted in buffer {:}, expected {:}".format(name, buf, ex_buf)
    assert deps == ex_deps, \
        "{:} test resulted in dependency list {:}, expected {:}".format(name, deps, ex_deps)
    print "{:} test passed!".format(name)

调用函数试一下：

test

Our network will predict which transition should be applied next to a partial parse. We could use it to parse a single sentence by applying predicted transitions until the parse is complete. However, neural networks run much more efficiently when making predictions about batches of data at a time (i.e., predicting the next transition for a many different partial parses simultaneously). We can parse sentences in minibatches with the following algorithm.

就是一个parser一次处理一个sentence，怎么能批量处理 sentences ~~ miniBatch ~~

miniBatchParsing

写个程序试一下

import copy
def minibatch_parse(sentences, model, batch_size):
    """Parses a list of sentences in minibatches using a model.

    Args:
        sentences: A list of sentences to be parsed (each sentence is a list of words)
        model: The model that makes parsing decisions. It is assumed to have a function
               model.predict(partial_parses) that takes in a list of PartialParses as input and
               returns a list of transitions predicted for each parse. That is, after calling
                   transitions = model.predict(partial_parses)
               transitions[i] will be the next transition to apply to partial_parses[i].
        batch_size: The number of PartialParses to include in each minibatch
    Returns:
        dependencies: A list where each element is the dependencies list for a parsed sentence.
                      Ordering should be the same as in sentences (i.e., dependencies[i] should
                      contain the parse for sentences[i]).
    """

    partial_parses = [PartialParse(s) for s in sentences]

    unfinished_parse = copy.copy(partial_parses)

    while len(unfinished_parse) > 0:
        minibatch = unfinished_parse[0:batch_size]
        # perform transition and single step parser on the minibatch until it is empty
        while len(minibatch) > 0:
            transitions = model.predict(minibatch)
            for index, action in enumerate(transitions):
                minibatch[index].parse_step(action)
            minibatch = [parse for parse in minibatch if len(parse.stack) > 1 or len(parse.buffer) > 0]

        # move to the next batch
        unfinished_parse = unfinished_parse[batch_size:]

    dependencies = []
    for n in range(len(sentences)):
        dependencies.append(partial_parses[n].dependencies)

    return dependencies

class DummyModel(object):
    """Dummy model for testing the minibatch_parse function
    First shifts everything onto the stack and then does exclusively right arcs if the first word of
    the sentence is "right", "left" if otherwise.
    """
    def predict(self, partial_parses):
        return [("RA" if pp.stack[1] is "right" else "LA") if len(pp.buffer) == 0 else "S" for pp in partial_parses]

def test_dependencies(name, deps, ex_deps):
    """Tests the provided dependencies match the expected dependencies"""
    deps = tuple(sorted(deps))
    assert deps == ex_deps, \
        "{:} test resulted in dependency list {:}, expected {:}".format(name, deps, ex_deps)

testminiBatchParsing

RNN系列

Posted on 2018-07-14 | In 公开课

基于固定窗口的NN语音模型和RNN语言模型

Bengio等人利用神经网络来表示语言模型，在语言模型的训练过程中可以得到单词的分布式表示，具体的神经概率语言模型图如下：

该网络的主要思想是用前$n−1$个词的向量来估计当前词的概率，具体公式为：

$\hat{y} =softmax(W_2*tanh(W_1+b_1)+W_3+b_3)$

可以看出传统语言模型在估计概率的时候需要固定$n$的大小，否则无法统计概率$P(w_i|w_{i−(n−1)},…,w_{i−1})$。而 RNN 的最大优势在于可以统计 $P(w_i|w_1,…,w_{i−1})$ 的概率。

RNN语言模型中非常关键的一点是每个时刻采用的 $W$ 矩阵都是一个，所以参数规模不会随着依赖上下文的长度增加而指数增长。

RNN 中的BP

多变量链式法则

如果对于函数 $f(x, y)$, 每个变量都是 $t$ 的函数，即 $x(t)$, $y(t)$, 那么

$\frac{d}{dt}f(x(t), y(t))=\frac{\partial f}{\partial x}\frac{\partial x}{\partial t} + \frac{\partial f}{\partial y}\frac{\partial y}{\partial t}$

对于 RNN 来说，$W_h$ 出现在每一个时刻，所以在计算时应该对每一次的梯度求和。

如果只考虑最后两个时刻$t$和$t-1$，那么：

$\frac{\partial J^{(t)}}{\partial W_h}=\sum^t _{i=t-1}|_{(i)}=\frac{\partial J^{(t)}}{\partial\theta^{(t)}}\frac{\partial \theta^{(t)}}{\partial h^{(t)}}(\frac{\partial h^{(t)}}{\partial z^{(t)}}\frac{\partial z^{(t)}}{\partial W_h}+\frac{\partial h^{(t)}}{\partial h^{(t-1)}}\frac{\partial h^{(t-1)}}{\partial z^{(t-1)}}\frac{\partial z^{(t-1)}}{\partial W_h})$

记 $\gamma^{(t)}=\frac{\partial J^{(t)}}{\partial h^{(t)}}$，考虑到梯度可以按时间展开，那么有 $\gamma^{(t-1)}=\frac{\partial J^{(t)}}{\partial h^{(t-1)}}$，所以上述式子可以简写为

$\frac{\partial J^{(t)}}{\partial W_h}=\sum^t _{i=t-1}|_{(i)}=\gamma^{(t)} \frac{\partial h^{(t)}}{\partial z^{(t)}}\frac{\partial z^{(t)}}{\partial W_h} + \gamma^{(t-1)} \frac{\partial h^{(t-1)}}{\partial z^{(t-1)}}\frac{\partial z^{(t-1)}}{\partial W_h}$

而由 $\frac{\partial h^{(t)}}{\partial h^{(t-1)}}$ 可以看出，这一项会带来梯度弥散或者梯度爆炸
上面的公式是为了详细说明下面的图：

梯度问题解决办法

梯度爆炸：当梯度超过一个阈值的时候，将梯度裁剪到一个合适的值
梯度弥散：使用 GRU 或者 LSTM

Q: 使用 $l2$ 正则会有助于改善梯度弥散吗？A: 不会，$l2$ 会让权重变为0，使情况更糟。

Q: 对于一个拥有2个隐层的NN来说，增加隐层个数会有助于改善梯度弥散吗？
A: 不会，梯度弥散就是因为层数太多引起的。

LSTM

通过更改 RNN 单元结构，使得从 $c_t$ 到 $c_{t-1}$ 的反向传播与 $W$ 无关

RNN和语言模型

Posted on 2018-06-25 | In 公开课

Overview

学习新的NLP任务：语言模型
新的NN：RNN

语言模型

定义：一种预测下一个词是什么的NLP任务
数学表示：
$P(x^{t+1}=w_j|x^t,...,x^1)$
应用：输入法提示、搜索提示

BoW 词袋模型

就是统计每一个词出现的次数

from nltk.corpus import reuters
from collections import Counter
import random

counts = Counter(reuters.words())
total_count = len(reuters.words())

# Compute the frequencies
for word in counts:
    counts[word] /= float(total_count)

# Generate 100 words of language
text = []

for _ in range(100):
    r = random.random()
    accumulator = .0

    for word, freq in counts.iteritems():
        accumulator += freq

        if accumulator >= r:
            text.append(word)
            break

print ' '.join(text)

运行结果如下

1
2
3

[(u'.', 94687), (u',', 72360), (u'the', 58251), (u'of', 35979), (u'to', 34035), (u'in', 26478), (u'said', 25224), (u'and', 25043), (u'a', 23492), (u'mln', 18037), (u'vs', 14120), (u'-', 13705), (u'for', 12785), (u'dlrs', 11730), (u"'", 11272), (u'The', 10968), (u'000', 10277), (u'1', 9977), (u's', 9298), (u'pct', 9093)]
1.0
held pct said is A 1986 . qtr of A oil will non revenues rate said 2 lt already net They subsidiary compared have were FORECAST net 109 being the cts to cuts , and more Government said CASH SEES in Middle in of 732 SHIPPING some to of 03 , national they and / released and to , > Walker > said stages 1 would its agricultural the 1986 contract February STAKE Gencorp ; Savings > share ; 3 annual VENTURE Trade given dlrs had ago . be Albion be on from said OF . 1985 area the over

从结果可以看出，结果只是根据词频生成的

n-gram 语言模型

为了更好的生成文本，一个改进的想法就是考虑上下文信息
定义： n-gram 就是一组词，每组词是连续的，数量为$n$
思路：基于统计的方法，根据每组词出现的频率来预测下一个词
前提假设：$x^{t+1}$ 只依赖于它前面 $(n-1)$ 个词

math

$n-gram$ 和 $(n-1)-gram$ 的概率可由频率近似替代（统计）

优缺点

$n-gram$ 模型的优点包含了前 $(n-1)$ 个词所能提供的全部信息，这些信息对当前词出现具有很强的约束力。同时因为只看 $(n-1)$ 个词而不是所有词也使得模型的效率较高。

训练语料里面有些 n 元组没有出现过，其对应的条件概率就是 0，导致计算一整句话的概率为 0。解决这个问题有两种常用方法：

(1)方法一为平滑法。最简单的方法是把每个 $n$ 元组的出现次数加 1，那么原来出现 $k$ 次的某个 $n$ 元组就会记为 $k+1$ 次，原来出现 $0$ 次的 $n$ 元组就会记为出现 $1$ 次。这种也称为 $Laplace$ 平滑。当然还有很多更复杂的其他平滑方法，其本质都是将模型变为贝叶斯模型，通过引入先验分布打破似然一统天下的局面。而引入先验方法的不同也就产生了很多不同的平滑方法。

(2)方法二是回退法。有点像决策树中的后剪枝方法，即如果 $n$ 元的概率不到，那就往上回退一步，用 $n-1$ 元的概率乘上一个权重来模拟。

from nltk.corpus import reuters
from nltk import bigrams, trigrams
from collections import Counter, defaultdict
import random

model = defaultdict(lambda: defaultdict(lambda: 0))
for sentence in reuters.sents():
    for w1, w2, w3 in trigrams(sentence, pad_right=True, pad_left=True):
        model[(w1, w2)][w3] += 1

for w1_w2 in model:
    total_count = float(sum(model[w1_w2].values()))
    for w3 in model[w1_w2]:
        model[w1_w2][w3] /= total_count



text = [None, None]
sentence_finished = False

while not sentence_finished:
    r = random.random()
    accumulator = .0

    for word in model[tuple(text[-2:])].keys():
        accumulator += model[tuple(text[-2:])][word]

        if accumulator >= r:
            text.append(word)
            break

    if text[-2:] == [None, None]:
        sentence_finished = True


# add the probability computation
text = [None, None]
prob = 1.0  # <- Init probability

sentence_finished = False

while not sentence_finished:
    r = random.random()
    accumulator = .0

    for word in model[tuple(text[-2:])].keys():
        accumulator += model[tuple(text[-2:])][word]

        if accumulator >= r:
            prob *= model[tuple(text[-2:])][word]  # <- Update the probability with the conditional probability of the new word
            text.append(word)
            break

    if text[-2:] == [None, None]:
        sentence_finished = True

测试一下：

1 2	print "Probability of text=", prob # <- Print the probability of the text print ' '.join([t for t in text if t])

Probability of text= 1.82814299952e-05
He told a press conference .

神经语言模型

A fixed-window neural Language Model

neural

优点

词表示不再是稀疏的；模型复杂度从 $O(exp(n))$ 降低为 $O(n)$

缺点

窗口太小，但如果增大窗口 $W$ 也随之变大，而且窗口中每一个词 $x^{(i)}$ 使用的是 $W$ 中不同的行，也就是说 $W$ 不能共享

为了能重复的使用 $W$，提出了 $RNN$

优点

（1）输入可以是任意长度的

（2）模型规模不会随输入长度变化而变化

（3）“记忆能力”：当前时刻的计算使用过去的信息

（4）不同时刻的权重是共享的，即表示是共享的

缺点

（1）递归使得计算过程很慢

（2）记忆能力也是有限的

RNN 简介

递归神经网络（Recurrent Neural Networks，RNN）是两种人工神经网络的总称：时间递归神经网络（recurrent neural network）和结构递归神经网络（recursive neural network）。时间递归神经网络的神经元间连接构成有向图，而结构递归神经网络利用相似的神经网络结构递归构造更为复杂的深度网络。

RNN一般指代时间递归神经网络。单纯递归神经网络因为无法处理随着递归，权重指数级爆炸或消失的问题（Vanishing gradient problem），难以捕捉长期时间关联；而结合不同的LSTM可以很好解决这个问题。时间递归神经网络可以描述动态时间行为，因为和前馈神经网络（feedforward neural network）接受较特定结构的输入不同，RNN将状态在自身网络中循环传递，因此可以接受更广泛的时间序列结构输入。

HMM

RNN主要解决序列数据的处理，比如文本、语音、视频等等。这类数据的样本间存在顺序关系，每个样本和它之前的样本存在关联。比如说，在文本中，一个词和它前面的词是有关联的；在气象数据中，一天的气温和前几天的气温是有关联的。一组观察数据定义为一个序列，从分布中可以观察出多个序列。
隐马尔科夫模型（HMM）定义每个元素只和离它最近的$k$个元素相关，解决了复杂度暴增的问题.只考虑观察值$X$的模型有时表现力不足，因此需要加入隐变量，将观察值建模成由隐变量所生成。隐变量的好处在于，它的数量可以比观察值多，取值范围可以比观察值更广，能够更好的表达有限的观察值背后的复杂分布。加入了隐变量$h$的马尔科夫模型称为隐马尔科夫模型。隐马尔科夫模型实际上建模的是观察值$X$，隐变量$h$和模型参数 $\theta$ 的联合分布,HMM的模型长度$T$是事先固定的，模型参数不共享，其复杂度为 $O(T)$。

RNN

把序列视作时间序列，隐含层 $h$ 的自连接边实际上是和上一时刻的 $h$ 相连（上面左图）。在每一个时刻$t$，$h_t$ 的取值是当前时刻的输入 $x_t$ 和上一时刻的隐含层值 $h_{t-1}$ 的一个函数.

$h_t=F_{\theta}(h_{t-1}, x_t)$

将 $h$ 层的自连接展开，就成为了上图右边的样子，看上去和HMM很像。两者最大的区别在于，RNN的参数是跨时刻共享的。也就是说，对任意时刻 $t$ ，$h_{t-1}$ 到 $h_{t}$ 以及 $x_{t-1}$ 到 $h_{t}$ 的网络参数都是相同的。

训练 RNN-LM 的相关细节

在一组句子中的每个句子上而不是整个 corpus 上计算 loss 和 gradient
$J^{(t)}(\theta)$ 对权重矩阵 $W_h$ 的偏导，等于 $J^{(t)}(\theta)$ 对每个权重值 $w_i$ 的偏导的加和。
多变量的链式法则
反向传播：backpropagation through time
评价指标: 困惑度（一种信息论度量，用来测量一个一个概率模型预测样本的好坏，困惑度越低越好）

给定一个包含 $n$ 个词的文本语料 $w_1,…w_n$ 和一个基于词语历史的用于为此与分配概率的语言模型函数 $LM$ , $LM$ 在这个语料的困惑度是

$2^{- \frac{1}{n} \sum_{i=1}^nlog_2LM(w_i|w_{1:i-1})}$

Tips

困惑度与预料有关
困惑度适用于比对不同的语言模型，因为他有能力学会序列中的规律，但他不是一个很好的用于去评估语言理解或者语言处理任务的度量

NER

Posted on 2018-05-28 | In 公开课

这节课介绍了根据上下文预测单词分类的问题，与常见神经网络课程套路不同，以间隔最大化为目标函数，推导了对权值矩阵和词向量的梯度；初步展示了与传统机器学习方法不一样的风格。

前期知识准备

softmax 复习

softmax与交叉熵

用 softmax 分类时，常用交叉熵作为损失函数(目标函数)。
训练的时候，可以直接最小化正确类别的概率的负对数：

其实这个损失函数等效于交叉熵

这是因为类别是one-hot向量。

下面补充说明熵、KL散度、损失函数的一些问题

如何衡量真实概率和预测概率之间的差异
熵：可以表示一个事件A的自信息量，也就是A包含多少信息
$S(A)=-\sum_i P_A(x_i) log P_A(x_i)$
KL散度：可以用来表示从事件A的角度出发，事件B有多大不同

$D_{KL}(A||B) = \sum_i P_A log(\frac {P_A(x_i)}{P_B(x_i)}) = \sum_i P_A(x_i) log(P_A(x_i))-\sum_i P_A(x_i) log(P_B(x_i))$

交叉熵：可以用来表示从事件A的角度出发，如何描述事件B

$H(A, B)=- \sum_i P(x_i)log(P_B(x_i))$ $D_{KL}(A||B) = -S(A + H(A, B)$

KL散度可以被用于计算代价，而在特定的情况下，min(KL) <=> min(cross_entropy). 而交叉熵的运算更简单，所以用交叉熵来做代价函数

模型学到的分布 $P(model)$

训练数据的分布 $P(train)$

真实数据的分布 $P(real)$

为什么交叉熵可以作为代价？

$P(model)$ ~ $P(train)$ => $KL(P(train)||P(model))$

此处 $P(train)$ 就是事件A， $P(model)$ 就是事件B

恰好，训练数据的分布是给定的、已知的，所以求 $D_{KL}(A||B)$ <=> 求 $H(A, B)$

Window Based NER

这是一种根据上下文给单个单词分类的任务，可以用于消歧或命名实体分类。

“Spokesperson for Levis, Bill Murray, said . . . ”
where it is ambiguous whether Levis is a person or an organization.
“Heartbreak is a new virus,” where “Heartbreak” could either be a MISC named entity
(it’s actually the name of a virus), or simply a noun.

Basically, we are looking for any situation wherein the sentence contains a word

softmax classifier & cross_entropy loss

训练了一个softmax 分类器，其中 x 是一个窗口中中心词+上下文的向量拼接，y 是中心词的类别，仍然使用交叉验证熵作为我们的损失函数

这里 $C$ 指的是多类别的类别数，$W$ 是一个 $C\times d$ 的矩阵，$d$ 指的是向量 $X_{window}$ 的维度，注意这里 $X_{window}$ 是把 5 个单词的词向量 contact 了起来，如图所示
是一个 $(5\times 4)\times 1=20\times 1$ 的向量。$W_y$ 指的是 $W$ 第 $y$ 行的向量，$J(\theta)$ 中的参数 $\theta$ 便
是 $W$ 中的各个元素。我们的目标便是最小化这个损失函数。

但是一个广义线性的模型，不管是思想上还是模型本身都比较简单，因此我们希望找到更加复杂的模型来学习词语之间的关系，甚至学习语义，由此我们便想到了神经网络。

Neural Network & max-margin loss

NN

相比于softmax的线性决策，NN的非线性拟合会学习词和词之间的interaction。

仍以 museums in Paris are amazing 为例,在此之前，当 $in$ 出现时，它的下一个词大概率会是一个地名，但是现在，我们可以通过 NN 学习到 things and patterns ，就是说，如果 $in$ 出现在窗口第二词的位置上，当且仅当第一个词是 $museum$ 时， $in$ 后面的词是地名的概率才会增大。

一个简单的网络：

这种红点图经常在论文里看到，大致代表单元数；中间的空格分隔开一组神经元，比如隐藏层单元数为2×4。

U 是隐藏层到class的权值矩阵

$\alpha$ 是激活函数

max-margin

怎么设计目标函数呢，记$s_c$代表误分类样本的得分，$s$表示正确分类样本的得分。则朴素的思路是最大化$(s−s_c)$ 或最小化 $(s_c−s)$。但有种方法只计算$ s_c>s⇒(s_c−s)>0$ 时的错误，也就是说我们只要求正确分类的得分高于错误分类的得分即可，并不要求错误分类的得分多么多么小。这得到间隔最大化目标函数:

$min(J)=max(s_c-s,0)$

但上述目标函数要求太低，风险太大了，没有留出足够的“缓冲区域”。可以指定该间隔的宽度 $s−s_c< \delta$ ，得到：

$min(J)=max(s_c-s + \delta,0)$

可以令 $\delta = 1$

在这个分类问题中，这两个得分的计算方式为：$s_c=U^Tf(Wxc+b)$ 和 $s=U^Tf(Wx+b)$；通常通过负采样算法得到负例。

另外，这个目标函数的好处是，随着训练的进行，可以忽略越来越多的实例，而只专注于那些难分类的实例。

参考资料

nltk-ner

高级词向量表示

Posted on 2018-05-22 | In 公开课

这节课从传统的、基于计数的全局方法出发，过渡到高级的Glove，并介绍了词向量的调参与评测方法
词幂律分布，log之后是线性的

复习：word2vec（skip-gram）

遍历整个语料库中的每个词
预测每个词的上下文
然后在每个窗口（最多有2m+1个词）中做SGD，

word2vec 的 Q

整个矩阵都需要更新吗？
所有词表中的词都需要进行计算吗？
只能在局部上下文窗口训练模型吗？

word2vec 的 A

每个窗口相对于整个字典D来说是稀疏的，所以求导也是稀疏的。

而最终需要去和正确答案比对的，只是这些窗口中出现的词，所以每次更新的也只是W矩阵中的少数列，

此外，词表V的量级非常大，以至于下式的分母很难计算。所以提出了negative sampling的方法，这是一种采样子集简化运算的方法。具体做法是，对每个正例（中央词语及上下文中的一个词语）采样几个负例（中央词语和其他随机词语），训练一个二分类器。
目标函数是

即对于每一个窗口进行k次采样，其中负采样算法的基本思路是对于长度为1的线段，根据词语的词频将其公平地分配给每个词语。

也就是满足“保证频次越高的样本越容易被采样出来”

接下来只要生成一个0-1之间的随机数，看看落到哪个区间，就能采样到该区间对应的单词了，很公平。
为了快速找到小数对应的区间，word2vec用的是一种查表的方式，将上述线段标上M个“刻度”，刻度之间的间隔是相等的，即1/M：

这样就不生成0-1之间的随机数了，只要生成0-M之间的整数，去这个刻度尺上一查就能抽中一个单词了。
在word2vec中，该“刻度尺”对应着table数组。具体实现时，对词频取了0.75次幂：

这个幂实际上是一种“平滑”策略，能够让低频词多一些出场机会，高频词贡献一些出场机会。

在word2vec模型提出不久，Jeffrey Pennington等人认为虽然skip-gram模型在计算近义词方面比较出色，但它们只是在局部上下文窗口训练模型，并且它很少使用语料中的一些统计信息，因此Jeffrey Pennington等人又提出了一个新型模型GloVe。

Glove 模型

参数说明

(1）词-词共现计数矩阵可以表示为 $X$

(2) 单词 j 在单词 i 中上下文的次数 $X_{ij}$

(3) 表示任何词出现在单词i上下文中的次数表示为 $X_{i}=\sum_kX_{ik}$

(4) 单词j出现在单词i上下文中的比率可表示为 $P_{ij}$ ,即(2)/(3)

举个例子：说明是怎样从共现概率中抽取确定的意思，其实也就是说，和哪个上下文单词在一起的多，那么这个单词与这个上下文单词在一起要比与其他词在一起意义要大。例如$i=ice$, $j=steam$，假设有共现词$k$，但是$k$与$ice$的联系要比与$steam$的联系强，也就是说单词$k$与$ice$出现的概率比与$steam$出现的概率大，比如说$k=solid$，那么认为 $P_{ik} \over P_{jk}$ 会很大。相似地，如果单词$k$与$steam$的联系比与$ice$的联系强，例如$k=gas$，那么$P_{ik} \over P_{jk}$ 的比率会很小，对于其他的单词$k$如$water, fashion$与$ice,steam$联系都强或都不强的话，则 $P_{ik} \over P_{jk}$ 的比率会接近1。那么这个比率就能区别相关词 $(solid, gas)$ 和不相关词 $(water, fashion)$ ，并且也能区别这两个相关的词 $(solid, gas)$ 。那么得到的向量可能为 $ice-steam=solid-gas$ ，这与word2vec相似。
模型训练

（1）对于词向量的学习开始于共现概率的比率 $P_{ik} \over P_{jk}$，最一般的形式为

其中$w$是词向量， $\tilde{w}$ 是上下文单词向量
由于向量空间是线性的，所以可以把两个向量表示为向量的差，则可以把上式写成：

由于等式右边是一个标量，则等式左边参数可以表示为点积形式：

由于一个单词与其上下文单词的区别是任意的，它们两个的角色是可以互换的，所以单词 $w$ 可与 $\tilde{w}$ 互换，并且矩阵 $X$ 与 $X^{T}$ 也可以互换，但对于最终结果都是不变的。然后把原始概率用 $F$ 函数表示，则:

其中， $F(w_i^{T}\tilde{w}) = P_{ik}$, $P_{ik}$ = $X_{ik} \over X_{i}$.当 $F=exp$ 时，上式为：

因为多存在了一个 $log(X_i)$，所以这个式子不具有对称性，我们注意到，$log(X_i)$ 与 k无关，所以把这一项放到了对 $w_i$ 偏置的 $b_i$中，最后再加上对 $\tilde{w}_{k}$ 的偏置 $\tilde{b}_{k}$，则：

这是$F(w_i, w_j, \tilde{w}_k)=P_{ik}/P_{jk}$的简化，并具有对称性。当 $X_{ik}$ 接近于0时对数是发散的，所以可以把 $log(X_{ik})$ 转化为 $log(1+X_{ik})$

该模型的一个主要缺点为它认为所有的共现词权重相等，即使很少出现或没有出现的词，所以为了克服这一缺点，使用了加权最小二乘回归模型：

不同模型的对比

Omer Levy等人对基于计数的方法和基于embedding的方法做了对比，发现它们之间并没有非常大的差距，在不同的场景各个模型发挥不同的作用，它们之间并没有谁一定优于谁，相比于算法而言，增加语料量，进行预处理以及超参数的调整显得非常重要。特别指出，基于negtive sampling的skip-gram模型可以作为一个基准，尽管对不同的任务它可能不是最好的，但是它训练快，占用内存和磁盘空间少。

word2vec-实现篇

Posted on 2018-05-16 | In 公开课

有了原理篇做基础，这次来进行一个简单的实战，从最简单的上下文只有一个词的情形入手, 利用朴素softmax和用梯度下降，多轮迭代进行模型训练。

模型结构

首先先看简化版入手: 输入输出都只有一个词,如下图示:

首先说明符号:

$x$: 输入是一个词，这个词用 one-hot 编码方式表示成的向量，维度等于词表长度 $V$, 只有输入词编号位置是 1，其他都是 0;
$V$: 词汇表长度;
$N$: 隐层神经元个数, 同时也是词向量维度;
$W \in R^{V \times N}$, 输入层到隐层的权重矩阵, 其实就是词向量矩阵,其中每一行代表一个词的词向量(称之为输入向量，也是 word2vec 的最终产出);
$W’ \in R^{N \times V}$,隐层到输出层的权重矩阵, 其中每一列也可以看作额外的一种词向量;

然后，因为是神经网络，所以中间有一个隐藏层 $h$，和通常的神经网络(深度神经网络更是如此)不同，这个隐藏层没有非线性激活函数。$h$ 有 $N$ 个神经元，这个 $N$ 就是最终给每个词学习到的向量长度，是一个超参数，所谓超参数就是开始计算只要需要人指定的参数。隐藏层的输出就是 $H_{N\times 1}=W_{V \times N}^{T} \times X_{V \times 1}$

$y$: 每个词的概率.其实是一个多项分布.

其中 $v_w$ 与$v_{w}$ 都称为词 $w$ 的词向量，一般使用前者作为词向量，而非后者。至此前向过程完成，就是给定一个词作为输入，来预测它的上下文词，还是比较简单的，属于简化版的神经语言模型。

训练及更新

首先明确训练数据的格式，对于一个训练样本($w_I$, $w_O$)，输入是词 $w_I$ 的 one-hot 的维度为 $V$ 的向量 $x$，模型预测的输出同样也是一个维度为$V$的向量y, 同时真实值 $w_O$ 也是用one-hot表示，记为 $t=[0,0,0,…1,0,0]$，其中假设 $t_{j^∗}=1$, 也就是说 $j^∗$ 是真实单词在词汇表中的下标，那么根据最大似然或者上面的语言模型，目标函数可以定义如下:

一般我们习惯于最小化损失函数，因此定义损失函数:

然后结合反向传播一层层求梯度，使用梯度下降来更新参数。

更新输出矩阵

先求隐层到输出层的向量矩阵W’的梯度:

这里面的 $y_j$和 $t_j$ 分别是预测和真实值的第$j$项，$h_i$是隐层的第 $i$ 项。

因此梯度下降更新公式为:

整合为W′的列向量 $v^{‘}_{w_j} ={w^{‘}_{ij}|i=1,2,3,…,N}$ 的形式如下:

也就是说对每个训练样本都需要做一次复杂度为 $V$ 的操作去更新 $W^{‘}$

更新输入矩阵

接着考虑隐层 $h$ 的更新，其实也是输入层到隐层的矩阵 $W$ 的更新，继续反向传播，跟神经网络的相同, 输出层的 $V$ 个神经元都会影响$h_i$:

其中$W^{‘}_i$是 $W′$ 的第i行, 这里为了方便书写,令 $P$={$y_j−t_j|j=1,2,3,..,V$}, 因此整合成整个隐层的向量$h$:

得到一个$N$维的向量，上面已经介绍过，$h$ 就是词向量矩阵$W$的一行: $W^T X = v_{w_I}^T$.

如果对这一段的推导印象模糊，可以参考反向传播与梯度计算

对应的矩阵运算表示就是：

$\frac{\partial E}{\partial H}={W^{'}_{N \times V}}P$

又因为

$\frac{\partial H}{\partial W}=X_{V \times 1}$

综合起来，输入矩阵的梯度计算就是：

$\frac{\partial E}{\partial W}= {( \frac{\partial H}{\partial W} )}_{v \times 1} {(\frac{\partial E}{\partial H})}_{N \times 1}^{T}$

得到梯度后，每次就可以按照如下公式迭代更新 $W_{V\times N}$：

$W_{V\times N}=W_{V\times N}-\eta \frac{\partial E}{\partial W}$

训练过程

迭代之前，随机初始化两个矩阵，然后每轮迭代过程就是三步:

计算输出概率；
计算输出矩阵的梯度，更新输出矩阵参数；
计算输入矩阵的梯度，更新输入矩阵参数。

我们把原始语料库拆解成词对: (左边词，当前词) (右边词，当前词) 然后构造成训练样本。

python 实现

# coding: utf-8

import numpy as np
import scipy as sp
import argparse
import math
import sys
import operator


class FeatureIndex():
    def __init__(self):
        self.word_to_index = {}
        self.index_to_word = {}

    def save(self, fout):
        with open(fout, 'w') as fw:
            for word, idx in self.vocab.iteritems():
                output = '\t'.join([word, str(idx)])
                fw.writelines(output+'\n')
    def read_word_idx(self, fin):
        with open(fin, 'r') as fr:
            lines = fr.readlines()
            for line in lines:
                line = line.strip().split("\t")
                self.index_to_word[line[1]] = line[0]
    def read_word(self, fn):
        self.word_to_index = self.vocab
        uniq_words = set()
        with open('./data/data_note2.txt', 'r') as fr:
            lines = fr.readlines()
            for line in lines:
                words = line.strip().split()
                for word in words:
                    uniq_words.add(word)
        for idx, word in enumerate(uniq_words):
            self.word_to_index[word] = idx

def read_corpus(corpusfile, window = 1):
    vocabulary = FeatureIndex()
    traindata = []
    with open(corpusfile) as corpusinput:
        for line in corpusinput:
            words = line.strip().split()
            wordid = []
            for word in words:
                index = vocabulary.word_to_index[word]
                if index >= 0:
                    wordid.append(index)
                lastword = -1
            for i in range(len(wordid)):
                # 以当前词作为预测目标，前一个词作为输入构造一条样本
                if i > 0:
                    traindata.append((wordid[i - 1], wordid[i]))
                    # 以当前词作为预测目标，下一个词作为输入构造一条样本
                if i < len(wordid) - 1:
                    traindata.append((wordid[i + 1], wordid[i]))
    return vocabulary, traindata

def init_vectors(V, N): 
    """
        Argvs:
            V: Int, dims of the vocabulary
            N: Out, dims of the word vector
    """
    input_vectors = np.random.rand(V, N)
    output_vectors = np.random.rand(N, V)
    return input_vectors, output_vectors

def save_vectors(input_vectors, vectorfile):
    """ save vectors info file, each row is a series float seperate by space
        Argvs:
            input_vectors: A numpy array, each row is a distribution represatation of a word 
            vectorfile: A string, file name
    """
    
    with open(vectorfile, 'w') as vectorfileoutput:
        for row in range(input_vectors.shape[0]):
            vectorfileoutput.write("%s\n" % (" ".join([str(s) for s in input_vectors[row]])))

def compute_loss(V, traindata, input_vectors, output_vectors):
    """
        Argvs:
            V: Int, dims of the vocabulary
            traindata: List of pair(cur_idx, next_idx), (last_idx, cur_idx)
            input_vectors: Numpy Array with shape (V, N)
            output_vectors: Numpy Array with shape (N, V)
    """
    loss = 0.0
    for pair in traindata:
        context_word_onehot = np.zeros((V, 1)) # (V, 1)
        context_word_onehot[pair[0]] = 1
        output_word_onehot = np.zeros((V, 1)) # (V, 1)
        output_word_onehot[pair[1]] = 1

        hidden_output = np.dot(np.transpose(input_vectors), context_word_onehot) # (V, N)
        output_log_prob = np.dot(np.transpose(output_vectors), hidden_output) #(V, 1)
        exp_log_prob = np.exp(output_log_prob)
        sum_exp_prob = np.sum(exp_log_prob)
        output_word_prob = exp_log_prob / sum_exp_prob #(V, 1)
        prob = output_word_prob[pair[1]]
        loss += math.log(1.0/prob)
    return loss



def run_word2vec(corpusfile, vectorfile, vocabulary_file, N=5, learning_rate=0.05, epoch=100):
    """ Get word distribution presatation
        Argvs:
            corpusfile:  words have been splited by space
            vectorfile: file to save vectors
            vocabulary_file: file to save words
            N: Int, vector's dimention, default 5
            learning_rate: Float, defalut 0.05
            epoch: Float, iteration times default 100
    """

    vocabulary, traindata = read_corpus(corpusfile)
    input_vectors, output_vectors = init_vectors(len(vocabulary.vocab), N)
    for t in range(epoch):
        print('epoch = %s, loss = %f' % (t, compute_loss(len(vocabulary.vocab), traindata, input_vectors, output_vectors)))
        for pair in traindata:
            # predict
            context_word_onehot = np.zeros((len(vocabulary.vocab), 1))
            context_word_onehot[pair[0]] = 1
            output_word_onehot = np.zeros((len(vocabulary.vocab), 1))
            output_word_onehot[pair[1]] = 1
            hidden_output = np.dot(np.transpose(input_vectors), context_word_onehot)
            output_log_prob = np.dot(np.transpose(output_vectors), hidden_output)
            exp_log_prob = np.exp(output_log_prob)
            sum_exp_prob = np.sum(exp_log_prob)
            output_word_prob = exp_log_prob / sum_exp_prob

            d_output_log_prob = output_word_prob - output_word_onehot
            d_output_vectors = np.dot(hidden_output, np.transpose(d_output_log_prob))
            output_vectors = output_vectors - learning_rate * d_output_vectors

            d_hidden_to_output_log_prob = output_vectors
            d_hidden_output = np.dot(d_hidden_to_output_log_prob, d_output_log_prob)
            d_input_vectors = np.dot(context_word_onehot, np.transpose(d_hidden_output))
            input_vectors = input_vectors - learning_rate * d_input_vectors
    # save
    save_vectors(input_vectors, vectorfile)
    vocabulary.save(vocabulary_file)

简单测试

构造了一个简单的语料库

吃 米饭 喝 啤酒
喝 果汁
吃 面条
面条 吃
米饭 吃
果汁 喝
啤酒 喝
喝 汽水
饮 啤酒
喝 白酒
干 二锅头
干 啤酒
吃 馒头
吃 大米
品 茶
品 白酒
品 二锅头
喝 茶
喂 米饭
煮 米饭
煮 茶

训练模型:

1	python word2vec-simple.py

写一个程序用来查找相似词:

# -*- coding: utf-8 -*-

import numpy as np
from scipy.spatial.distance import cosine
import argparse
import math
import sys
import operator

class FeatureIndex():
    def __init__(self):
        self.word_to_index = {}
        self.index_to_word = {}

    def read_word_idx(self, fn):
        with open(fn, 'r') as fr:
            lines = fr.readlines()
            for line in lines:
                line = line.strip().split("\t")
                self.index_to_word[line[1]] = line[0]
                self.word_to_index[line[0]] = line[1]


class VectorIndex:
    def __init__(self):
        self.idx_vec_pair = {}
    def read_vectors(self, vectorfile):
        with open(vectorfile) as vectorfile_input:
            wordid = 0
            for line in vectorfile_input:
                vector = [float(s) for s in line.strip().split()]
                self.idx_vec_pair[str(wordid)] = vector
                wordid += 1

    def get_topK_similar(self, wordid, k):
        """与查找词最近的k个词"""
        vec_input = self.idx_vec_pair[str(wordid)]
        scores = {}
        for wordid in self.idx_vec_pair:
            vec_curr = self.idx_vec_pair[wordid]
            score = cosine(vec_input, vec_curr)
            scores[wordid] = score
        scores = sorted(scores.items(), key=operator.itemgetter(1))
        k = min(k+1, len(scores))
        return scores[:k]


def main(vectorfile, vocabularyfile):
    vocabulary = FeatureIndex()
    vocabulary.read_word_idx(vocabularyfile)
    vector_index = VectorIndex()
    vector_index.read_vectors(vectorfile)

    while True:
        print("please input a word in vocabulary:") 
        line = sys.stdin.readline()
        line = line.strip()
        if line == 'exit' or line == 'quit':
            break
        wordid = vocabulary.word_to_index[line]
        if wordid is None:
            print('sorry, the word you input is not in our corpus.')
        else:     
            print("your word's id is %s" % wordid)
            result = vector_index.get_topK_similar(wordid, 5)
            print("similar words:")
            similar_words = [vocabulary.index_to_word[r[0]] for r in result]
            for i in range(len(similar_words)):
                print("\t%s:\t%s" % (similar_words[i], result[i][1]))


if __name__ == '__main__':
    parser = argparse.ArgumentParser(description = 'query similar words from vectors')
    parser.add_argument('-v', '--vector', help = """vector file from word2vec""") 
    parser.add_argument('-w', '--words', help = """all words and its id(vocabulary)""")
    args = vars(parser.parse_args())
    if args['vector'] is None or args['words'] is None:
        parser.print_help()
        exit()

    main(args['vector'], args['words'])

word2vec-原理篇

Posted on 2018-05-14 | In 公开课

词语的表示

discrete representation (similar to symbolic )
symbolic representations(localist representation, one-hot representation)
distributed representations
Distributional similarity based representations

两种产生词向量的方法

Skip-grams(SG): 预测上下文
Continuous Bag of Words（CBOW）：预测中心词

三种模型训练方法

3.1 朴素 softmax 法

计算一个词向量预测另一个词向量的概率,一种朴素的思想就是计算二者的相似度.
假设 $v_{c}$ 代表中心词, $u_{o}$ 代表其窗口中的一个词,则可以根据softmax的思想定义:

概率值定义：

注意:这里W指的是语料字中所有的词量.

训练方法：SGD进行参数更新，但是并不像 BGD 那样对整个目标函数进行梯度下降，而是过滤一个窗口就进行一次更新
缺点:计算量大(每个窗口都进行更新);模型过于简单(仅仅利用两个词的相似度代表二者的预测概率);训练结束后，每个词对应于两个向量

3.2 Hierarchical softmax 法（Huffman树）

huffman树最后导致的结果就是出现频率越高的词所在的子节点越靠近根部,词频出现的越低的词越靠近底部.

概率值定义：

假设这颗树有n个非叶子节点(标黄的部分) $R_i (i=1,…,n)$ ,
每个点处都有一个参数$z_{i}$,其维度和词向量相同.每个节点都是一个类似LR的分类概率

当中心词确定的时候，$v$ 就确定了，只有 $z$ 是要更新的参数，但最终的目标是求得最优的 $v$

所以对每个窗口对应的

训练方法：SGD，
优点：模型复杂度低；预测概率更有解释性；训练完成后每个向量都对应到一个唯一的向量上

Huffman Tree

#Huffman Encoding

#Tree-Node Type
class Node:
    def __init__(self,freq):
        self.left = None
        self.right = None
        self.father = None
        self.freq = freq

    def isLeft(self):
        return self.father.left == self

#create nodes创建叶子节点
def createNodes(freqs):
    return [Node(freq) for freq in freqs]

#create Huffman-Tree创建Huffman树
def createHuffmanTree(nodes):
    queue = nodes[:]
    while len(queue) > 1:
        queue.sort(key=lambda item:item.freq)
        node_left = queue.pop(0)
        node_right = queue.pop(0)
        node_father = Node(node_left.freq + node_right.freq)
        node_father.left = node_left
        node_father.right = node_right
        node_left.father = node_father
        node_right.father = node_father
        queue.append(node_father)
    queue[0].father = None
    return queue[0]


#Huffman编码
def huffmanEncoding(nodes,root):
    codes = [''] * len(nodes)
    for i in range(len(nodes)):
        node_tmp = nodes[i]
        while node_tmp != root:
            if node_tmp.isLeft():
                codes[i] = '0' + codes[i]
            else:
                codes[i] = '1' + codes[i]
            node_tmp = node_tmp.father
    return codes

if __name__ == '__main__':

    chars_freqs = [('我', 15), ('喜欢', 8), ('看', 6), ('巴西', 5), ('足球', 3), ('世界杯', 1)]
    nodes = createNodes([item[1] for item in chars_freqs])
    root = createHuffmanTree(nodes)
    codes = huffmanEncoding(nodes,root)
    for item in zip(chars_freqs,codes):
        print 'Character:%s freq:%-2d   encoding: %s' % (item[0][0],item[0][1],item[1])

3.3 Negative sampling 法

与 Hierarchical 法相比, 负采样法不再使用复杂的 huffman 树,而是采用随机负采样,这样的计算效率会得到进一步提升.

概率值定义：

其中，

这里参数可以理解为每个词都对应了一个theta。当其作为正，负训练样本时便会得到更新。
目标函数：此时的目标函数形式和 Hierarchical 的完全类似,唯一的区别就是 Hier 的概率表达式需要通过建立 huffman 树才能得到,而 NEG 法需要预先进行负采样.
负采样具体是如何进行的?就是最简单的线性加权采样的思想.比如说”我”出现3次,”你”出现1次,那么采样到”我”的概率为3/4.一般负采样集的元素个数为3个左右.

参数更新

GD Batch 过所有窗口进行更新（BGD 指向最合适的，仅适用于凸函数）
SGD 过一个窗口进行更新（折线走向，更新一次不需要那么久，不限于凸函数）