依存分析

背景

什么是 dependency parsing

  • 通过分析语言单位内成分之间的依存关系来揭示其句法结构,即识别句子中“主谓宾”“定状补”这些语法成分,并分析各成分之间的关系。
  • 句法分析(syntactic parsing)= 句法结构分析(syntactic structure parsing)+ 依存关系分析(dependency parsing)。
  • 成分结构分析(constituent structure parsing):以获取整个句子的句法结构或完整短语结构为目的的句法分析,又称短语结构分析(phrase structure parsing);
  • 依存分析(dependency parsing):以获取词与词之间的依存关系为目的的句法分析。

目前的句法分析已经从句法结构分析转向依存句法分析,原因主要有三点:一是通用依存树库(Universal Dependencies treebanks)的发展,虽然其标注较为复杂,但标注结果可以复用于多种任务(如命名实体识别、词性标注)并作为不同任务的评估数据,因而应用越来越广;二是句法结构分析依赖一套固定的语法规则集,较为僵化;三是依存句法树标注简单,且 parser 的准确率高。

有哪些应用

主要是句法分析的应用,如机器翻译、信息检索与抽取、问答系统、语音识别。
在项目中使用的:NER(挖掘新词)

如何描述语法

有两种主流观点,其中一种是短语结构文法,英文术语是:Constituency = phrase structure grammar = context-free grammars (CFGs)。
这种文法用一组固定的规则(rule)把句子分解为短语和单词,再把短语分解为更小的短语或单词……
例如从 noun phrase 开始,里面可能还有 adjective、prepositional phrase(介词短语中可能又嵌套其他名词短语)等,如下面的小例子。
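
下面用一小段 Python 示意这种“固定规则集”的感觉:文法规则和例句都是为了说明随手编的,并非课程或某个工具包的代码。

import random

# 一个极简的短语结构文法(CFG):每个非终结符映射到若干候选展开(产生式右部)
GRAMMAR = {
    "S":   [["NP", "VP"]],
    "NP":  [["Det", "N"], ["Det", "Adj", "N"], ["NP", "PP"]],
    "VP":  [["V", "NP"]],
    "PP":  [["P", "NP"]],
    "Det": [["the"], ["a"]],
    "Adj": [["cuddly"], ["large"]],
    "N":   [["cat"], ["door"], ["crate"]],
    "V":   [["saw"], ["opened"]],
    "P":   [["by"], ["in"]],
}

def generate(symbol="S"):
    """随机展开一个非终结符,返回生成的词序列(演示规则如何递归分解句子)。"""
    if symbol not in GRAMMAR:              # 终结符:直接返回单词本身
        return [symbol]
    words = []
    for sym in random.choice(GRAMMAR[symbol]):
        words.extend(generate(sym))
    return words

print(" ".join(generate()))                # 例如:the cuddly cat saw a door in the crate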

另一种是依存结构(dependency structure),用单词之间的依存关系来表达语法:如果一个单词修饰另一个单词,则称前者依存于(depends on)后者。

细节

人们画依存句法树中弧的方向有不同习惯,这门课采用 head 指向 dependent 的画法:箭头指向的词是 dependent(依赖者),箭头尾部的词是 head(被依赖者/中心词)。

每个句子都会添加一个虚根(ROOT),代表句子之外的起点,这样句子中的每个单词(包括真正的核心词)都恰好依存于一个 head。
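
举个最小的例子(示意,非课程代码):句子 "I ate fish" 的依存结构可以表示成一组 (head, dependent) 二元组,虚根 ROOT 指向句子的核心动词,于是每个词都恰好有一个 head。

# "I ate fish" 的依存弧,用 (head, dependent) 二元组表示
sentence = ["I", "ate", "fish"]
dependencies = [
    ("ROOT", "ate"),    # 虚根指向核心动词
    ("ate", "I"),       # "I" 是 "ate" 的主语,依存于 "ate"
    ("ate", "fish"),    # "fish" 是 "ate" 的宾语,依存于 "ate"
]
heads = {dep: head for (head, dep) in dependencies}
assert all(word in heads for word in sentence)   # 每个词都有且只有一个 head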

(图:依存句法分析的形式化定义)

可用的特征

  • 双词汇亲和(Bilexical affinities),比如discussion与issues。
  • 词语间距(distance),一般相邻的词语才更可能存在依存关系。
  • 中间词语(intervening material),如果两个词之间隔着动词或标点,则它们不太可能有依存关系。
  • 词语配价(valency),即一个词语最多能带几个依赖者(一个简化的特征抽取示意见下)。
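
下面用一小段 Python 示意如何为一个候选依存对 (head, dependent) 抽取上述几类特征。特征模板、词性标记都是为说明随手定义的,不对应任何真实 parser 的实现;词语配价依赖于已生成的部分树,这里略去。

def pair_features(words, pos_tags, head_i, dep_i):
    """为候选依存对 (words[head_i], words[dep_i]) 抽取几个简化特征(仅为示意)。"""
    lo, hi = sorted((head_i, dep_i))
    between = pos_tags[lo + 1:hi]                       # 两词之间的词性序列
    return {
        "head_word": words[head_i],                     # 双词汇亲和:head 词
        "dep_word": words[dep_i],                       # 双词汇亲和:dependent 词
        "distance": abs(head_i - dep_i),                # 词语间距
        "verb_or_punct_between": any(t in ("V", "PUNCT") for t in between),   # 中间词语
    }

words = ["the", "discussion", "of", "the", "issues", "ended"]
tags  = ["D", "N", "P", "D", "N", "V"]
print(pair_features(words, tags, head_i=1, dep_i=4))    # discussion 与 issues 这一对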

约束

  • ROOT 只能有一个依赖者,即整个句子只有一个词直接依存于 ROOT。
  • 依存弧不能构成环(无环)。合法性检查的示意代码见下。
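
可以用一小段 Python 检查一组 (head, dependent) 依存弧是否满足这些约束(示意代码,为简单起见直接用词字符串做标识,真实实现应使用词的下标以区分重复出现的词):

def is_well_formed(words, dependencies):
    """检查:每个词恰好一个 head、只有一个词依存于 ROOT、从任何词沿 head 链都能走到 ROOT(即无环)。"""
    heads = {}
    for head, dep in dependencies:
        if dep in heads:                                    # 每个词只能有一个 head
            return False
        heads[dep] = head
    if sum(1 for h in heads.values() if h == "ROOT") != 1:  # ROOT 只能有一个依赖者
        return False
    for word in words:
        seen, node = set(), word
        while node != "ROOT":                               # 沿 head 链向上走
            if node in seen or node not in heads:           # 有环或缺少 head
                return False
            seen.add(node)
            node = heads[node]
    return True

print(is_well_formed(["I", "ate", "fish"],
                     [("ROOT", "ate"), ("ate", "I"), ("ate", "fish")]))   # True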

方法

Dynamic programming

大致思路:用动态规划求出以某个 head 为端点的子串所对应的最优子树,再组合出整句的句法树(如 Eisner (1996) 的 O(n³) 算法)。

Graph algorithms

最小生成树(MST)方法:把句子中的词(加上 ROOT)看作图的节点,用分类器为每条候选依存弧独立打分,然后在图上求最优生成树作为依存树,如 McDonald et al. (2005) 的 MSTParser。打分的示意见下。
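
下面的小例子只演示“给每条候选弧独立打分”这一步:打分函数是随手编的玩具函数,而且为每个词各自取最高分 head 的做法并不保证结果是一棵树;真正的图方法还要用 Chu-Liu/Edmonds 这类最大生成树(arborescence)算法做全局解码。

words = ["ROOT", "I", "ate", "fish"]
tags  = {"ROOT": "ROOT", "I": "PRON", "ate": "VERB", "fish": "NOUN"}

def arc_score(head, dep, head_i, dep_i):
    """玩具打分函数(随手定义,仅为示意)。"""
    s = -abs(head_i - dep_i)                   # 距离越近分越高
    if tags[head] == "VERB":
        s += 2                                 # 动词更可能做 head
    if head == "ROOT" and tags[dep] == "VERB":
        s += 3                                 # ROOT 倾向指向核心动词
    return s

parse = []
for dep_i, dep in enumerate(words):
    if dep == "ROOT":
        continue
    best_i = max((i for i in range(len(words)) if i != dep_i),
                 key=lambda i: arc_score(words[i], dep, i, dep_i))
    parse.append((words[best_i], dep))
print(parse)    # [('ate', 'I'), ('ROOT', 'ate'), ('ate', 'fish')]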

Constraint Satisfaction

在包含所有候选依存边的图上,不断删除违反硬性约束的边,直到剩下一棵合法的依存树(如 Karlsson (1990) 的 Constraint Grammar)。

“Transition-based parsing” or “deterministic dependency parsing”

greedy choice of attachments guided by machine learning classifiers,即由机器学习分类器引导、对依存关系做贪心选择。

主流方法,基于贪心决策动作拼装句法树。
(图:transition-based parsing 的形式化定义)

step 1: build up structure
step 2: train a classifier

已经有了 a treebank with parses of sentences,就可以根据 treebank 确定 which sequence of transitions would give the correct parse of a sentence,把它转化为对 transition 的分类问题(推导 gold 动作序列的过程见下面的 oracle 示意)。
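
下面是一个简化的 oracle 示意:给定 gold 依存树(用每个词的 head 下标表示,0 代表 ROOT),推导出能重建这棵树的 S / LA / RA 动作序列。假设树是投影的(projective);这只是为说明思路写的草稿,并非作业的参考实现。

def oracle_transitions(words, gold_heads):
    """gold_heads[i] 是第 i 个词(从 1 开始计数)的 head 下标,gold_heads[0] 只是占位。"""
    n = len(words)
    n_deps = [0] * (n + 1)                      # gold 树中每个词的依赖者个数
    for h in gold_heads[1:]:
        n_deps[h] += 1
    attached = [0] * (n + 1)                    # 解析过程中已经挂上的依赖者个数
    stack, buf, trans = [0], list(range(1, n + 1)), []
    while len(stack) > 1 or buf:
        if len(stack) >= 2:
            s1, s2 = stack[-1], stack[-2]
            if s2 != 0 and gold_heads[s2] == s1:                       # LA:s2 依存于 s1
                trans.append("LA")
                stack.pop(-2)
                attached[s1] += 1
                continue
            if gold_heads[s1] == s2 and attached[s1] == n_deps[s1]:    # RA:s1 依存于 s2,且 s1 的依赖者已收齐
                trans.append("RA")
                stack.pop()
                attached[s2] += 1
                continue
        trans.append("S")                                              # 否则 SHIFT
        stack.append(buf.pop(0))
    return trans

# "I ate fish":ate(2) 依存于 ROOT(0),I(1) 和 fish(3) 依存于 ate
print(oracle_transitions(["I", "ate", "fish"], [0, 2, 0, 2]))
# ['S', 'S', 'LA', 'S', 'RA', 'RA']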

step 3: evaluation

评估指标:

Unlabeled attachment score(UAS):只统计每个词的 head 是否预测正确。

Labeled attachment score(LAS):要求 head 和依存关系标签(label)都预测正确。
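
一个最小的 UAS / LAS 计算示意(假设 gold 与预测结果都按词的位置对齐、以 (head, label) 二元组列表给出;非课程代码):

def attachment_scores(gold, pred):
    """gold / pred:每个词一个 (head, label) 二元组,按词位置对齐。"""
    assert len(gold) == len(pred)
    uas_hits = sum(1 for (gh, _), (ph, _) in zip(gold, pred) if gh == ph)   # 只比较 head
    las_hits = sum(1 for g, p in zip(gold, pred) if g == p)                 # head 和 label 都要对
    n = float(len(gold))
    return uas_hits / n, las_hits / n

gold = [(2, "nsubj"), (0, "root"), (2, "obj")]
pred = [(2, "nsubj"), (0, "root"), (2, "nmod")]      # head 全对,最后一个 label 错了
print(attachment_scores(gold, pred))                 # (1.0, 0.666...)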

简单实现

import sys
import cPickle
import os
import time
import tensorflow as tf
import numpy as np
from model import Model
from utils.parser_utils import minibatches, load_and_preprocess_data

def xavier_weight_init():
    """Returns function that creates random tensor.

    The specified function will take in a shape (tuple or 1-d array) and
    returns a random tensor of the specified shape drawn from the
    Xavier initialization distribution.

    Hint: You might find tf.random_uniform useful.
    """
    def _xavier_initializer(shape, **kwargs):
        """Defines an initializer for the Xavier distribution.
        Specifically, the output should be sampled uniformly from [-epsilon, epsilon] where
            epsilon = sqrt(6) / <sum of the sizes of shape's dimensions>
        e.g., if shape = (2, 3), epsilon = sqrt(6 / (2 + 3))

        This function will be used as a variable initializer.

        Args:
            shape: Tuple or 1-d array that specifies the dimensions of the requested tensor.
        Returns:
            out: tf.Tensor of specified shape sampled from the Xavier distribution.
        """
        # 用 6.0 而不是 6,避免 Python 2 下整数除法把 epsilon 算成 0
        epsilon = np.sqrt(6.0 / np.sum(shape))
        out_t = tf.Variable(tf.random_uniform(shape=shape, minval=-epsilon, maxval=epsilon))
        return out_t
    return _xavier_initializer

class Config(object):
    """Holds model hyperparams and data information.

    The config class is used to store various hyperparameters and dataset
    information parameters. Model objects are passed a Config() object at
    instantiation. They can then call self.config.<hyperparameter_name> to
    get the hyperparameter settings.
    """
    n_features = 36
    n_classes = 3
    dropout = 0.5  # (p_drop in the handout)
    embed_size = 50
    hidden_size = 200
    batch_size = 1024
    n_epochs = 10
    lr = 0.0005

class ParserModel(Model):
    """
    Implements a feedforward neural network with an embedding layer and single hidden layer.
    This network will predict which transition should be applied to a given partial parse
    configuration.
    """

    def add_placeholders(self):
        """Generates placeholder variables to represent the input tensors"""
        self.input_placeholder = tf.placeholder(tf.int32, [None, self.config.n_features])
        self.labels_placeholder = tf.placeholder(tf.float32, [None, self.config.n_classes])
        self.dropout_placeholder = tf.placeholder(tf.float32)

    def create_feed_dict(self, inputs_batch, labels_batch=None, dropout=0):
        """Creates the feed_dict for the dependency parser.

        A feed_dict takes the form of:

        feed_dict = {
            <placeholder>: <tensor of values to be passed for placeholder>,
            ....
        }

        Hint: The keys for the feed_dict should be a subset of the placeholder tensors created in add_placeholders.
        Hint: When an argument is None, don't add it to the feed_dict.

        Args:
            inputs_batch: A batch of input data.
            labels_batch: A batch of label data.
            dropout: The dropout rate.
        Returns:
            feed_dict: The feed dictionary mapping from placeholders to values.
        """
        feed_dict = {self.input_placeholder: inputs_batch,
                     self.dropout_placeholder: dropout}
        if labels_batch is not None:   # 预测时没有标签,不放进 feed_dict
            feed_dict[self.labels_placeholder] = labels_batch
        return feed_dict

    def add_embedding(self):
        """Adds an embedding layer that maps from input tokens (integers) to vectors and then
        concatenates those vectors:
        - Creates a tf.Variable and initializes it with self.pretrained_embeddings.
        - Uses the input_placeholder to index into the embeddings tensor, resulting in a
          tensor of shape (None, n_features, embedding_size).
        - Concatenates the embeddings by reshaping the embeddings tensor to shape
          (None, n_features * embedding_size).

        Hint: You might find tf.nn.embedding_lookup useful.
        Hint: You can use tf.reshape to concatenate the vectors.

        Returns:
            embeddings: tf.Tensor of shape (None, n_features*embed_size)
        """
        embedding = tf.Variable(self.pretrained_embeddings)
        embeddings = tf.nn.embedding_lookup(embedding, self.input_placeholder)
        embeddings = tf.reshape(embeddings, [-1, self.config.n_features * self.config.embed_size])
        return embeddings

    def add_prediction_op(self):
        """Adds the 1-hidden-layer NN:
            h = Relu(xW + b1)
            h_drop = Dropout(h, dropout_rate)
            pred = h_drop U + b2

        Note that we are not applying a softmax to pred. The softmax will instead be done in
        the add_loss_op function, which improves efficiency because we can use
        tf.nn.softmax_cross_entropy_with_logits

        Use the initializer from q2_initialization.py to initialize W and U (you can initialize b1
        and b2 with zeros)

        Hint: Note that tf.nn.dropout takes the keep probability (1 - p_drop) as an argument.
              Therefore the keep probability should be set to the value of
              (1 - self.dropout_placeholder)

        Returns:
            pred: tf.Tensor of shape (batch_size, n_classes)
        """
        x = self.add_embedding()
        xavier = xavier_weight_init()
        with tf.variable_scope("transformation"):
            b1 = tf.Variable(tf.random_uniform([self.config.hidden_size]))
            b2 = tf.Variable(tf.random_uniform([self.config.n_classes]))

            W = xavier([self.config.n_features * self.config.embed_size, self.config.hidden_size])
            U = xavier([self.config.hidden_size, self.config.n_classes])

            z1 = tf.matmul(x, W) + b1
            h = tf.nn.relu(z1)
            h_dropout = tf.nn.dropout(h, keep_prob=(1 - self.dropout_placeholder))
            pred = tf.matmul(h_dropout, U) + b2
        return pred

    def add_loss_op(self, pred):
        """Adds Ops for the loss function to the computational graph.
        In this case we are using cross entropy loss.
        The loss should be averaged over all examples in the current minibatch.

        Hint: You can use tf.nn.softmax_cross_entropy_with_logits to simplify your
              implementation. You might find tf.reduce_mean useful.
        Args:
            pred: A tensor of shape (batch_size, n_classes) containing the output of the neural
                  network before the softmax layer.
        Returns:
            loss: A 0-d tensor (scalar)
        """
        loss = tf.nn.softmax_cross_entropy_with_logits(logits=pred, labels=self.labels_placeholder)
        loss = tf.reduce_mean(loss)
        return loss

    def add_training_op(self, loss):
        adam_optim = tf.train.AdamOptimizer(self.config.lr)
        train_op = adam_optim.minimize(loss)
        return train_op

    def train_on_batch(self, sess, inputs_batch, labels_batch):
        feed = self.create_feed_dict(inputs_batch, labels_batch=labels_batch,
                                     dropout=self.config.dropout)
        _, loss = sess.run([self.train_op, self.loss], feed_dict=feed)
        return loss

    def run_epoch(self, sess, parser, train_examples, dev_set):
        n_minibatches = 1 + len(train_examples) / self.config.batch_size
        prog = tf.keras.utils.Progbar(target=n_minibatches)
        for i, (train_x, train_y) in enumerate(minibatches(train_examples, self.config.batch_size)):
            loss = self.train_on_batch(sess, train_x, train_y)
            prog.update(i + 1, [("train loss", loss)])

        print "Evaluating on dev set",
        dev_UAS, _ = parser.parse(dev_set)
        print "- dev UAS: {:.2f}".format(dev_UAS * 100.0)
        return dev_UAS

    def fit(self, sess, saver, parser, train_examples, dev_set):
        best_dev_UAS = 0
        for epoch in range(self.config.n_epochs):
            print "Epoch {:} out of {:}".format(epoch + 1, self.config.n_epochs)
            dev_UAS = self.run_epoch(sess, parser, train_examples, dev_set)
            if dev_UAS > best_dev_UAS:
                best_dev_UAS = dev_UAS
                if saver:
                    print "New best dev UAS! Saving model in ./data/weights/parser.weights"
                    saver.save(sess, './data/weights/parser.weights')
            print

    def __init__(self, config, pretrained_embeddings):
        self.pretrained_embeddings = pretrained_embeddings
        self.config = config
        self.build()

def main(debug=True):
    print 80 * "="
    print "INITIALIZING"
    print 80 * "="
    config = Config()
    parser, embeddings, train_examples, dev_set, test_set = load_and_preprocess_data(debug)
    if not os.path.exists('./data/weights/'):
        os.makedirs('./data/weights/')

    with tf.Graph().as_default() as graph:
        print "Building model...",
        start = time.time()
        model = ParserModel(config, embeddings)
        parser.model = model
        init_op = tf.global_variables_initializer()
        saver = None if debug else tf.train.Saver()
        print "took {:.2f} seconds\n".format(time.time() - start)
    graph.finalize()

    with tf.Session(graph=graph) as session:
        parser.session = session
        session.run(init_op)

        print 80 * "="
        print "TRAINING"
        print 80 * "="
        model.fit(session, saver, parser, train_examples, dev_set)

        if not debug:
            print 80 * "="
            print "TESTING"
            print 80 * "="
            print "Restoring the best model weights found on the dev set"
            saver.restore(session, './data/weights/parser.weights')
            print "Final evaluation on test set",
            UAS, dependencies = parser.parse(test_set)
            print "- test UAS: {:.2f}".format(UAS * 100.0)
            print "Writing predictions"
            with open('q2_test.predicted.pkl', 'w') as f:
                cPickle.dump(dependencies, f, -1)
            print "Done!"

class PartialParse(object):
    def __init__(self, sentence):
        """Initializes this partial parse.

        Your code should initialize the following fields:
            self.stack: The current stack represented as a list with the top of the stack as the
                        last element of the list.
            self.buffer: The current buffer represented as a list with the first item on the
                         buffer as the first item of the list
            self.dependencies: The list of dependencies produced so far. Represented as a list of
                               tuples where each tuple is of the form (head, dependent).
                               Order for this list doesn't matter.

        The root token should be represented with the string "ROOT"

        Args:
            sentence: The sentence to be parsed as a list of words.
                      Your code should not modify the sentence.
        """
        # The sentence being parsed is kept for bookkeeping purposes. Do not use it in your code.
        self.sentence = sentence

        self.stack = ['ROOT']
        self.buffer = sentence[:]
        self.dependencies = []

    def parse_step(self, transition):
        """Performs a single parse step by applying the given transition to this partial parse

        Args:
            transition: A string that equals "S", "LA", or "RA" representing the shift, left-arc,
                        and right-arc transitions. You can assume the provided transition is a legal
                        transition.
        """
        if transition == 'S':
            # SHIFT:把 buffer 的第一个词移到 stack 顶
            self.stack.append(self.buffer[0])
            self.buffer.pop(0)
        elif transition == 'LA':
            # LEFT-ARC:次栈顶成为栈顶的依赖者,并从 stack 中移除
            self.dependencies.append((self.stack[-1], self.stack[-2]))
            self.stack.pop(-2)
        else:
            # RIGHT-ARC:栈顶成为次栈顶的依赖者,并从 stack 中移除
            self.dependencies.append((self.stack[-2], self.stack[-1]))
            self.stack.pop(-1)

    def parse(self, transitions):
        """Applies the provided transitions to this PartialParse

        Args:
            transitions: The list of transitions in the order they should be applied
        Returns:
            dependencies: The list of dependencies produced when parsing the sentence. Represented
                          as a list of tuples where each tuple is of the form (head, dependent)
        """
        for transition in transitions:
            self.parse_step(transition)
        return self.dependencies

def test_step(name, transition, stack, buf, deps,
              ex_stack, ex_buf, ex_deps):
    """Tests that a single parse step returns the expected output"""
    pp = PartialParse([])
    pp.stack, pp.buffer, pp.dependencies = stack, buf, deps

    pp.parse_step(transition)
    stack, buf, deps = (tuple(pp.stack), tuple(pp.buffer), tuple(sorted(pp.dependencies)))
    assert stack == ex_stack, \
        "{:} test resulted in stack {:}, expected {:}".format(name, stack, ex_stack)
    assert buf == ex_buf, \
        "{:} test resulted in buffer {:}, expected {:}".format(name, buf, ex_buf)
    assert deps == ex_deps, \
        "{:} test resulted in dependency list {:}, expected {:}".format(name, deps, ex_deps)
    print "{:} test passed!".format(name)

调用函数试一下:

(图:测试运行结果)

Our network will predict which transition should be applied next to a partial parse. We could use it to parse a single sentence by applying predicted transitions until the parse is complete. However, neural networks run much more efficiently when making predictions about batches of data at a time (i.e., predicting the next transition for many different partial parses simultaneously). We can parse sentences in minibatches with the following algorithm.

也就是说,一个 parser 一次只处理一个 sentence;要批量处理 sentences,就需要下面的 minibatch parsing 算法。

(图:minibatch parsing 算法)

写个程序试一下

import copy

def minibatch_parse(sentences, model, batch_size):
    """Parses a list of sentences in minibatches using a model.

    Args:
        sentences: A list of sentences to be parsed (each sentence is a list of words)
        model: The model that makes parsing decisions. It is assumed to have a function
               model.predict(partial_parses) that takes in a list of PartialParses as input and
               returns a list of transitions predicted for each parse. That is, after calling
                   transitions = model.predict(partial_parses)
               transitions[i] will be the next transition to apply to partial_parses[i].
        batch_size: The number of PartialParses to include in each minibatch
    Returns:
        dependencies: A list where each element is the dependencies list for a parsed sentence.
                      Ordering should be the same as in sentences (i.e., dependencies[i] should
                      contain the parse for sentences[i]).
    """
    partial_parses = [PartialParse(s) for s in sentences]

    unfinished_parse = copy.copy(partial_parses)

    while len(unfinished_parse) > 0:
        minibatch = unfinished_parse[0:batch_size]
        # perform transition and single step parser on the minibatch until it is empty
        while len(minibatch) > 0:
            transitions = model.predict(minibatch)
            for index, action in enumerate(transitions):
                minibatch[index].parse_step(action)
            minibatch = [parse for parse in minibatch if len(parse.stack) > 1 or len(parse.buffer) > 0]

        # move to the next batch
        unfinished_parse = unfinished_parse[batch_size:]

    dependencies = []
    for n in range(len(sentences)):
        dependencies.append(partial_parses[n].dependencies)

    return dependencies

class DummyModel(object):
    """Dummy model for testing the minibatch_parse function
    First shifts everything onto the stack and then does exclusively right arcs if the first word of
    the sentence is "right", "left" if otherwise.
    """
    def predict(self, partial_parses):
        # 字符串比较用 == 而不是 is(is 比较的是对象身份,结果不可靠)
        return [("RA" if pp.stack[1] == "right" else "LA") if len(pp.buffer) == 0 else "S"
                for pp in partial_parses]

def test_dependencies(name, deps, ex_deps):
    """Tests the provided dependencies match the expected dependencies"""
    deps = tuple(sorted(deps))
    assert deps == ex_deps, \
        "{:} test resulted in dependency list {:}, expected {:}".format(name, deps, ex_deps)

(图:minibatch_parse 的测试运行结果)