Paper Reading on Neural Machine Translation

2019.12.16

1. Sequence to Sequence Learning with Neural Networks

Sutskever, I., Vinyals, O., & Le, Q. V. (2014). Sequence to sequence learning with neural networks. Advances in Neural Information Processing Systems, 4(January), 3104–3112.

1.1 Introduction

Deep Neural Networks (DNNs) are powerful models that have achieved excellent performance on difficult learning tasks. Although DNNs work well whenever large labeled training sets are available, they cannot be used to map sequences to sequences. In this paper, we present a general end-to-end approach to sequence learning that makes minimal assumptions on the sequence structure. Our method uses a multilayered Long Short-Term Memory (LSTM) to map the input sequence to a vector of a fixed dimensionality, and then another deep LSTM to decode the target sequence from the vector. Our main result is that on an English to French translation task from the WMT-14 dataset, the translations produced by the LSTM achieve a BLEU score of 34.8 on the entire test set, where the LSTM’s BLEU score was penalized on out-of-vocabulary words. Additionally, the LSTM did not have difficulty on long sentences. For comparison, a phrase-based SMT system achieves a BLEU score of 33.3 on the same dataset. When we used the LSTM to rerank the 1000 hypotheses produced by the aforementioned SMT system, its BLEU score increases to 36.5, which is close to the previous state of the art. The LSTM also learned sensible phrase and sentence representations that are sensitive to word order and are relatively invariant to the active and the passive voice. Finally, we found that reversing the order of the words in all source sentences (but not target sentences) improved the LSTM’s performance markedly, because doing so introduced many short term dependencies between the source and the target sentence which made the optimization problem easier.

深度神经网络很强大，在困难的学习任务上有很好的表现。尽管DNN在有大型标注训练集的情况下表现良好，但是却不能被用于序列到序列的映射中。本文，本文提出一种通用的端到端的序列学习方法，这种方法对序列的结构只有最小的假设。这种方法使用多层的长短时记忆，将输入序列映射到固定维度的向量上，之后另一个深度LSTM从这个向量中解码到目标序列中。在英译法任务的WMT-14数据集上，BLEU分数为34.8，其中LSTM的BLEU得分对词汇量以外的单词进行了惩罚。在长句中效果也很好。基于语句的SMT系统的分数为33.3。如果先使用SMT再用LSTM，分数为36.5，接近sota。LSTM学习结果对词序敏感，而对主被动则相对不变。最后，如果反转源句子的词序，可以显著提升性能，因为会在源与目标间引入短期依赖关系，利于优化。

DNN只能应用到输入与目标可以被编码为固定维数的向量的问题中。这个限制很重，因为许多序列问题长度未知。语音识别和机器翻译就是序列问题。问题回答也可以看作是将问题词序列映射为答案词序列。因此研究域无关的序列映射很有用。

序列问题对DNN而言是一个挑战。本文中使用LSTM结构解决了通用的序列映射问题。用一个LSTM读入输入序列，获得一个大规模固定长度的向量表示，之后使用LSTM从向量中提取输出序列。第二个LSTM本质上是一个RNN语言模型，其以输入序列为条件。LSTM在具有长时间滞后依赖的数据上很成功。本文还提出将源语句的词序颠倒，这样就引入了很多短期依赖，利于优化。

LSTM 将变长输入序列映射为固定长度向量表示。鉴于翻译是源语句的含义，因此翻译目标鼓励LSTM发现包含含义的句子表示。相似含义的句子距离较近，反之则比较远。一个量化评估支持了这种观点，表明此模型知道词序，而对主被动保持不变。

1.2 Model

RNN：输入为序列 $(x_1,\cdots,x_T)$ ，输出为序列 $(y_1,\cdots,y_T)$ ，其中

\begin{aligned} h_t &= \mathrm{sigm}(W^{hx}x_t + W^{hh}h_{t-1})\\ y_t &= W^{yh}h_t \end{aligned}

RNN 可以将序列映射到序列，但却无法应用于输入和输出序列的长度不同且具有复杂和非单调关系的问题。一个简单的策略是使用一个RNN将输入序列映射到定长向量，再用另一个RNN将该向量映射为目标序列。理论上因为RNN被提供了所有的相关信息，因此上述策略是可行的，但是因为产生了长期依赖关系，RNN很难训练。而LSTM可以学习到具有长期时间依赖性的问题，因此在此设置下可能会成功。

LSTM 计算 $p(y_1,\cdots,y_{T'}|x_1,\cdots,x_T)$ ，首先以最后一个隐藏层 $v$ 作为定长向量表示，之后使用标准LSTM-LM公式计算 $y_1,\cdots,y_{T'}$ 的概率，其初始隐状态设置为 $v$ ：

p(y_1,\cdots,y_{T'}|x_1,\cdots,x_T) = \prod_{t=1}^{T'} p(y_t|v, y_1,\cdots,y_{t-1})

式中每个 $p(y_t|v,y_1,\cdots,y_{t-1})$ 分布都由一个词表上的softmax表示。此外还要求序列以"<EOS>"结尾，这样就可以定义任意长序列上的分布。针对输入与输出序列，本文使用了两个不同的LSTM，这样在可忽略的计算成本下增加了模型参数，而且可以自然应用到多种语言对上。此外，深度LSTM优于普通的LSTM，因此使用了四层的LSTM。最后，颠倒输入序列词序很有用，加速了SGD的优化，提升了LSTM的性能。

1.3 Experiments

1.3.1

WMT14 English to French dataset. Train: 12M sentences subset, 348M French words & 304M English words. Using 160,000 most frequent words for source language and 80,000 for target language. Out-of-vocabulary 'UNK'.

训练目标：

\frac{1}{|\mathcal{S}|} \sum_{(T,S)\in \mathcal{S}} \log p(T|S)

完成训练后的翻译：

\hat{T} = \argmax_T p(T|S)

通过从左到右的束搜索找到最大似然概率对应的翻译。

把源语句的顺序颠倒可以提高效果，Bleu 分数由25.9增加到30.6。这可能是因为最小时间延迟被减小了，反向传播就可以更快地在源与目标间建立联系。

1.3.2

4层LSTM，每层1000个cell，词embedding为1000维，输入词表160,000，输出词表80,000，输出使用softmax。

2. Neural Machine Translation by Jointly Learning to Align and Translate

2.1 Introduction

Neural machine translation is a recently proposed approach to machine translation. Unlike the traditional statistical machine translation, the neural machine translation aims at building a single neural network that can be jointly tuned to maximize the translation performance. The models proposed recently for neural machine translation often belong to a family of encoder–decoders and encode a source sentence into a fixed-length vector from which a decoder generates a translation. In this paper, we conjecture that the use of a fixed-length vector is a bottleneck in improving the performance of this basic encoder–decoder architecture, and propose to extend this by allowing a model to automatically (soft-)search for parts of a source sentence that are relevant to predicting a target word, without having to form these parts as a hard segment explicitly. With this new approach, we achieve a translation performance comparable to the existing state-of-the-art phrase-based system on the task of English-to-French translation. Furthermore, qualitative analysis reveals that the (soft-)alignments found by the model agree well with our intuition.

神经机器翻译是最近提出的用于机器翻译的方法。不像传统统计机器翻译，神经机器翻译致力于建造单一的神经网络，可以被联合调整来最大化翻译效果。最近提出的神经机器翻译方法基本属于编码器-解码器家族，将源句子编码成一个定长的向量，解码器再从这个向量中生成翻译结果。本文我们推测定长向量的使用是提升基本的编解码模型效果的瓶颈，并提出了通过允许模型自动软搜索源句子与目标词的预测相关的部分，而不用将这部分形成硬段来进行扩展。使用这种方法，我们在英-法翻译上达到了与现存sota基于短语的方法可比较的结果。此外，定性的分析揭示了模型发现的软分段与我们的直觉相符。

原有方法将源句子编码为定长向量，需要将所有的必要信息压缩到该向量内，对长句的效果可能不好。为了解决这个问题，本文提出一个原有编解码器的扩展，学习联合进行对齐和翻译。每次，该模型生成翻译中的一个词，它会搜索一组源句子中最相关信息的的位置。模型基于与源位置相关上下文向量以及已生成的目标词来预测新的目标词。此模型不试图将整个输入句编码为定长向量，而是将其编码为一系列定长向量，并选择其中的子集来自适应地解码得到翻译。这使得模型不必将源句子中的所有信息挤压到定长向量中，从而可以更好地处理长句。本文展现了提出的联合对齐与翻译的模型显著提升了性能，且在长句中更明显。

2.2 Background

概率角度的翻译： $\argmax_{\boldsymbol{y}} p(\boldsymbol{y}|\boldsymbol{x})$ 。

现有方法：编解码，RNN，效果很好。

2.2.1 RNN Encoder-Decoder

编码：

\begin{aligned} h_t&=f(x_t, h_{t-1})\\ c&=q(\{h_1, \cdots, h_{T_x}\}) \end{aligned}

解码：

\begin{aligned} p(\boldsymbol{y}) = \prod_{t=1}^Tp(y_t|\{y_1,\cdots,y_{t-1}\},c) \\ p(y_t|\{y_1,\cdots,y_{t-1}\},c) = g(y_{t-1},s_t,c) \end{aligned}

2.3 对齐、翻译

编码器：双向RNN；解码器：翻译时模拟在源句子中搜索

解码器

p(y_i|y_1,\cdots,y_{i-1},\boldsymbol{x}) = g(y_{i-1}, s_i, c_i)

$s_i$ 是RNN在时刻 $i$ 的隐状态， $s_i=f(s_{i-1},y_{i-1},c_i)$ 。与之前相同的 $c$ 相比，这里每个 $y_i$ 的概率与一个独有的上下文向量 $c_i$ 相关。 $c_i$ 取决于由编码器映射输入序列得到的 $(h_1,\cdots,h_{T_x})$ ，每个标注 $h_i$ 包含全部输入序列的信息，并主要关注第 $i$ 个词。

c_i = \sum_{j=1}^{T_x}\alpha_{ij}h_j

\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x}\exp(e_{ik})}

其中 $e_{ij}=a(s_{i-1},h_j)$ ，是对齐模型，表明输入 $j$ 位置与输出 $i$ 位置的匹配成都，基于RNN的隐状态 $s_{i-1}$ 与 $h_j$ 。将对齐模型 $a$ 参数化为前馈神经网络，与其他组件联合训练。对齐并不被认为是隐向量，相反，其直接计算软对齐，损失函数的梯度反向传播。梯度可以被用来联合训练对齐模型与翻译模型。 $\alpha_{ij}$ 可以看作是目标词 $y_i$ 与 $x_j$ 对齐，或者说翻译自 $x_j$ 的概率。第 $i$ 个上下文向量 $c_i$ 是概率 $\alpha_{ij}$ 下的加权期望标注。

$\alpha_{ij}$ 或 $e_{ij}$ 反映标注 $h_j$ 关于前一个隐状态 $s_{i-1}$ 在决定下一个状态 $s_i$ 与生成 $y_i$ 的重要性。直觉上，这是一种解码器中注意力的实现机制。通过注意力机制，减轻了编码器必须将源句子中所有信息编码进定长向量的压力。利用这种新方法，信息可以散布在标注的整个序列中，可以相应地由解码器有选择地检索。

编码器

使用双向RNN，BiRNN。BiRNN 由前传和反传RNN组成。前传RNN顺序读入输入序列，计算前传隐状态；反传RNN逆序读入输入序列，计算反传隐状态。最终的隐状态 $h_j$ 由前传和反传隐状态连接得到，这样就可以包含前后词的信息了，同时又主要关注当前的输入 $x_j$ 。这里的隐状态用于计算解码器中的上下文向量。

2.4 实验

在英-法翻译任务上进行了验证。使用ACL WMT14 提供的双语并行语料库。细节见原文。

训了两种模型。第一种是RNN编解码器，第二种则是本文提出的模型。细节见原文。

2.5 结果

RNNsearch 好于 RNNencdec；当只有已知词时，RNNsearch 好于基于短语的 Moses。RNNsearch-50 在长句上表现很好。

对齐的表现很好。如 the man 翻译为 l'homme，翻译 the 为 l' 时，the 和 man 的权重都很大。考虑到只有知道名词的性时才能确定 the 翻译为 le、la、les 还是 l'，这很符合直觉。

3. Effective Approaches to Attention-based Neural Machine Translation

3.1 Introduction

An attentional mechanism has lately been used to improve neural machine translation (NMT) by selectively focusing on parts of the source sentence during translation. However, there has been little work exploring useful architectures for attention-based NMT. This paper examines two simple and effective classes of attentional mechanism: a global approach which always attends to all source words and a local one that only looks at a subset of source words at a time. We demonstrate the effectiveness of both approaches on the WMT translation tasks between English and German in both directions. With local attention, we achieve a significant gain of 5.0 BLEU points over non-attentional systems that already incorporate known techniques such as dropout. Our ensemble model using different attention architectures yields a new state-of-the-art result in the WMT’15 English to German translation task with 25.9 BLEU points, an improvement of 1.0 BLEU points over the existing best system backed by NMT and an n-gram reranker.

通过选择性地在翻译中关注源句子的部分，注意力机制最近被用来提升神经机器翻译的。然后，很少有工作探究基于注意力的NMT的结构。本文考察了两类简单有效的注意力机制：注意所有源词的Global方法与一次只关注源词的一个子集的Local方法。我们展现了两种方法在WMT英-德翻译任务中的有效性。本地注意力可以比使用已知dropout等已知技术的非注意力系统获得5点的bleu增益。我们使用不同注意力结构的集成模型，得到了sota的结果，bleu分数25.9，比之前的基于NMT和n-gram的系统高了1个点。

Global 方法与 Bahdanau et al., 2015 的方法类似，结构更简单；local 方法可以看作硬注意力与软注意力的结合：计算量少于global或者说软注意力，而又不像硬注意力，几乎处处可微，便于实现与训练。同时考察了各种对齐函数。实验中在WMT14与WMT15 均为sota。分析了学习、长句、注意力选择、对齐质量、翻译输出。

3.2 算法

NMT

NMT 结构一般使用 RNN，在解码器使用什么RNN类别即编码器如何计算源句子表示时有区别。本文使用堆叠LSTM结构，训练目标为：

J_t = \sum_{(x,y)\in \mathbb{D}} -\log p(y|x)

Attention

\begin{aligned} \tilde{h}_t &= \tanh (W_c[c_t;h_t])\\ p(y_t|Y_{<t>}, x) &= \mathrm{softmax}(W_s\tilde{h}_t) \end{aligned}

Global

变长对齐向量：

\begin{aligned} a_t(s) &= \mathrm{align}(h_t,\tilde{h}_s) \\ &=\frac{\exp(\mathrm{score}(h_t,\tilde{h}_s))}{\sum_{s'}\exp(\mathrm{score}(h_t,\bar{h}_{s'}))} \end{aligned}

\mathrm{score}(h_t,\bar{h}_s) = \begin{cases} h_t^{\mathrm{T}}\bar{h}_s & dot\\ h_t^{\mathrm{T}}W_a\bar{h}_s & general\\ v_a^{\mathrm{T}}\tanh(W_a[h_t;\bar{h}_s]) & concat \end{cases}

基于位置的：

a_t = \mathrm{softmax}(W_ah_t)

与 Bahdanau et al., 2015 相比，本文的方法简化并泛化了。其一，只使用了LSTM顶层的隐状态，在双向编码器中使用前向与反向源隐状态的级联，在非堆叠单向解码器中使用目标隐状态的级联。其二，计算路径更简单。最后，前文只实验了一种对齐函数，而其他选择更好。

Local

全局注意力对每个目标词，要考虑源句子中的所有词，这比较费计算量，还可能使翻译段落、文档这样的长序列变得不切实际。为了解决这个问题，我们提出了本地注意力机制，对每个目标词，只关注源句子的一个子集。

灵感来自Xu et al. 2015 中软注意力与硬注意力的tradeoff。该文中软注意力关注源图像的所有批次，而硬注意力只关注一个批次。推断更快，但不可微，需要变分或强化学习来训练。

本文中local机制选择关注上下文的一个小时间窗口，可微分，可以避免软注意力的昂贵计算，更易训练。首先，模型先生成时间 $t$ 的对齐位置 $p_t$ ，上下文向量 $c_t$ 由时间窗 $[p_t-D,p_t+D]$ 内的源隐状态集合的加权平均值， $D$ 由经验选择。global机制中 $a_t$ 为变长，而 $a_t$ 为定长， $a_t\in \mathbb{R}^{2D+1}$ 。两种变体：

单调对齐： $p_t=t$
预测对齐： $p_t=S \cdot \mathrm{sigmoid}(v_p^\mathrm{T}\tanh(W_ph_t))$
- $W_p,\,v_p$ 为可学习参数， $S$ 是源序列长度。
- $\displaystyle{a_t(s) = \mathrm{align}(h_t,\bar{h}_s)\exp\Big(-\frac{(s-p_t)^2}{2\sigma^2}\Big)}$
- 加入高斯分布，支持 $p_t$ 附近的对齐点
- $\sigma=\frac{D}{2}$

4. Attention is All You Need

4.1 Introduction

The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.0 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature.

占优势地位的序列转导模型基于复杂的递归或卷积神经网络，包括编码器和解码器。表现最佳的模型还通过注意力机制连接编码器和解码器。我们提出了一种新的简单网络架构，即Transformer，它完全基于注意力机制，完全消除了循环和卷积。在两个机器翻译任务上进行的实验表明，这些模型在质量上具有优势，同时具有更高的可并行性，并且所需的训练时间明显更少。我们的模型在2014年WMT英德翻译任务中达到28.4 BLEU，比包括集成在内的现有最佳结果提高了2 BLEU。在2014年WMT英语到法语翻译任务中，我们的模型在8个GPU上进行了3.5天的训练后，建立了新的单模型最新BLEU分数41.0，这仅是文献中的最佳模型培训成本的一小部分。