1. Sequence to Sequence Learning with Neural Networks

Sutskever, I., Vinyals, O., & Le, Q. V. (2014). Sequence to sequence learning with neural networks. Advances in Neural Information Processing Systems, 4(January), 3104–3112.

1.1 Introduction

Deep Neural Networks (DNNs) are powerful models that have achieved excellent performance on difficult learning tasks. Although DNNs work well whenever large labeled training sets are available, they cannot be used to map sequences to sequences. In this paper, we present a general end-to-end approach to sequence learning that makes minimal assumptions on the sequence structure. Our method uses a multilayered Long Short-Term Memory (LSTM) to map the input sequence to a vector of a fixed dimensionality, and then another deep LSTM to decode the target sequence from the vector. Our main result is that on an English to French translation task from the WMT-14 dataset, the translations produced by the LSTM achieve a BLEU score of 34.8 on the entire test set, where the LSTM’s BLEU score was penalized on out-of-vocabulary words. Additionally, the LSTM did not have difficulty on long sentences. For comparison, a phrase-based SMT system achieves a BLEU score of 33.3 on the same dataset. When we used the LSTM to rerank the 1000 hypotheses produced by the aforementioned SMT system, its BLEU score increases to 36.5, which is close to the previous state of the art. The LSTM also learned sensible phrase and sentence representations that are sensitive to word order and are relatively invariant to the active and the passive voice. Finally, we found that reversing the order of the words in all source sentences (but not target sentences) improved the LSTM’s performance markedly, because doing so introduced many short term dependencies between the source and the target sentence which made the optimization problem easier.




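The LSTM maps a variable-length input sequence to a fixed-dimensional vector representation. Because translations tend to be paraphrases of the source sentences, the translation objective encourages the LSTM to find sentence representations that capture their meaning: sentences with similar meanings lie close together, and sentences with different meanings lie far apart. A qualitative evaluation supports this claim, showing that the model is aware of word order and is fairly invariant to the active and passive voice.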

1.2 Model


$$ \begin{aligned} h_t &= \mathrm{sigm}(W^{hx}x_t + W^{hh}h_{t-1})\\ y_t &= W^{yh}h_t \end{aligned} $$
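As a rough illustration of the recurrence above, the following sketch runs a plain RNN over a short sequence in pure Python. The helper names (`sigm`, `matvec`, `rnn_forward`) and all weight values are hypothetical and chosen only to make the example runnable; they do not come from the paper.

```python
import math

def sigm(z):
    """Logistic sigmoid of one scalar."""
    return 1.0 / (1.0 + math.exp(-z))

def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def rnn_forward(xs, W_hx, W_hh, W_yh, h0):
    """Apply h_t = sigm(W_hx x_t + W_hh h_{t-1}) and y_t = W_yh h_t over xs."""
    h, ys = h0, []
    for x in xs:
        pre = [a + b for a, b in zip(matvec(W_hx, x), matvec(W_hh, h))]
        h = [sigm(z) for z in pre]
        ys.append(matvec(W_yh, h))
    return ys, h

# Toy sizes (hypothetical): 2-dim inputs, 2-dim hidden state, 1-dim output.
W_hx = [[0.5, -0.3], [0.1, 0.8]]
W_hh = [[0.2, 0.0], [0.0, 0.2]]
W_yh = [[1.0, -1.0]]
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ys, h_last = rnn_forward(xs, W_hx, W_hh, W_yh, h0=[0.0, 0.0])
```

The output sequence has exactly one element per input element, which is why this basic form cannot by itself handle inputs and outputs of different lengths.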

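An RNN can map sequences to sequences, but it cannot readily be applied to problems where the input and output sequences differ in length and are related in complicated, non-monotonic ways. A simple strategy is to use one RNN to map the input sequence to a fixed-size vector and a second RNN to map that vector to the target sequence. In principle this works, because the RNN is given all the relevant information, but the resulting long-term dependencies make the RNN hard to train. The LSTM is known to learn problems with long-range temporal dependencies, so it may succeed in this setting.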

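The LSTM estimates the conditional probability $p(y_1,\cdots,y_{T'}|x_1,\cdots,x_T)$. It first takes the last hidden state $v$ as the fixed-dimensional representation of the input sequence, and then computes the probability of $y_1,\cdots,y_{T'}$ with a standard LSTM-LM formulation whose initial hidden state is set to $v$: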

$$ p(y_1,\cdots,y_{T'}|x_1,\cdots,x_T) = \prod_{t=1}^{T'} p(y_t|v, y_1,\cdots,y_{t-1}) $$
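The factorization above can be sketched as a sum of per-step log-probabilities. In the sketch below, `toy_next_token_probs` is a hypothetical stand-in for the decoder LSTM (it ignores `v` and looks only at the previous token), and the probabilities are made up for illustration.

```python
import math

def sequence_log_prob(target, next_token_probs, v):
    """Sum log p(y_t | v, y_1..y_{t-1}) over every position of the target."""
    total = 0.0
    for t, y in enumerate(target):
        probs = next_token_probs(v, target[:t])  # distribution over the vocabulary
        total += math.log(probs[y])
    return total

def toy_next_token_probs(v, prefix):
    """Hypothetical decoder: a fixed table keyed on the previous token only."""
    if not prefix:
        return {"le": 0.6, "chat": 0.3, "<EOS>": 0.1}
    if prefix[-1] == "le":
        return {"le": 0.1, "chat": 0.8, "<EOS>": 0.1}
    return {"le": 0.1, "chat": 0.1, "<EOS>": 0.8}

# p("le chat <EOS>") = 0.6 * 0.8 * 0.8 under the toy table.
log_p = sequence_log_prob(["le", "chat", "<EOS>"], toy_next_token_probs, v=None)
```

Working in log space avoids numerical underflow when the product runs over many target words.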


1.3 Experiments


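Dataset: WMT'14 English to French. The training set is a 12M-sentence subset containing 348M French words and 304M English words. The vocabulary keeps the 160,000 most frequent words of the source language and the 80,000 most frequent words of the target language; every out-of-vocabulary word is replaced with a special 'UNK' token.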
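A minimal sketch of the fixed-vocabulary preprocessing described above, assuming sentences are already tokenized into lists of strings. The function names and the tiny corpus are hypothetical; the paper does not specify its preprocessing code.

```python
from collections import Counter

def build_vocab(sentences, max_size):
    """Keep the max_size most frequent tokens of a tokenized corpus."""
    counts = Counter(tok for sent in sentences for tok in sent)
    return {tok for tok, _ in counts.most_common(max_size)}

def replace_oov(sentence, vocab, unk="UNK"):
    """Replace every token outside the vocabulary with the UNK token."""
    return [tok if tok in vocab else unk for tok in sentence]

# Toy corpus (hypothetical); the notes above use 160,000 / 80,000 word vocabularies.
corpus = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "cat", "ran"]]
vocab = build_vocab(corpus, max_size=3)
mapped = replace_oov(["the", "dog", "ran"], vocab)
```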


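Training maximizes the average log-probability of a correct translation $T$ given its source sentence $S$, over the training set $\mathcal{S}$:

$$ \frac{1}{|\mathcal{S}|} \sum_{(T,S)\in \mathcal{S}} \log p(T|S) $$

After training, the translation is the most likely one under the model:

$$ \hat{T} = \operatorname*{arg\,max}_T p(T|S) $$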
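The maximization over translations cannot be done exhaustively, and the paper approximates it with a simple left-to-right beam search. The sketch below is one possible minimal version, not the authors' implementation. `toy_probs` is a hypothetical next-token model that depends only on the target prefix, with numbers chosen so that greedy decoding would miss the best complete hypothesis.

```python
import math

def beam_search(next_token_probs, beam_size, max_len, eos="<EOS>"):
    """Left-to-right beam search; returns (log-probability, tokens) of the best hypothesis."""
    beams = [(0.0, [])]   # partial hypotheses: (log-probability, tokens so far)
    finished = []         # hypotheses that already produced the end-of-sentence token
    for _ in range(max_len):
        candidates = []
        for score, tokens in beams:
            for tok, p in next_token_probs(tokens).items():
                if p > 0.0:
                    candidates.append((score + math.log(p), tokens + [tok]))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for score, tokens in candidates[:beam_size]:
            (finished if tokens[-1] == eos else beams).append((score, tokens))
        if not beams:
            break
    pool = finished or beams  # prefer complete hypotheses when any exist
    return max(pool, key=lambda c: c[0])

def toy_probs(prefix):
    """Hypothetical stand-in for p(y_t | S, y_1..y_{t-1}); the source is left implicit."""
    if not prefix:
        return {"a": 0.6, "b": 0.4}
    if prefix[-1] == "a":
        return {"a": 0.4, "b": 0.3, "<EOS>": 0.3}
    return {"<EOS>": 0.9, "a": 0.1}

best_score, best_tokens = beam_search(toy_probs, beam_size=2, max_len=5)
```

With a beam of 2, the hypothesis starting with the less likely first token "b" survives long enough to be completed, which a beam of 1 would have discarded after the first step.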


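Reversing the word order of the source sentences improves the results: the BLEU score rises from 25.9 to 30.6. A likely reason is that reversal reduces the minimal time lag between corresponding source and target words, so backpropagation can more easily establish a connection between the source and the target.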
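The effect on the time lag can be checked with a small calculation. The sketch below assumes, purely for illustration, a source and a target of equal length with a monotonic word-for-word alignment; both function names are hypothetical.

```python
def make_training_pair(source_tokens, target_tokens, eos="<EOS>"):
    """Reverse the source only; the target keeps its order and gains an end token."""
    return list(reversed(source_tokens)), list(target_tokens) + [eos]

def time_lags(n, reverse_source):
    """Lag between source word i and target word i when an n-word source is fed
    before an n-word target (assumed monotonic one-to-one alignment)."""
    lags = []
    for i in range(n):
        src_pos = (n - 1 - i) if reverse_source else i
        tgt_pos = n + i
        lags.append(tgt_pos - src_pos)
    return lags

pair = make_training_pair(["a", "b", "c"], ["x", "y", "z"])
lags_plain = time_lags(3, reverse_source=False)    # every lag equals n
lags_reversed = time_lags(3, reverse_source=True)  # first words become adjacent
```

Under this assumption the average lag is the same in both cases, but the smallest lag drops from the sentence length to 1, which is the "minimal time lag" the note refers to.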



2. Neural Machine Translation by Jointly Learning to Align and Translate

2.1 Introduction

Neural machine translation is a recently proposed approach to machine translation. Unlike the traditional statistical machine translation, the neural machine translation aims at building a single neural network that can be jointly tuned to maximize the translation performance. The models proposed recently for neural machine translation often belong to a family of encoder–decoders and encode a source sentence into a fixed-length vector from which a decoder generates a translation. In this paper, we conjecture that the use of a fixed-length vector is a bottleneck in improving the performance of this basic encoder–decoder architecture, and propose to extend this by allowing a model to automatically (soft-)search for parts of a source sentence that are relevant to predicting a target word, without having to form these parts as a hard segment explicitly. With this new approach, we achieve a translation performance comparable to the existing state-of-the-art phrase-based system on the task of English-to-French translation. Furthermore, qualitative analysis reveals that the (soft-)alignments found by the model agree well with our intuition.



2.2 Background

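Translation from a probabilistic perspective: find the target sentence $\boldsymbol{y}$ that maximizes the conditional probability given the source sentence $\boldsymbol{x}$, that is, $\operatorname*{arg\,max}_{\boldsymbol{y}} p(\boldsymbol{y}|\boldsymbol{x})$.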


2.2.1 RNN Encoder-Decoder


$$ \begin{aligned} h_t&=f(x_t, h_{t-1})\\ c&=q(\{h_1, \cdots, h_{T_x}\}) \end{aligned} $$


$$ \begin{aligned} p(\boldsymbol{y}) &= \prod_{t=1}^{T}p(y_t|\{y_1,\cdots,y_{t-1}\},c) \\ p(y_t|\{y_1,\cdots,y_{t-1}\},c) &= g(y_{t-1},s_t,c) \end{aligned} $$

2.3 Alignment and Translation



$$ p(y_i|y_1,\cdots,y_{i-1},\boldsymbol{x}) = g(y_{i-1}, s_i, c_i) $$


$$ c_i = \sum_{j=1}^{T_x}\alpha_{ij}h_j $$

$$ \alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x}\exp(e_{ik})} $$
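The two formulas above amount to a softmax over alignment scores followed by a weighted sum of the annotations. The sketch below takes the scores $e_{ij}$ as given numbers; in the paper they come from a learned alignment model, which is not reproduced here. Function names and values are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def context_vector(energies, annotations):
    """alpha_ij = softmax_j(e_ij); c_i = sum_j alpha_ij * h_j."""
    alphas = softmax(energies)
    dim = len(annotations[0])
    c = [sum(a * h[k] for a, h in zip(alphas, annotations)) for k in range(dim)]
    return alphas, c

# Three source annotations h_1..h_3 of dimension 2 (toy values).
annotations = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
uniform_alphas, uniform_c = context_vector([0.0, 0.0, 0.0], annotations)
peaked_alphas, peaked_c = context_vector([10.0, 0.0, 0.0], annotations)
```

Equal scores give a plain average of the annotations, while one dominant score makes the context vector approach the corresponding annotation, so the decoder receives a different $c_i$ for every target word.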




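The encoder is a bidirectional RNN (BiRNN), which consists of a forward RNN and a backward RNN. The forward RNN reads the input sequence in order and computes the forward hidden states; the backward RNN reads it in reverse order and computes the backward hidden states. The annotation $h_j$ concatenates the forward and backward hidden states for position $j$, so it summarizes both the preceding and the following words while still focusing on the words around $x_j$. These annotations are used to compute the context vector in the decoder.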
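The construction of the annotations can be sketched independently of the recurrent unit. Below, `running_sum` is a hypothetical toy step function used in place of a real RNN cell, chosen so that it is easy to see which inputs each state has seen.

```python
def run_rnn(xs, step, h0):
    """Apply a step function over xs and return the hidden state at every position."""
    h, states = h0, []
    for x in xs:
        h = step(x, h)
        states.append(h)
    return states

def birnn_annotations(xs, fwd_step, bwd_step, h0_fwd, h0_bwd):
    """h_j = [forward state at j ; backward state at j]."""
    fwd = run_rnn(xs, fwd_step, h0_fwd)
    bwd = run_rnn(xs[::-1], bwd_step, h0_bwd)[::-1]  # re-align to source order
    return [f + b for f, b in zip(fwd, bwd)]          # list concatenation

def running_sum(x, h):
    """Toy step (hypothetical): the state is the sum of all inputs seen so far."""
    return [h[0] + x]

annotations = birnn_annotations([1.0, 2.0, 3.0], running_sum, running_sum, [0.0], [0.0])
```

In the toy output, the first component of each annotation covers $x_1,\ldots,x_j$ and the second covers $x_j,\ldots,x_{T_x}$, which mirrors how each annotation sees both sides of position $j$.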

2.4 Experiments

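The approach is evaluated on English-to-French translation, using the bilingual parallel corpora provided by ACL WMT '14. See the original paper for details.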


2.5 Results

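RNNsearch outperforms RNNencdec. When only sentences made up of known words are considered, RNNsearch also outperforms the phrase-based system Moses. RNNsearch-50 performs well on long sentences.

The alignments look sensible. For example, when "the man" is translated as "l'homme", both "the" and "man" receive large weights at the step that produces "l'". This matches intuition: whether "the" becomes le, la, les or l' can only be decided once the noun that follows it (and hence its gender) is known.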

3. Effective Approaches to Attention-based Neural Machine Translation

3.1 Introduction

An attentional mechanism has lately been used to improve neural machine translation (NMT) by selectively focusing on parts of the source sentence during translation. However, there has been little work exploring useful architectures for attention-based NMT. This paper examines two simple and effective classes of attentional mechanism: a global approach which always attends to all source words and a local one that only looks at a subset of source words at a time. We demonstrate the effectiveness of both approaches on the WMT translation tasks between English and German in both directions. With local attention, we achieve a significant gain of 5.0 BLEU points over non-attentional systems that already incorporate known techniques such as dropout. Our ensemble model using different attention architectures yields a new state-of-the-art result in the WMT’15 English to German translation task with 25.9 BLEU points, an improvement of 1.0 BLEU points over the existing best system backed by NMT and an n-gram reranker.


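The global approach resembles the model of Bahdanau et al. (2015) but is architecturally simpler. The local approach can be viewed as a blend of hard and soft attention: it is computationally cheaper than global (soft) attention and, unlike hard attention, it is differentiable almost everywhere, which makes it easier to implement and train. The paper also examines several alignment functions. The models reach state-of-the-art results on both WMT'14 and WMT'15. The analysis covers learning, the handling of long sentences, the choice of attentional architecture, alignment quality and translation outputs.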

3.2 Method


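NMT architectures generally use an RNN for the decoder; they differ in which RNN architecture the decoder uses and in how the encoder computes the source sentence representation. This paper uses a stacking LSTM architecture, with the training objective: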

$$ J_t = \sum_{(x,y)\in \mathbb{D}} -\log p(y|x) $$


$$ \begin{aligned} \tilde{h}_t &= \tanh (W_c[c_t;h_t])\\ p(y_t|y_{<t}, x) &= \mathrm{softmax}(W_s\tilde{h}_t) \end{aligned} $$
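A minimal sketch of this prediction step, assuming the context vector $c_t$ and the decoder state $h_t$ are already available. The weight matrices are small made-up values and the three-word output vocabulary is hypothetical.

```python
import math

def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict_next_word(c_t, h_t, W_c, W_s):
    """h~_t = tanh(W_c [c_t; h_t]); p(y_t | y_<t, x) = softmax(W_s h~_t)."""
    h_tilde = [math.tanh(z) for z in matvec(W_c, c_t + h_t)]  # c_t + h_t is [c_t; h_t]
    return h_tilde, softmax(matvec(W_s, h_tilde))

c_t, h_t = [0.5, -0.5], [1.0, 0.0]           # toy 2-dim context and decoder state
W_c = [[0.3, 0.1, -0.2, 0.4], [0.0, 0.5, 0.5, -0.1]]  # maps 4 dims to 2
W_s = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]          # maps 2 dims to 3 words
h_tilde, probs = predict_next_word(c_t, h_t, W_c, W_s)
```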


$$ \begin{aligned} a_t(s) &= \mathrm{align}(h_t,\bar{h}_s) \\ &=\frac{\exp(\mathrm{score}(h_t,\bar{h}_s))}{\sum_{s'}\exp(\mathrm{score}(h_t,\bar{h}_{s'}))} \end{aligned} $$

$$ \mathrm{score}(h_t,\bar{h}_s) = \begin{cases} h_t^{\mathrm{T}}\bar{h}_s & \text{dot}\\ h_t^{\mathrm{T}}W_a\bar{h}_s & \text{general}\\ v_a^{\mathrm{T}}\tanh(W_a[h_t;\bar{h}_s]) & \text{concat} \end{cases} $$
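The three content-based score functions differ only in how the target state $h_t$ and a source state $\bar{h}_s$ are combined. The sketch below implements each one on plain Python lists, together with the softmax normalization over source positions. All names and parameter values are hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(W, v):
    return [dot(row, v) for row in W]

def score_dot(h_t, h_s):
    """dot: h_t^T h_s"""
    return dot(h_t, h_s)

def score_general(h_t, h_s, W_a):
    """general: h_t^T W_a h_s"""
    return dot(h_t, matvec(W_a, h_s))

def score_concat(h_t, h_s, W_a, v_a):
    """concat: v_a^T tanh(W_a [h_t; h_s])"""
    return dot(v_a, [math.tanh(z) for z in matvec(W_a, h_t + h_s)])

def global_alignment(h_t, source_states, score):
    """a_t(s): softmax of score(h_t, h_s) over all source positions s."""
    scores = [score(h_t, h_s) for h_s in source_states]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

h_t = [1.0, 2.0]
source_states = [[0.5, -1.0], [1.0, 1.0], [0.0, 0.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
W_concat = [[0.1, 0.2, 0.3, 0.4], [-0.4, -0.3, -0.2, -0.1]]  # maps 4 dims to 2
v_a = [1.0, -1.0]
a_t = global_alignment(h_t, source_states, score_dot)
```

With an identity matrix for $W_a$, the general score reduces to the dot score, which makes the relationship between the first two variants easy to verify.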


$$ a_t = \mathrm{softmax}(W_ah_t) $$

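Compared with Bahdanau et al. (2015), the approach here is both simpler and more general. First, it uses only the hidden states at the top LSTM layers of the encoder and the decoder, whereas Bahdanau et al. use the concatenation of the forward and backward source hidden states in a bidirectional encoder, and the target hidden states of a non-stacking unidirectional decoder. Second, the computation path is simpler. Finally, Bahdanau et al. experiment with only one alignment function, while the other alternatives turn out to work better.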



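The local model is inspired by the tradeoff between soft and hard attention in Xu et al. (2015). There, soft attention attends to all patches of the source image, while hard attention attends to a single patch at a time. Hard attention is cheaper at inference time, but it is non-differentiable and needs more complicated techniques such as variance reduction or reinforcement learning to train.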

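The local mechanism in this paper attends to a small window of the context. It is differentiable, avoids the expensive computation of soft attention, and is easier to train than hard attention. At each time step $t$ the model first generates an aligned position $p_t$; the context vector $c_t$ is then the weighted average of the source hidden states inside the window $[p_t-D,p_t+D]$, where $D$ is chosen empirically. In the global mechanism $a_t$ has variable length, whereas in the local mechanism it has fixed dimension, $a_t\in \mathbb{R}^{2D+1}$. The paper describes two variants of the local model.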
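The windowed computation can be sketched as follows. The aligned position $p_t$ is passed in as a given number, so the sketch does not cover how it is produced. As an assumption taken from the paper rather than from the notes above, the weights are multiplied by a Gaussian centered at $p_t$ with standard deviation $D/2$, and windows that cross a sentence boundary are simply clipped. All function names are hypothetical.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def local_attention(h_t, source_states, p_t, D, score):
    """Attend only to source positions inside [p_t - D, p_t + D]."""
    lo = max(0, math.ceil(p_t - D))
    hi = min(len(source_states) - 1, math.floor(p_t + D))
    positions = list(range(lo, hi + 1))
    align = softmax([score(h_t, source_states[s]) for s in positions])
    sigma = D / 2.0  # assumption: Gaussian width as in the paper's predictive variant
    weights = [a * math.exp(-((s - p_t) ** 2) / (2.0 * sigma ** 2))
               for a, s in zip(align, positions)]
    dim = len(source_states[0])
    c_t = [sum(w * source_states[s][k] for w, s in zip(weights, positions))
           for k in range(dim)]
    return positions, weights, c_t

def score_dot(h_t, h_s):
    return sum(a * b for a, b in zip(h_t, h_s))

# Ten identical toy source states, so only the Gaussian term shapes the weights.
source_states = [[1.0, 1.0] for _ in range(10)]
positions, weights, c_t = local_attention([0.5, 0.5], source_states, p_t=4.0, D=2,
                                          score=score_dot)
```

Away from the sentence boundaries the window always holds $2D+1$ positions, regardless of the source length, which is what keeps the cost per target word constant.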

4. Attention is All You Need

4.1 Introduction

The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.0 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature.
