A Close Reading of the Attention Paper
郝伟 2022/01/23

0 Paper Overview

The Attention mechanism introduced in 2017 has had a major impact on deep learning. This article is a close reading of the original Attention paper.

0.1 Basic Information

Title: Attention Is All You Need
Venue: 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.
NIPS, in full the Conference and Workshop on Neural Information Processing Systems, is an international conference on machine learning and computational neuroscience. It is held every December, is organized by the NIPS Foundation, and is one of the top venues in machine learning.
Authors: Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin
Affiliation: Google
Abstract page: link
Download: link

0.2 Abstract

The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.

0.3 Key Points

  1. Dominant sequence transduction (seq2seq) models rely on RNNs or CNNs in an encoder-decoder configuration;
  2. The paper proposes the Transformer, an architecture based solely on attention that dispenses with RNNs and CNNs entirely.

0.4 Main Problems

However, triple-based representations suffer from two major problems:
  1. Computational efficiency: poor portability and scalability, and difficulty handling massive amounts of data;
  2. Data sparsity, which leads to the long-tail problem.

1 Representation Learning

1.1 Basic Concepts

  1. The Attention mechanism
    Attention computes a "degree of relevance"; for example, during translation each Chinese output word depends on the English input words to different degrees. Attention can usually be described as:
    $$\mathrm{Query}(Q) + \mathrm{KVs}\ (\text{key-value pairs}) \rightarrow \mathrm{Output}$$
    The query, each key, and each value are vectors; the output is a weighted sum over all values in $V$, where the weights are computed from the query and each key. The computation proceeds in three steps, starting from the similarity scores
    $$y_i = f(Q, K_i), \quad i = 1, 2, \ldots, m \ \text{(likewise below)}$$

1.2 Computing Attention

1.2.1 Step 1: Similarity

Compute the similarity scores $y_i = f(Q, K_i)$. Four common choices for $f$ are:

$$f(Q, K_i) = \begin{cases} Q^\top K_i & \text{dot product} \\ \dfrac{Q^\top K_i}{\sqrt{d_k}} & \text{scaled dot product} \\ Q^\top W K_i & \text{general (bilinear)} \\ v^\top \tanh(W Q + U K_i) & \text{additive (perceptron)} \end{cases}$$
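As a sketch, the similarity functions above can be computed with NumPy; the dimensions and the weight matrices `W`, `W1`, `W2`, `v` below are illustrative assumptions, not values from the paper:

```python
import numpy as np

# Toy query and keys; one score per key comes out of each similarity function.
rng = np.random.default_rng(0)
d = 4
Q = rng.normal(size=d)       # query vector
K = rng.normal(size=(3, d))  # m = 3 key vectors, one per row

# Dot product: f(Q, K_i) = Q . K_i
s_dot = K @ Q

# Scaled dot product (the Transformer's choice): Q . K_i / sqrt(d_k)
s_scaled = s_dot / np.sqrt(d)

# General / bilinear: f(Q, K_i) = Q^T W K_i with a learned W
W = rng.normal(size=(d, d))
s_general = K @ (W @ Q)

# Additive ("perceptron"): f(Q, K_i) = v^T tanh(W1 Q + W2 K_i)
W1, W2 = rng.normal(size=(d, d)), rng.normal(size=(d, d))
v = rng.normal(size=d)
s_additive = np.tanh(W1 @ Q + K @ W2.T) @ v

print(s_dot.shape, s_scaled.shape, s_general.shape, s_additive.shape)
```

Each variant maps the query and $m$ keys to $m$ scalar scores; they differ only in how much learned structure sits between $Q$ and $K_i$.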

1.2.2 Step 2: Normalization

Apply a Softmax to the similarity scores to normalize them:

$$a_i = \frac{e^{f(Q, K_i)}}{\sum_{j=1}^m e^{f(Q, K_j)}}, \quad i = 1, 2, \ldots, m$$
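A minimal sketch of this normalization step; the example scores are made up:

```python
import numpy as np

def softmax(scores):
    # Subtract the max before exponentiating for numerical stability;
    # this does not change the result mathematically.
    e = np.exp(scores - np.max(scores))
    return e / e.sum()

a = softmax(np.array([2.0, 1.0, 0.1]))
print(a.sum())  # the weights a_i are non-negative and sum to 1
```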

1.2.3 Step 3: Computing the Attention Vector

Use the computed weights $a_i$ to take a weighted sum over all values in $V$, giving the attention vector:

$$\mathrm{Attention}(Q, K, V) = \sum_{i=1}^m a_i V_i \tag{1}$$

In matrix form, using scaled dot products as the similarity function, the three steps combine into:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V \tag{2}$$
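Equation (2) can be sketched end to end in NumPy; the shapes below are illustrative assumptions:

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # (n_q, m) similarities
    scores -= scores.max(axis=-1, keepdims=True)     # stability shift
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # weighted sum of values

rng = np.random.default_rng(1)
Q = rng.normal(size=(2, 8))    # 2 queries, d_k = 8
K = rng.normal(size=(5, 8))    # m = 5 keys
V = rng.normal(size=(5, 16))   # matching values, d_v = 16
out = attention(Q, K, V)
print(out.shape)  # (2, 16): one d_v-dimensional attention vector per query
```

Note that each output row is a convex combination of the rows of `V`, exactly as in Equation (1).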


What are the encoder and decoder?

An encoder maps an input sequence of symbol representations $\textbf{x}=(x_1, ..., x_n)$ to a sequence of continuous representations $\textbf{z} = (z_1, ..., z_n)$. Given $\textbf{z}$, the decoder then generates an output sequence $(y_1, ..., y_m)$ of symbols one element at a time. At each step the model is auto-regressive, consuming the previously generated symbols as additional input when generating the next.
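The auto-regressive generation loop can be sketched as follows; `next_symbol` is a hypothetical stand-in for a real decoder, which would attend over $\textbf{z}$ and the generated prefix rather than apply the toy echo rule used here:

```python
def next_symbol(z, prefix):
    # Toy rule (an assumption, not the paper's model): echo the encoded
    # input symbol by symbol, then emit an end-of-sequence marker.
    return z[len(prefix)] if len(prefix) < len(z) else "<eos>"

def decode(z, max_len=10):
    # Auto-regressive loop: each step consumes the previously generated
    # symbols as additional input when generating the next one.
    ys = []
    while len(ys) < max_len:
        y = next_symbol(z, ys)
        if y == "<eos>":
            break
        ys.append(y)
    return ys

print(decode(["a", "b", "c"]))  # → ['a', 'b', 'c']
```

The point of the sketch is the control flow: generation is sequential because step $t$ cannot start until symbols $y_1, \ldots, y_{t-1}$ exist.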