A Close Reading of the Attention Paper
郝伟 2022/01/23

0 Paper Overview

The Attention mechanism introduced in 2017 has had a major impact on deep learning. This article is a close reading of the original Attention paper.

0.1 Basic Information

Title: Attention Is All You Need
Venue: 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.
NIPS, in full the Conference and Workshop on Neural Information Processing Systems, is an international conference on machine learning and computational neuroscience. It is held every December, is organized by the NIPS Foundation, and is one of the top venues in machine learning.
Authors: Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin
Affiliation: Google
Abstract page: link
Download: link

0.2 Abstract

The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.

0.3 Key Points

  1. Dominant sequence transduction (seq2seq) models rely on RNNs or CNNs in an encoder-decoder configuration;
  2. The paper proposes the Transformer, an architecture based solely on attention that dispenses with RNNs and CNNs entirely.

0.4 Main Problems

However, triple-based representations suffer from two major problems:
  1. Computational efficiency: poor portability and scalability, and difficulty handling massive amounts of data;
  2. Data sparsity, which leads to the long-tail problem.

1 Representation Learning

1.1 Basic Concepts

  1. The Attention mechanism
    Attention computes a "degree of relevance"; for example, during translation each Chinese output word depends on the English input words to different degrees. Attention can usually be described as:
    $$\mathrm{Query}(Q) + \mathrm{KVs}\ (\text{key-value pairs}) \rightarrow \mathrm{Output}$$
    The query, each key, and each value are vectors; the output is a weighted sum over all values in $V$, where the weights are computed from the query and each key. The computation proceeds in three steps, starting from the similarity scores
    $$y_i = f(Q, K_i), \quad i = 1, 2, \ldots, m \ \text{(likewise below)}$$

1.2 Computing Attention

1.2.1 Step 1: Similarity

Compute the similarity scores $y_i = f(Q, K_i)$. Four common choices for $f$ are:

$$f(Q, K_i) = \begin{cases} Q^\top K_i & \text{dot product} \\ \dfrac{Q^\top K_i}{\sqrt{d_k}} & \text{scaled dot product} \\ Q^\top W K_i & \text{general (bilinear)} \\ v^\top \tanh(W Q + U K_i) & \text{additive (perceptron)} \end{cases}$$
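As a sketch, the similarity functions above can be computed with NumPy; the dimensions and the weight matrices `W`, `W1`, `W2`, `v` below are illustrative assumptions, not values from the paper:

```python
import numpy as np

# Toy query and keys; one score per key comes out of each similarity function.
rng = np.random.default_rng(0)
d = 4
Q = rng.normal(size=d)       # query vector
K = rng.normal(size=(3, d))  # m = 3 key vectors, one per row

# Dot product: f(Q, K_i) = Q . K_i
s_dot = K @ Q

# Scaled dot product (the Transformer's choice): Q . K_i / sqrt(d_k)
s_scaled = s_dot / np.sqrt(d)

# General / bilinear: f(Q, K_i) = Q^T W K_i with a learned W
W = rng.normal(size=(d, d))
s_general = K @ (W @ Q)

# Additive ("perceptron"): f(Q, K_i) = v^T tanh(W1 Q + W2 K_i)
W1, W2 = rng.normal(size=(d, d)), rng.normal(size=(d, d))
v = rng.normal(size=d)
s_additive = np.tanh(W1 @ Q + K @ W2.T) @ v

print(s_dot.shape, s_scaled.shape, s_general.shape, s_additive.shape)
```

Each variant maps the query and $m$ keys to $m$ scalar scores; they differ only in how much learned structure sits between $Q$ and $K_i$.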

1.2.2 Step 2: Normalization

Apply a Softmax to the similarity scores to normalize them:

$$a_i = \frac{e^{f(Q, K_i)}}{\sum_{j=1}^m e^{f(Q, K_j)}}, \quad i = 1, 2, \ldots, m$$
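A minimal sketch of this normalization step; the example scores are made up:

```python
import numpy as np

def softmax(scores):
    # Subtract the max before exponentiating for numerical stability;
    # this does not change the result mathematically.
    e = np.exp(scores - np.max(scores))
    return e / e.sum()

a = softmax(np.array([2.0, 1.0, 0.1]))
print(a.sum())  # the weights a_i are non-negative and sum to 1
```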

1.2.3 Step 3: Computing the Attention Vector

Use the computed weights $a_i$ to take a weighted sum over all values in $V$, giving the attention vector:

$$\mathrm{Attention}(Q, K, V) = \sum_{i=1}^m a_i V_i \tag{1}$$

In matrix form, using scaled dot products as the similarity function, the three steps combine into:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V \tag{2}$$
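Equation (2) can be sketched end to end in NumPy; the shapes below are illustrative assumptions:

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # (n_q, m) similarities
    scores -= scores.max(axis=-1, keepdims=True)     # stability shift
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # weighted sum of values

rng = np.random.default_rng(1)
Q = rng.normal(size=(2, 8))    # 2 queries, d_k = 8
K = rng.normal(size=(5, 8))    # m = 5 keys
V = rng.normal(size=(5, 16))   # matching values, d_v = 16
out = attention(Q, K, V)
print(out.shape)  # (2, 16): one d_v-dimensional attention vector per query
```

Note that each output row is a convex combination of the rows of `V`, exactly as in Equation (1).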


What are the encoder and decoder?

An encoder maps an input sequence of symbol representations $\textbf{x}=(x_1, ..., x_n)$ to a sequence of continuous representations $\textbf{z} = (z_1, ..., z_n)$. Given $\textbf{z}$, the decoder then generates an output sequence $(y_1, ..., y_m)$ of symbols one element at a time. At each step the model is auto-regressive, consuming the previously generated symbols as additional input when generating the next.
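The auto-regressive generation loop can be sketched as follows; `next_symbol` is a hypothetical stand-in for a real decoder, which would attend over $\textbf{z}$ and the generated prefix rather than apply the toy echo rule used here:

```python
def next_symbol(z, prefix):
    # Toy rule (an assumption, not the paper's model): echo the encoded
    # input symbol by symbol, then emit an end-of-sequence marker.
    return z[len(prefix)] if len(prefix) < len(z) else "<eos>"

def decode(z, max_len=10):
    # Auto-regressive loop: each step consumes the previously generated
    # symbols as additional input when generating the next one.
    ys = []
    while len(ys) < max_len:
        y = next_symbol(z, ys)
        if y == "<eos>":
            break
        ys.append(y)
    return ys

print(decode(["a", "b", "c"]))  # → ['a', 'b', 'c']
```

The point of the sketch is the control flow: generation is sequential because step $t$ cannot start until symbols $y_1, \ldots, y_{t-1}$ exist.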