In-Depth Reading: A Comprehensive Survey on Knowledge Graph Entity Alignment via Representation Learning
郝伟 2021/01/12

Basic Information

@article{DBLP:journals/corr/abs-2103-15059,
  author    = {Rui Zhang and
               Bayu Distiawan Trisedya and
               Miao Li and
               Yong Jiang and
               Jianzhong Qi},
  title     = {A Comprehensive Survey on Knowledge Graph Entity Alignment via Representation Learning},
  journal   = {CoRR},
  volume    = {abs/2103.15059},
  year      = {2021},
  url       = {https://arxiv.org/abs/2103.15059},
  eprinttype = {arXiv},
  eprint    = {2103.15059},
  timestamp = {Thu, 15 Jul 2021 12:17:32 +0200},
  biburl    = {https://dblp.org/rec/journals/corr/abs-2103-15059.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}

Note: CoRR (Computing Research Repository) is a repository for computer science papers and part of the arXiv e-print service; it covers only computer science content. Like the rest of arXiv, CoRR is operated by Cornell University.

Paper download: https://arxiv.org/abs/2103.15059

Author Information

Abstract

In the last few years, the interest in knowledge bases has grown exponentially in both the research community and the industry due to their essential role in AI applications. Entity alignment is an important task for enriching knowledge bases. This paper provides a comprehensive tutorial-type survey on representative entity alignment techniques that use the new approach of representation learning. We present a framework for capturing the key characteristics of these techniques, propose two datasets to address the limitation of existing benchmark datasets, and conduct extensive experiments using the proposed datasets. The framework gives a clear picture of how the techniques work. The experiments yield important results about the empirical performance of the techniques and how various factors affect the performance. One important observation not stressed by previous work is that techniques making good use of attribute triples and relation predicates as features stand out as winners.


1 Introduction

1.1 Definitions

Knowledge Base (KB)

Knowledge bases are a technology used to store complex structured and unstructured information, typically facts or knowledge.

Knowledge Graph (KG)

A knowledge graph (KG), which is a knowledge base modeled by a graph structure or topology, is the most popular form of knowledge bases and
has almost become a synonym of knowledge base today.

Entity Alignment (EA)

One of the most important tasks for KGs is entity alignment, which aims to identify entities from different KGs that represent the same real-world entity.

1.2 Significance

Different KGs may be created via different sources and methods, so even entities representing the same real-world entity may be denoted differently in different KGs, and it is challenging to identify all such aligned entities accurately.

1.3 Traditional EA Methods

1.4 Main Contents of the Paper

The paper provides:
(i) a comprehensive tutorial-type survey to help readers understand how each technique works with little need to refer to their full papers,
(ii) an up-to-date framework with new insights and a structure that captures the latest techniques, and
(iii) analysis on how different techniques compare to each other.
In addition, the paper analyzes several datasets.

2 Preliminaries

2.1 Notation

$\mathcal{G} = (\mathcal{E}, \mathcal{R}, \mathcal{A}, \mathcal{V}, \mathcal{T})$, where:

Table 1. Commonly used notation

2.2 Formal Definition of EA

Given two KGs $\mathcal{G}_1 = (\mathcal{E}_1, \mathcal{R}_1, \mathcal{A}_1, \mathcal{V}_1, \mathcal{T}_1)$ and $\mathcal{G}_2 = (\mathcal{E}_2, \mathcal{R}_2, \mathcal{A}_2, \mathcal{V}_2, \mathcal{T}_2)$, EA aims to identify every pair of entities $(e_1, e_2)$, $e_1 \in \mathcal{E}_1$, $e_2 \in \mathcal{E}_2$, where $e_1$ and $e_2$ represent the same real-world entity (i.e., $e_1$ and $e_2$ are aligned entities).

Simply put, entity alignment finds pairs of records that live in different datasets but describe the same real-world entity.

2.3 Related Problems

Since EA matches entities across two knowledge bases, the related problems fall into the following two categories, depending on whether the bases are structured:

2.4 Traditional Techniques for EA

Summary: Traditional EA techniques, as exemplified above, usually use data mining or database approaches, typically heuristics, to identify similar entities. It is difficult for them to achieve high accuracy and to generalize.

3 Generic Framework of Embedding-based EA

Figure 1. The framework of embedding-based EA (dashed boxes indicate optional modules)

3.1 Embedding module

The embedding module learns vector representations of entities (usually low-dimensional), also called entity embeddings. It mainly comes in the following four types:

Machine learning models operate on vectors, and GNNs (among other models) are used to learn these vector representations.

Key point: The embedding module computes the embeddings of each KG separately, which makes the embeddings of $\mathcal{G}_1$ and $\mathcal{G}_2$ fall into different vector spaces.

3.2 Alignment module

Goal: The alignment module aims to unify the embeddings of the two KGs into the same vector space so that aligned entities can be identified, which is a major challenge for EA.

The most common approach is to use a set of seed alignments $\mathcal{S}$ satisfying: $\mathcal{S} = \{(e_1, e_2) \mid e_1 \in \mathcal{E}_1, e_2 \in \mathcal{E}_2, e_1 \equiv e_2\}$

Loss function based on $f_{align}$

The loss function is defined as: $$\mathcal{L} = \sum_{(e_1, e_2) \in \mathcal{S}} \sum_{(e'_1, e'_2) \in \mathcal{S}'} \max(0, [\gamma + f_{align}(e_1, e_2) - f_{align}(e'_1, e'_2)]) \tag{1}$$ where $\mathcal{S}'$ denotes negative (corrupted) seed pairs and $\gamma$ is a margin hyperparameter.
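To make Eq. (1) concrete, here is a minimal sketch of the margin-based alignment loss, assuming $f_{align}$ is the L2 distance between embeddings (the choice of $f_{align}$ varies across techniques):

```python
import numpy as np

def alignment_loss(emb1, emb2, seeds, neg_seeds, gamma=1.0):
    """Margin-based alignment loss in the spirit of Eq. (1).

    emb1, emb2: entity embedding matrices of G1 and G2 (n x d arrays).
    seeds: (i, j) index pairs of aligned seed entities.
    neg_seeds: (i, j) negative (corrupted) pairs.
    f_align is taken here to be the L2 distance (an assumption).
    """
    def f_align(i, j):
        return np.linalg.norm(emb1[i] - emb2[j])

    loss = 0.0
    for i, j in seeds:
        for i2, j2 in neg_seeds:
            # A seed pair should be closer than a negative pair by gamma.
            loss += max(0.0, gamma + f_align(i, j) - f_align(i2, j2))
    return loss
```

Each term becomes zero once a seed pair is closer than a corrupted pair by at least the margin $\gamma$.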

Commonly used measures in the loss function

Given two $n$-dimensional vectors $\mathbf{x} = (x_1, x_2, \ldots, x_n)$ and $\mathbf{y} = (y_1, y_2, \ldots, y_n)$, the following two measures are commonly used:
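As an illustration, a minimal sketch of the two measures most often used in this role, an Lp (Minkowski) distance and cosine similarity:

```python
import numpy as np

def lp_distance(x, y, p=2):
    # Minkowski (Lp) distance; p=1 is Manhattan, p=2 is Euclidean.
    return float(np.sum(np.abs(x - y) ** p) ** (1.0 / p))

def cosine_similarity(x, y):
    # Cosine of the angle between x and y; 1 means identical direction.
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))
```

Distance measures decrease as vectors get closer, while cosine similarity increases; a loss must be written consistently with whichever is chosen.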

The $\max(0, \cdot)$ function

The max function ensures that negative values do not contribute to the loss: pairs that already satisfy the margin add zero.

In summary, the input features to the alignment module may be raw information such as KG structure, relation predicates, and attributes, as well as entity/relation/attribute alignments which may be created manually or automatically.

Bootstrapping

is a common strategy when only limited seed alignments are available. The idea is that the entity/attribute/relation alignments produced by the EA inference module are fed back to the alignment module as training data, and this process may be iterated multiple times.

Note that creating seeds takes human effort, which is expensive.

Bootstrapping may help reduce human effort, but at the cost of much more computation, since training is iterated multiple times.
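The bootstrapping loop can be sketched as follows; `train` and `infer` are hypothetical stand-ins for the alignment and inference modules:

```python
def bootstrap(seeds, train, infer, rounds=3, threshold=0.9):
    """Iterative bootstrapping sketch (names are illustrative).

    train(seeds) returns a trained alignment model; infer(model,
    threshold) returns newly predicted pairs whose similarity exceeds
    `threshold`.  Confident predictions are fed back as training data.
    """
    for _ in range(rounds):
        model = train(seeds)
        new_pairs = infer(model, threshold)
        seeds = seeds | new_pairs  # augment the seed alignments
    return seeds
```

The threshold controls the usual trade-off: a low threshold grows the training set faster but risks feeding back wrong alignments that compound over rounds.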

3.3 EA Inference module

Goal

This module aims to predict whether a pair of entities from G1 and G2 are aligned.
That is, this module predicts whether two entities refer to the same real-world entity.

Typical application scenario

Given a target entity $e_1$ from $\mathcal{G}_1$, the EA inference module aims to predict an entity $e_2$ from $\mathcal{G}_2$ that is aligned to $e_1$; we may call $e_1$ ($e_2$) the aligned entity or the counterpart entity of $e_2$ ($e_1$). The aligned entity may not exist if a similarity threshold is applied.

NNS (nearest neighbour search) finds the entity $e_2$ from $\mathcal{G}_2$ that is the most similar to $e_1$ from $\mathcal{G}_1$, based on their embeddings obtained from the EA training module.

This may cause the many-to-one alignment problem, i.e., multiple entities from $\mathcal{G}_2$ are similar to $e_1$. A common solution is to impose a one-to-one constraint.
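A minimal sketch of NNS inference with a greedy one-to-one constraint, assuming Euclidean distance over the learned embeddings (real systems may instead solve an optimal assignment problem):

```python
import numpy as np

def nns_one_to_one(emb1, emb2):
    """Greedy one-to-one nearest-neighbour alignment (a sketch).

    Computes pairwise Euclidean distances between all entities of G1
    and G2, then repeatedly picks the globally closest unmatched pair,
    so no entity of G2 is aligned to more than one entity of G1.
    """
    n1, n2 = len(emb1), len(emb2)
    # Pairwise distance matrix: dist[i, j] = ||emb1[i] - emb2[j]||.
    dist = np.linalg.norm(emb1[:, None, :] - emb2[None, :, :], axis=2)
    matches = {}
    used1, used2 = set(), set()
    # Greedily take the smallest remaining distance.
    for i, j in sorted(
        ((i, j) for i in range(n1) for j in range(n2)),
        key=lambda ij: dist[ij[0], ij[1]],
    ):
        if i not in used1 and j not in used2:
            matches[i] = j
            used1.add(i)
            used2.add(j)
    return matches
```

Without the `used2` check, both entities of $\mathcal{G}_1$ could map to the same entity of $\mathcal{G}_2$, which is exactly the many-to-one problem described above.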

3.4 Summary

Table 2. Summary of common techniques

4 KG Structure Embedding Models

This section reviews the two paradigms of KG structure embedding, which is the core part of embedding-based EA techniques.

4.1 Translation-based Embedding Models

The essence of translation-based embedding models is treating a relation in KGs as a “translation” in a vector space between the head and the tail entities.

TransE [Bordes et al., 2013] is the first translation-based model, which embeds both entities and relations into a unified low-dimensional vector space.

Given a relation triple $(h, r, t)$, TransE learns embeddings $\mathbf{h}$, $\mathbf{t}$, and $\mathbf{r}$ of entities $h$ and $t$ and relation predicate $r$ such that $\mathbf{h} + \mathbf{r} \approx \mathbf{t}$. To realize this assumption in learning, a triple score function used for measuring the plausibility of a relation triple is defined as follows: $$f_{triple}(h, r, t) = \|\mathbf{h} + \mathbf{r} - \mathbf{t}\| \tag{2}$$

The training loss is then defined as: $$\mathcal{L} = \sum_{(h,r,t) \in \mathcal{T}_r} \sum_{(h',r',t') \in \mathcal{T}'_r} \max(0, [\gamma + f_{triple}(h, r, t) - f_{triple}(h', r', t')]) \tag{3}$$ where $\mathcal{T}'_r$ is the set of corrupted (negative) triples, obtained by replacing the head or tail of a true triple, and $\gamma$ is the margin.
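A minimal numeric sketch of Eqs. (2) and (3); for brevity each positive triple is paired with one corrupted triple rather than summing over all combinations:

```python
import numpy as np

def triple_score(h, r, t):
    # f_triple(h, r, t) = ||h + r - t|| (Eq. 2); lower = more plausible.
    return float(np.linalg.norm(h + r - t))

def transe_loss(pos_triples, neg_triples, ent, rel, gamma=1.0):
    """Margin loss in the spirit of Eq. (3), on index triples.

    ent, rel: embedding matrices; pos_triples/neg_triples: lists of
    (head, relation, tail) index triples, negatives being corrupted
    copies of the positives (head or tail replaced).
    """
    loss = 0.0
    for (h, r, t), (h2, r2, t2) in zip(pos_triples, neg_triples):
        pos = triple_score(ent[h], rel[r], ent[t])
        neg = triple_score(ent[h2], rel[r2], ent[t2])
        # True triples should score lower than corrupted ones by gamma.
        loss += max(0.0, gamma + pos - neg)
    return loss
```

Training then minimizes this loss by gradient descent over `ent` and `rel`, pushing $\mathbf{h} + \mathbf{r}$ toward $\mathbf{t}$ for true triples and away for corrupted ones.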

4.2 GNN-based Embedding Models

GNNs [Wu et al., 2021] have yielded strong performance on graph data analysis and gained immense popularity. Two common variants:

Key characteristic: GNN-based models focus on aggregating information from the neighborhood of entities together with the graph structure to compute entity embeddings.

Graph Convolutional Networks (GCNs)

GCNs compute a target node's embedding as a low-dimensional vector by aggregating the features of its neighbors and of the node itself, following the rules of message passing in graphs.
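This message-passing rule can be sketched as a single GCN layer in the style of Kipf and Welling, with self-loops and symmetric normalisation (an illustrative sketch, not the exact layer of any specific EA technique):

```python
import numpy as np

def gcn_layer(A, H, W):
    """One GCN layer: ReLU(D^{-1/2} (A + I) D^{-1/2} H W).

    A: adjacency matrix (n x n), H: node features (n x d),
    W: learnable weight matrix (d x d').
    """
    n = A.shape[0]
    A_hat = A + np.eye(n)                    # self-loops: a node sees itself
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))   # symmetric degree normalisation
    return np.maximum(0.0, D_inv_sqrt @ A_hat @ D_inv_sqrt @ H @ W)
```

Stacking several such layers lets a node's embedding absorb information from multi-hop neighbourhoods.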

Graph Attention Networks (GAT)

GATs aggregate information from the neighborhood with the attention mechanism [Vaswani et al., 2017], allowing the model to focus on the most relevant neighbors.

5 Translation-based EA Techniques

These techniques are all based on TransE [Bordes et al., 2013] or its variants, which encode KG structure via relation triples, paths, or neighborhoods.

5.1 Techniques that Only Use KG Structure

5.2 Techniques that Exploit Relation Predicates and Attributes

6 GNN-based EA Techniques

GNNs suit KGs' inherent graph structure, and hence a growing number of EA techniques in the last couple of years are based on GNNs.

6.1 GCN-based EA Techniques

6.2 GAT-based EA Techniques

KECG [Li et al., 2019]
AliNet [Sun et al., 2020a]
MRAEA [Mao et al., 2020]
EPEA [Wang et al., 2020]
AttrGNN [Liu et al., 2020]

7 Datasets and Experimental Studies

Main contents:

7.1 Limitations of Existing Datasets

Table 3. Comparison of related datasets

7.2 Limitations of Existing Experimental Studies

This subsection reviews several issues with prior experimental studies, such as inconsistent experimental configurations, dataset limitations, and insufficient comparisons, and notes that this paper's study is more thorough.

7.3 Paper Proposed Datasets: DWY-NB

The remainder of the paper demonstrates, through a series of experiments, that the proposed datasets are well designed, addressing problems such as bidirectional alignment, insufficient name diversity, and overly small sample sets. Since this material is very specific, it is not covered further here.

8 Conclusions and Future Directions

8.1 Summary

The paper's framework and proposed datasets improve on prior work.

8.2 Future Directions

A few future directions may be followed based on the insights from the framework and experimental results.