In-Depth Reading: A Comprehensive Survey on Knowledge Graph Entity Alignment via Representation Learning
郝伟 2021/01/12

Basic Information

@article{DBLP:journals/corr/abs-2103-15059,
  author    = {Rui Zhang and
               Bayu Distiawan Trisedya and
               Miao Li and
               Yong Jiang and
               Jianzhong Qi},
  title     = {A Comprehensive Survey on Knowledge Graph Entity Alignment via Representation Learning},
  journal   = {CoRR},
  volume    = {abs/2103.15059},
  year      = {2021},
  url       = {https://arxiv.org/abs/2103.15059},
  eprinttype = {arXiv},
  eprint    = {2103.15059},
  timestamp = {Thu, 15 Jul 2021 12:17:32 +0200},
  biburl    = {https://dblp.org/rec/journals/corr/abs-2103-15059.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}

Note: CoRR (Computing Research Repository) is a repository for computer science papers and part of the arXiv e-print service; it covers only computer science content. Like the rest of arXiv, CoRR is operated by Cornell University.

Paper download: https://arxiv.org/abs/2103.15059

Author Information

Abstract

In the last few years, the interest in knowledge bases has grown exponentially in both the research community and the industry due to their essential role in AI applications. Entity alignment is an important task for enriching knowledge bases. This paper provides a comprehensive tutorial-type survey on representative entity alignment techniques that use the new approach of representation learning. We present a framework for capturing the key characteristics of these techniques, propose two datasets to address the limitation of existing benchmark datasets, and conduct extensive experiments using the proposed datasets. The framework gives a clear picture of how the techniques work. The experiments yield important results about the empirical performance of the techniques and how various factors affect the performance. One important observation not stressed by previous work is that techniques making good use of attribute triples and relation predicates as features stand out as winners.


1 Introduction

1.1 Definitions

Knowledge Base (KB)

Knowledge bases are a technology used to store complex structured and unstructured information, typically facts or knowledge.

Knowledge Graph (KG)

A knowledge graph (KG), which is a knowledge base modeled by a graph structure or topology, is the most popular form of knowledge bases and
has almost become a synonym of knowledge base today.

Entity Alignment (EA)

One of the most important tasks for KGs is entity alignment, which aims to identify entities from different KGs that represent the same real-world entity.

1.2 Significance

Different KGs may be created via different sources and methods, so even entities representing the same real-world entity may be denoted differently in different KGs, and it is challenging to identify all such aligned entities accurately.

1.3 Traditional EA Methods

1.4 Main Contents of the Paper

The paper provides:
(i) a comprehensive tutorial-type survey to help readers understand how each technique works with little need to refer to their full papers,
(ii) an up-to-date framework with new insights and a structure that captures the latest techniques, and
(iii) analysis on how different techniques compare to each other.
In addition, the paper analyzes several datasets.

2 Preliminaries

2.1 Notation

$\mathcal{G} = (\mathcal{E}, \mathcal{R}, \mathcal{A}, \mathcal{V}, \mathcal{T})$, where:

Table 1. Commonly used notation

2.2 Formal Definition of EA

Given two KGs $\mathcal{G}_1 = (\mathcal{E}_1, \mathcal{R}_1, \mathcal{A}_1, \mathcal{V}_1, \mathcal{T}_1)$ and $\mathcal{G}_2 = (\mathcal{E}_2, \mathcal{R}_2, \mathcal{A}_2, \mathcal{V}_2, \mathcal{T}_2)$, EA aims to identify every pair of entities $(e_1, e_2)$, $e_1 \in \mathcal{E}_1$, $e_2 \in \mathcal{E}_2$, where $e_1$ and $e_2$ represent the same real-world entity (i.e., $e_1$ and $e_2$ are aligned entities).

Simply put, entity alignment finds pairs of records that live in different datasets but describe the same real-world entity.

2.3 Related Problems

Since EA matches entities across two knowledge bases, the related problems fall into the following two categories, depending on whether the bases are structured:

2.4 Traditional Techniques for EA

Summary: Traditional EA techniques, as exemplified above, usually use data mining or database approaches, typically heuristics, to identify similar entities. It is difficult for them to achieve high accuracy and to generalize.

3 Generic Framework of Embedding-based EA

Figure 1. The framework of embedding-based EA (dashed boxes indicate optional modules)

3.1 Embedding module

The embedding module learns vector representations of entities (usually low-dimensional), also called entity embeddings. It mainly comes in the following four types:

Machine learning models operate on vectors, and GNNs (among other models) are used to learn these vector representations.

Key point: The embedding module computes the embeddings of each KG separately, which makes the embeddings of $\mathcal{G}_1$ and $\mathcal{G}_2$ fall into different vector spaces.

3.2 Alignment module

Goal: The alignment module aims to unify the embeddings of the two KGs into the same vector space so that aligned entities can be identified, which is a major challenge for EA.

The most common approach is to use a set of seed alignments $\mathcal{S}$ satisfying: $\mathcal{S} = \{(e_1, e_2) \mid e_1 \in \mathcal{E}_1, e_2 \in \mathcal{E}_2, e_1 \equiv e_2\}$

Loss function based on $f_{align}$

The loss function is defined as: $$\mathcal{L} = \sum_{(e_1, e_2) \in \mathcal{S}} \sum_{(e'_1, e'_2) \in \mathcal{S}'} \max(0, [\gamma + f_{align}(e_1, e_2) - f_{align}(e'_1, e'_2)]) \tag{1}$$ where $\mathcal{S}'$ denotes negative (corrupted) seed pairs and $\gamma$ is a margin hyperparameter.
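To make Eq. (1) concrete, here is a minimal sketch of the margin-based alignment loss, assuming $f_{align}$ is the L2 distance between embeddings (the choice of $f_{align}$ varies across techniques):

```python
import numpy as np

def alignment_loss(emb1, emb2, seeds, neg_seeds, gamma=1.0):
    """Margin-based alignment loss in the spirit of Eq. (1).

    emb1, emb2: entity embedding matrices of G1 and G2 (n x d arrays).
    seeds: (i, j) index pairs of aligned seed entities.
    neg_seeds: (i, j) negative (corrupted) pairs.
    f_align is taken here to be the L2 distance (an assumption).
    """
    def f_align(i, j):
        return np.linalg.norm(emb1[i] - emb2[j])

    loss = 0.0
    for i, j in seeds:
        for i2, j2 in neg_seeds:
            # A seed pair should be closer than a negative pair by gamma.
            loss += max(0.0, gamma + f_align(i, j) - f_align(i2, j2))
    return loss
```

Each term becomes zero once a seed pair is closer than a corrupted pair by at least the margin $\gamma$.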

Commonly used measures in the loss function

Given two $n$-dimensional vectors $\mathbf{x} = (x_1, x_2, \ldots, x_n)$ and $\mathbf{y} = (y_1, y_2, \ldots, y_n)$, the following two measures are commonly used:
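As an illustration, a minimal sketch of the two measures most often used in this role, an Lp (Minkowski) distance and cosine similarity:

```python
import numpy as np

def lp_distance(x, y, p=2):
    # Minkowski (Lp) distance; p=1 is Manhattan, p=2 is Euclidean.
    return float(np.sum(np.abs(x - y) ** p) ** (1.0 / p))

def cosine_similarity(x, y):
    # Cosine of the angle between x and y; 1 means identical direction.
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))
```

Distance measures decrease as vectors get closer, while cosine similarity increases; a loss must be written consistently with whichever is chosen.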

The $\max(0, \cdot)$ function

The max function ensures that negative values do not contribute to the loss: pairs that already satisfy the margin add zero.

In summary, the input features to the alignment module may be raw information such as KG structure, relation predicates, and attributes, as well as entity/relation/attribute alignments which may be created manually or automatically.

Bootstrapping

is a common strategy when only limited seed alignments are available. The idea is that the entity/attribute/relation alignments produced by the EA inference module are fed back to the alignment module as training data, and this process may be iterated multiple times.

Note that creating seeds takes human effort, which is expensive.

Bootstrapping may help reduce human effort, but at the cost of much more computation, since training is iterated multiple times.
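The bootstrapping loop can be sketched as follows; `train` and `infer` are hypothetical stand-ins for the alignment and inference modules:

```python
def bootstrap(seeds, train, infer, rounds=3, threshold=0.9):
    """Iterative bootstrapping sketch (names are illustrative).

    train(seeds) returns a trained alignment model; infer(model,
    threshold) returns newly predicted pairs whose similarity exceeds
    `threshold`.  Confident predictions are fed back as training data.
    """
    for _ in range(rounds):
        model = train(seeds)
        new_pairs = infer(model, threshold)
        seeds = seeds | new_pairs  # augment the seed alignments
    return seeds
```

The threshold controls the usual trade-off: a low threshold grows the training set faster but risks feeding back wrong alignments that compound over rounds.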

3.3 EA Inference module

Goal

This module aims to predict whether a pair of entities from G1 and G2 are aligned.
That is, this module predicts whether two entities refer to the same real-world entity.

Typical application scenario

Given a target entity $e_1$ from $\mathcal{G}_1$, the EA inference module aims to predict an entity $e_2$ from $\mathcal{G}_2$ that is aligned to $e_1$; we may call $e_1$ ($e_2$) the aligned entity or the counterpart entity of $e_2$ ($e_1$). The aligned entity may not exist if a similarity threshold is applied.

NNS (nearest neighbour search) finds the entity $e_2$ from $\mathcal{G}_2$ that is the most similar to $e_1$ from $\mathcal{G}_1$, based on their embeddings obtained from the EA training module.

This may cause the many-to-one alignment problem, i.e., multiple entities from $\mathcal{G}_2$ are similar to $e_1$. A common solution is to impose a one-to-one constraint.
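A minimal sketch of NNS inference with a greedy one-to-one constraint, assuming Euclidean distance over the learned embeddings (real systems may instead solve an optimal assignment problem):

```python
import numpy as np

def nns_one_to_one(emb1, emb2):
    """Greedy one-to-one nearest-neighbour alignment (a sketch).

    Computes pairwise Euclidean distances between all entities of G1
    and G2, then repeatedly picks the globally closest unmatched pair,
    so no entity of G2 is aligned to more than one entity of G1.
    """
    n1, n2 = len(emb1), len(emb2)
    # Pairwise distance matrix: dist[i, j] = ||emb1[i] - emb2[j]||.
    dist = np.linalg.norm(emb1[:, None, :] - emb2[None, :, :], axis=2)
    matches = {}
    used1, used2 = set(), set()
    # Greedily take the smallest remaining distance.
    for i, j in sorted(
        ((i, j) for i in range(n1) for j in range(n2)),
        key=lambda ij: dist[ij[0], ij[1]],
    ):
        if i not in used1 and j not in used2:
            matches[i] = j
            used1.add(i)
            used2.add(j)
    return matches
```

Without the `used2` check, both entities of $\mathcal{G}_1$ could map to the same entity of $\mathcal{G}_2$, which is exactly the many-to-one problem described above.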

3.4 Summary

Table 2. Summary of common techniques

4 KG Structure Embedding Models

This section reviews the two paradigms of KG structure embedding, which is the core part of embedding-based EA techniques.

4.1 Translation-based Embedding Models

The essence of translation-based embedding models is treating a relation in KGs as a “translation” in a vector space between the head and the tail entities.

TransE [Bordes et al., 2013] is the first translation-based model, which embeds both entities and relations into a unified low-dimensional vector space.

Given a relation triple $(h, r, t)$, TransE learns embeddings $\mathbf{h}$, $\mathbf{t}$, and $\mathbf{r}$ of entities $h$ and $t$ and relation predicate $r$ such that $\mathbf{h} + \mathbf{r} \approx \mathbf{t}$. To realize this assumption in learning, a triple score function used for measuring the plausibility of a relation triple is defined as follows: $$f_{triple}(h, r, t) = \|\mathbf{h} + \mathbf{r} - \mathbf{t}\| \tag{2}$$

The training loss is then defined as: $$\mathcal{L} = \sum_{(h,r,t) \in \mathcal{T}_r} \sum_{(h',r',t') \in \mathcal{T}'_r} \max(0, [\gamma + f_{triple}(h, r, t) - f_{triple}(h', r', t')]) \tag{3}$$ where $\mathcal{T}'_r$ is the set of corrupted (negative) triples, obtained by replacing the head or tail of a true triple, and $\gamma$ is the margin.
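A minimal numeric sketch of Eqs. (2) and (3); for brevity each positive triple is paired with one corrupted triple rather than summing over all combinations:

```python
import numpy as np

def triple_score(h, r, t):
    # f_triple(h, r, t) = ||h + r - t|| (Eq. 2); lower = more plausible.
    return float(np.linalg.norm(h + r - t))

def transe_loss(pos_triples, neg_triples, ent, rel, gamma=1.0):
    """Margin loss in the spirit of Eq. (3), on index triples.

    ent, rel: embedding matrices; pos_triples/neg_triples: lists of
    (head, relation, tail) index triples, negatives being corrupted
    copies of the positives (head or tail replaced).
    """
    loss = 0.0
    for (h, r, t), (h2, r2, t2) in zip(pos_triples, neg_triples):
        pos = triple_score(ent[h], rel[r], ent[t])
        neg = triple_score(ent[h2], rel[r2], ent[t2])
        # True triples should score lower than corrupted ones by gamma.
        loss += max(0.0, gamma + pos - neg)
    return loss
```

Training then minimizes this loss by gradient descent over `ent` and `rel`, pushing $\mathbf{h} + \mathbf{r}$ toward $\mathbf{t}$ for true triples and away for corrupted ones.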

4.2 GNN-based Embedding Models

GNNs [Wu et al., 2021] have yielded strong performance on graph data analysis and gained immense popularity. Two common variants:

Key characteristic: GNN-based models focus on aggregating information from the neighborhood of entities together with the graph structure to compute entity embeddings.

Graph Convolutional Networks (GCNs)

GCNs compute a target node's embedding as a low-dimensional vector by aggregating the features of its neighbors and of the node itself, following the rules of message passing in graphs.
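This message-passing rule can be sketched as a single GCN layer in the style of Kipf and Welling, with self-loops and symmetric normalisation (an illustrative sketch, not the exact layer of any specific EA technique):

```python
import numpy as np

def gcn_layer(A, H, W):
    """One GCN layer: ReLU(D^{-1/2} (A + I) D^{-1/2} H W).

    A: adjacency matrix (n x n), H: node features (n x d),
    W: learnable weight matrix (d x d').
    """
    n = A.shape[0]
    A_hat = A + np.eye(n)                    # self-loops: a node sees itself
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))   # symmetric degree normalisation
    return np.maximum(0.0, D_inv_sqrt @ A_hat @ D_inv_sqrt @ H @ W)
```

Stacking several such layers lets a node's embedding absorb information from multi-hop neighbourhoods.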

Graph Attention Networks (GAT)

GATs aggregate information from the neighborhood with the attention mechanism [Vaswani et al., 2017], allowing the model to focus on the most relevant neighbors.

5 Translation-based EA Techniques

These techniques are all based on TransE [Bordes et al., 2013] or its variants, which encode KG structure via relation triples, paths, or neighborhoods.

5.1 Techniques that Only Use KG Structure

5.2 Techniques that Exploit Relation Predicates and Attributes

6 GNN-based EA Techniques

GNNs suit KGs' inherent graph structure, and hence a growing number of EA techniques in the last couple of years are based on GNNs.

6.1 GCN-based EA Techniques

6.2 GAT-based EA Techniques

KECG [Li et al., 2019]
AliNet [Sun et al., 2020a]
MRAEA [Mao et al., 2020]
EPEA [Wang et al., 2020]
AttrGNN [Liu et al., 2020]

7 Datasets and Experimental Studies

Main contents:

7.1 Limitations of Existing Datasets

Table 3. Comparison of related datasets

7.2 Limitations of Existing Experimental Studies

This subsection reviews several issues with prior experimental studies, such as inconsistent experimental configurations, dataset limitations, and insufficient comparisons, and notes that this paper's study is more thorough.

7.3 Paper Proposed Datasets: DWY-NB

The remainder of the paper demonstrates, through a series of experiments, that the proposed datasets are well designed, addressing problems such as bidirectional alignment, insufficient name diversity, and overly small sample sets. Since this material is very specific, it is not covered further here.

8 Conclusions and Future Directions

8.1 Summary

The paper's framework and proposed datasets improve on prior work.

8.2 Future Directions

A few future directions may be followed based on the insights from the framework and experimental results.