Paper close reading: A Comprehensive Survey on Knowledge Graph Entity Alignment via Representation Learning
郝伟 2021/01/12
@article{DBLP:journals/corr/abs-2103-15059,
author = {Rui Zhang and
Bayu Distiawan Trisedya and
Miao Li and
Yong Jiang and
Jianzhong Qi},
title = {A Comprehensive Survey on Knowledge Graph Entity Alignment via Representation Learning},
journal = {CoRR},
volume = {abs/2103.15059},
year = {2021},
url = {https://arxiv.org/abs/2103.15059},
eprinttype = {arXiv},
eprint = {2103.15059},
timestamp = {Thu, 15 Jul 2021 12:17:32 +0200},
biburl = {https://dblp.org/rec/journals/corr/abs-2103-15059.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}
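Note: CoRR (Computing Research Repository) is the repository for computer science papers. It is part of the arXiv e-print service and covers computer science only. Like the rest of arXiv, CoRR is operated by Cornell University. Reference: https://arxiv.org/archive/cs
Paper download: https://arxiv.org/abs/2103.15059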
Rui Zhang, Tsinghua University, URL: https://ruizhang.info/, E-mail: rui.zhang@ieee.org
General Information
Dr Rui Zhang is an internationally leading researcher in the area of big data, data mining and machine learning. He is a Visiting Professor at Tsinghua University and was a Professor at the School of Computing and Information Systems of the University of Melbourne. Dr Zhang has won several awards including the prestigious Future Fellowship by the Australian Research Council in 2012, Chris Wallace Award for Outstanding Research by the Computing Research and Education Association of Australasia (CORE) in 2015, and Google Faculty Research Award in 2017. His inventions have been adopted by major IT companies such as Microsoft, Amazon and AT&T. Dr Rui Zhang obtained his Bachelor’s degree from Tsinghua University in 2001, PhD from National University of Singapore in 2006, and has then started as a faculty member in The University of Melbourne since 2007. Before joining the University of Melbourne, he has been a visiting research scientist at AT&T labs-research in New Jersey and at Microsoft Research in Redmond, Washington. He has also been a regular visiting researcher at Microsoft Research Asia in Beijing. Dr Zhang’s research interests include big data and AI, particularly in areas of recommendation systems, knowledge bases, chatbot, spatial and temporal data analytics, moving object management and data streams.
Bayu Distiawan Trisedya, The University of Melbourne
E-mail: bayu.trisedya@unimelb.edu.au
Miao Li
The University of Melbourne
E-mail: miao4@student.unimelb.edu.au
Yong Jiang
Tsinghua University
E-mail: jiangy@mail.sz.tsinghua.edu.cn
Jianzhong Qi
The University of Melbourne
E-mail: jianzhong.qi@unimelb.edu.au
Abstract In the last few years, the interest in knowledge bases has grown exponentially in both the research community and the industry due to their essential role in AI applications. Entity alignment is an important task for enriching knowledge bases. This paper provides a comprehensive tutorial-type survey on representative entity alignment techniques that use the new approach of representation learning. We present a framework for capturing the key characteristics of these techniques, propose two datasets to address the limitation of existing benchmark datasets, and conduct extensive experiments using the proposed datasets. The framework gives a clear picture of how the techniques work. The experiments yield important results about the empirical performance of the techniques and how various factors affect the performance. One important observation not stressed by previous work is that techniques making good use of attribute triples and relation predicates as features stand out as winners.
Knowledge bases are a technology used to store complex structured and unstructured information, typically facts or knowledge.
A knowledge graph (KG), which is a knowledge base modeled by a graph structure or topology, is the most popular form of knowledge bases and
has almost become a synonym of knowledge base today.
One of the most important tasks for KGs is entity alignment, which aims to identify entities from different KGs that represent the same real-world entity.
Different KGs may be created via different sources and methods, so even entities representing the same real-world entity may be denoted differently in different KGs, and it is challenging to identify all such aligned entities accurately.
Paper content:
(i) a comprehensive tutorial-type survey to help readers understand how each technique works with little need to refer to their full papers,
(ii) an up-to-date framework with new insights and a structure that captures the latest techniques, and
(iii) analysis on how different techniques compare to each other.
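In addition, the paper analyzes several datasets.
Notation:
Table 1: Summary of commonly used notation
Given two KGs G1 and G2, EA aims to identify every pair of entities (e1, e2), e1 ∈ G1, e2 ∈ G2, where e1 and e2 represent the same real-world entity (i.e., e1 and e2 are aligned entities).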
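Put simply, entity alignment finds two records that sit in different datasets but describe the same entity.
Because EA compares entries from two knowledge bases, the problem falls into the following two categories, depending on whether the bases are structured: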
Summary: Traditional EA techniques, as exemplified above, usually use data mining or database approaches, typically heuristics, to identify similar entities. It is difficult for them to achieve high accuracy and to generalize.
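Figure 1: Framework of embedding-based EA (dashed lines denote optional modules)
The embedding module learns vector representations of entities (usually low-dimensional), also called entity embeddings. It mainly covers the following four types:
Machine learning models operate on vectors, and GNNs are one way to learn these vectors.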
Main goal: The embedding module computes the embeddings of each KG separately, which makes the embeddings of G1 and G2 fall into different vector spaces.
Main goal: The alignment module aims to unify the embeddings of the two KGs into the same vector space so that aligned entities can be identified, which is a major challenge for EA.
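The most common approach uses a set of seed alignments, i.e., entity pairs already known to be aligned, which satisfies the following condition:
A loss function over the seeds serves as the training objective, where the distance between two embeddings is described next.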
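Given two n-dimensional vectors x = (x_1, ..., x_n) and y = (y_1, ..., y_n), the distance between them is computed in one of the following two ways:
Manhattan distance
The sum of the absolute differences over all dimensions:
$$d_{L1}(x, y) = \sum_{i=1}^{n} |x_i - y_i|$$
Euclidean distance
The square root of the sum of the squared differences over all dimensions:
$$d_{L2}(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$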
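The two distance measures can be sketched in a few lines of Python. This is a minimal illustration that assumes vectors are plain lists of floats; the function names `manhattan` and `euclidean` are chosen here for illustration and do not come from the paper.

```python
import math

def manhattan(x, y):
    """L1 distance: sum of absolute per-dimension differences."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same dimension")
    return sum(abs(a - b) for a, b in zip(x, y))

def euclidean(x, y):
    """L2 distance: square root of the sum of squared differences."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same dimension")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

x = [1.0, 2.0, 3.0]
y = [4.0, 6.0, 3.0]
print(manhattan(x, y))  # 7.0  (3 + 4 + 0)
print(euclidean(x, y))  # 5.0  (sqrt(9 + 16 + 0))
```

Both functions return 0 for identical vectors, so a smaller value means the two embeddings are more similar.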
The max function ensures that negative values do not contribute to the loss.
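The role of max can be illustrated with a margin-based loss, the form commonly used when training on seed alignments. The sketch below is an assumption-laden illustration, not the paper's exact formulation: the margin value, the L1 distance, and the way negative (non-aligned) pairs are supplied are all choices made here.

```python
def l1(x, y):
    """Manhattan distance between two equal-length vectors."""
    return sum(abs(a - b) for a, b in zip(x, y))

def margin_alignment_loss(pos_pairs, neg_pairs, margin=1.0, dist=l1):
    """Sum of max(0, margin + d(positive) - d(negative)) over all combinations.

    pos_pairs: embedding pairs of seed (aligned) entities.
    neg_pairs: embedding pairs of entities assumed NOT to be aligned.
    """
    total = 0.0
    for p1, p2 in pos_pairs:
        for n1, n2 in neg_pairs:
            # max(0, ...) drops terms where the negative pair is already
            # farther apart than the positive pair by at least the margin.
            total += max(0.0, margin + dist(p1, p2) - dist(n1, n2))
    return total

pos = [([0.0, 0.0], [0.5, 0.0])]   # distance 0.5
neg = [([0.0, 0.0], [3.0, 0.0]),   # distance 3.0 -> 1.0 + 0.5 - 3.0 < 0, clipped to 0
       ([0.0, 0.0], [1.0, 0.0])]   # distance 1.0 -> 1.0 + 0.5 - 1.0 = 0.5
print(margin_alignment_loss(pos, neg))  # 0.5
```

Only the second negative pair contributes, because it is not yet separated from the positive pair by the full margin.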
In summary, the input features to the alignment module may be raw information such as KG structure, relation predicates, and attributes, as well as entity/relation/attribute alignments which may be created manually or automatically.
Bootstrapping is a common strategy when limited seed alignments are available. The idea is that the aligned entities, attributes, and relations produced by the EA inference module are fed back to the alignment module as training data, and this process may be iterated multiple times.
Note that creating seeds takes human effort, which is expensive.
Bootstrapping may help reduce human effort but is at the cost of much more computation since it iterates training multiple times.
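The bootstrapping loop can be sketched as follows. Everything here is hypothetical scaffolding: `train` and `infer` stand in for the EA training and inference modules, the confidence threshold is an assumed design choice, and the toy functions exist only so that the loop can run.

```python
def bootstrap(seeds, train, infer, rounds=3, threshold=0.9):
    """Iteratively enlarge the training alignments with confident predictions."""
    alignments = set(seeds)
    for _ in range(rounds):
        model = train(alignments)                    # retrain on current alignments
        new_pairs = {(e1, e2) for e1, e2, score in infer(model)
                     if score >= threshold}          # keep confident predictions only
        if new_pairs <= alignments:                  # nothing new: stop early
            break
        alignments |= new_pairs
    return alignments

# Toy stand-ins (assumptions for illustration only).
def toy_train(alignments):
    return len(alignments)  # the "model" is just the amount of training data

def toy_infer(model):
    candidates = [("a2", "b2", 0.95), ("a3", "b3", 0.95), ("a4", "b4", 0.5)]
    return candidates[:model]  # more training data -> more candidates scored

result = bootstrap({("a1", "b1")}, toy_train, toy_infer)
print(sorted(result))  # [('a1', 'b1'), ('a2', 'b2'), ('a3', 'b3')]
```

Each round retrains from scratch on the enlarged set, which is where the extra computation mentioned above comes from.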
This module aims to predict whether a pair of entities from G1 and G2 are aligned.
That is, it predicts whether two entities are the same.
Given a target entity e1 from G1, the EA inference module aims to predict an entity e2 from G2 that is aligned to e1; we may call e1 (e2) the aligned entity or the counterpart entity of e2 (e1). The aligned entity may not exist if a similarity threshold is applied.
Nearest neighbor search (NNS) finds the entity e2 from G2 that is the most similar to e1 from G1 based on their embeddings obtained from the EA training module.
NNS may lead to the many-to-one problem, i.e., several entities from G1 are most similar to the same entity from G2. The remedy is to impose a one-to-one constraint.
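A small sketch can contrast plain nearest neighbor search with a one-to-one constraint. The greedy matching below is only one simple way to enforce the constraint and is an assumption made here, not necessarily the strategy used by the surveyed techniques. The entity names and embeddings are made up.

```python
def l1(x, y):
    """Manhattan distance between two equal-length vectors."""
    return sum(abs(a - b) for a, b in zip(x, y))

def nearest_neighbor(src, tgt):
    """For each source entity, pick the closest target entity (may be many-to-one)."""
    return {s: min(tgt, key=lambda t: l1(vec, tgt[t])) for s, vec in src.items()}

def greedy_one_to_one(src, tgt):
    """Assign pairs by increasing distance, using each entity at most once."""
    pairs = sorted((l1(sv, tv), s, t)
                   for s, sv in src.items() for t, tv in tgt.items())
    used_src, used_tgt, result = set(), set(), {}
    for _, s, t in pairs:
        if s not in used_src and t not in used_tgt:
            result[s] = t
            used_src.add(s)
            used_tgt.add(t)
    return result

g1 = {"e1": [0.0, 0.0], "e2": [3.0, 0.0]}
g2 = {"f1": [1.0, 0.0], "f2": [10.0, 0.0]}

print(nearest_neighbor(g1, g2))   # {'e1': 'f1', 'e2': 'f1'}  (many-to-one)
print(greedy_one_to_one(g1, g2))  # {'e1': 'f1', 'e2': 'f2'}
```

With plain NNS both source entities claim `f1`; the greedy variant gives `f1` to the closer entity `e1` and leaves `e2` with `f2`.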
Table 2: Summary of common methods and entities
This section reviews the two paradigms of KG structure embedding, which is the core part of embedding-based EA techniques.
The essence of translation-based embedding models is treating a relation in KGs as a “translation” in a vector space between the head and the tail entities.
TransE [Bordes et al., 2013] is the first translation-based model, which embeds both entities and relations into a unified low-dimensional vector space.
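Given a relation triple (h, r, t), TransE learns embeddings h, r, and t of entities h and t and relation predicate r such that h + r ≈ t. To realize this assumption in learning, a triple score function used for measuring the plausibility of a relation triple is defined as follows:
$$f_{triple}(h, r, t) = \|\mathbf{h} + \mathbf{r} - \mathbf{t}\|$$
The loss function is then defined on top of this score function.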
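The translation idea and its margin-based training objective can be sketched as follows. This is a minimal illustration under stated assumptions: the L1 norm is used for the score (TransE can also use L2), the margin is an arbitrary value, and the corrupted triple is hand-built by replacing the tail entity.

```python
def triple_score(h, r, t):
    """TransE score ||h + r - t|| with the L1 norm; lower means more plausible."""
    return sum(abs(hi + ri - ti) for hi, ri, ti in zip(h, r, t))

def transe_margin_loss(pos_triples, neg_triples, margin=1.0):
    """Sum of max(0, margin + score(positive) - score(negative))."""
    return sum(max(0.0, margin + triple_score(*p) - triple_score(*n))
               for p in pos_triples for n in neg_triples)

h = [1.0, 0.0]
r = [0.0, 1.0]
t = [1.0, 1.0]          # h + r equals t exactly
t_bad = [3.0, 1.0]      # corrupted tail entity

print(triple_score(h, r, t))      # 0.0
print(triple_score(h, r, t_bad))  # 2.0
print(transe_margin_loss([(h, r, t)], [(h, r, t_bad)], margin=1.0))  # 0.0
print(transe_margin_loss([(h, r, t)], [(h, r, t_bad)], margin=3.0))  # 1.0
```

With margin 1.0 the corrupted triple already scores worse by more than the margin, so the loss is zero; raising the margin to 3.0 makes the same pair contribute again.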
GNNs [Wu et al., 2021] have yielded strong performance on graph data analysis and gained immense popularity. Two common models are graph convolutional networks (GCNs) and graph attention networks (GATs).
Key characteristic: GNN-based models focus on aggregating information from the neighborhood of entities together with the graph structure to compute entity embeddings.
GCNs compute a target node’s embeddings as a low-dimensional vector (i.e., embedding) by aggregating the features of its neighbors in addition to
itself, following the rules of message passing in graphs.
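The neighbor aggregation step can be sketched in plain Python. This is a deliberately simplified layer: it averages the features of a node and its neighbors, applies a weight matrix, and then a ReLU. Real GCN implementations typically use a normalized adjacency matrix and learned weights, so the mean aggregation, the hand-picked `weight`, and the toy graph are all assumptions made for illustration.

```python
def gcn_layer(features, neighbors, weight):
    """One simplified GCN layer: mean over self + neighbors, linear map, ReLU.

    features:  dict node -> list of d_in floats
    neighbors: dict node -> list of neighbor node names
    weight:    d_in x d_out matrix as a list of rows
    """
    out = {}
    for node, feat in features.items():
        group = [feat] + [features[n] for n in neighbors.get(node, [])]
        mean = [sum(col) / len(group) for col in zip(*group)]
        # Row vector (1 x d_in) times weight (d_in x d_out).
        transformed = [sum(m * w for m, w in zip(mean, col)) for col in zip(*weight)]
        out[node] = [max(0.0, v) for v in transformed]  # ReLU
    return out

features = {"a": [3.0, 0.0], "b": [0.0, 3.0], "c": [3.0, 3.0]}
neighbors = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}   # path graph a - b - c
weight = [[1.0, -1.0],
          [0.5, 1.0]]

print(gcn_layer(features, neighbors, weight))
# {'a': [2.25, 0.0], 'b': [3.0, 0.0], 'c': [3.0, 1.5]}
```

Stacking several such layers lets information from multi-hop neighbors reach the target node.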
GATs aggregate information from the neighborhood with the attention mechanism [Vaswani et al., 2017] and allow for focusing on the most relevant neighbors.
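Attention-weighted aggregation can be sketched as follows. Note the simplification: the score here is a plain dot product between the target and each neighbor, whereas an actual GAT computes scores with learned parameters. The function name and inputs are illustrative assumptions.

```python
import math

def attention_aggregate(target, neighbor_feats):
    """Weight neighbors by a softmax over dot-product scores, then sum them."""
    scores = [sum(a * b for a, b in zip(target, n)) for n in neighbor_feats]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]   # subtract max for stability
    total = sum(exps)
    weights = [e / total for e in exps]
    aggregated = [sum(w * n[i] for w, n in zip(weights, neighbor_feats))
                  for i in range(len(target))]
    return weights, aggregated

target = [1.0, 0.0]
neighbor_feats = [[2.0, 0.0],   # points the same way as the target
                  [0.0, 2.0]]   # orthogonal to the target

weights, aggregated = attention_aggregate(target, neighbor_feats)
print(weights)      # first weight is larger: the similar neighbor dominates
print(aggregated)
```

The weights always sum to 1, and neighbors with equal scores receive equal weight, which reduces the step to plain averaging.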
These techniques are all based on TransE [Bordes et al., 2013] or its variants, which encode KG structure by relation triples, paths, or neighborhoods.
GNNs suit KGs’ inherent graph structure and hence there is a trend of growing numbers of EA techniques based on GNNs in the last couple of years.
KECG [Li et al., 2019]
AliNet [Sun et al., 2020a]
MRAEA [Mao et al., 2020]
EPEA [Wang et al., 2020]
AttrGNN [Liu et al., 2020]
Main content:
Table 3: Comparison of related datasets

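The paper describes several problems in related studies, such as configuration issues, dataset limitations, and insufficient comparisons, and states that its own study is more thorough.
The rest of the paper builds on these experiments to show that the proposed datasets are sound and that they address problems such as bijective alignments, insufficiently diverse entity names, and overly small sample sets. A series of experiments is run to verify this. Because this material is very specific and not about the field in general, it is not covered further here.
In short, the paper argues that its models and datasets are better.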
A few future directions may be followed based on the insights from the framework and experimental results.
In terms of further enriching the benchmark, more experiment settings can be explored, such as varying the proportion of unaligned entities, the proportion of entities with the same name (i.e., the proportion of the “tricky” feature), and the ratio between relation triples and attribute triples.
In terms of developing new EA techniques, unsupervised approaches that make use of more types of information in interesting ways are promising, given that seed alignments are expensive.
Embedding entities into other vector spaces (e.g., the manifold space and the complex vector space) is also an interesting direction.