Importing 2 Million Relationships
郝伟 2021/02/19
After completing the 500k-node insertion test, I found that inserting relationships differs from inserting nodes in two main respects:
The second point is especially important: earlier tests showed that matching nodes with a WHERE clause takes a lot of time, so it deserves careful attention. I therefore started with a test of indexes.
The tests below show that an index substantially speeds up node lookup.
// Find the person named 郑西 among the 322,750 nodes
MATCH (n:Person {name: "郑西"}) RETURN n LIMIT 25
// Before the index: Started streaming 2 records after 1 ms and completed after 461 ms.
// After the index:  Started streaming 2 records after 2 ms and completed after 2 ms.

// Find the person named 杨频珍 among the 322,750 nodes
MATCH (n:Person {name: "杨频珍"}) RETURN n LIMIT 25
// Before: Started streaming 1 records after 2 ms and completed after 426 ms.
// After:  Started streaming 1 records after 2 ms and completed after 3 ms.

// Find the person named 田轻偷 among the 322,750 nodes
MATCH (n:Person {name: "田轻偷"}) RETURN n LIMIT 25
// Before: Started streaming 1 records after 2 ms and completed after 421 ms.
// After:  Started streaming 1 records after 2 ms and completed after 3 ms.

// PS. Building the index over the 323,000 nodes took only 6 ms, i.e. more than 50,000 nodes per millisecond on average.
// Added 1 index, completed after 6 ms.
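The statement that created the index is not shown above; presumably it was a single-property index on name, along the lines of the following (Neo4j 3.x syntax, an assumption on my part):

```cypher
// Assumed index used in the before/after comparison above
CREATE INDEX ON :Person(name)
```

The later tests that match on a.id would likewise need a corresponding index on :Person(id).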
MATCH (a:Person), (b:Person)
WHERE a.name = "谢维娅" AND b.name = "方白生"
CREATE (a)-[f:Friends]->(b) RETURN a, b
// Before the index: Created 1 relationship, started streaming 1 records after 1078 ms and completed after 1078 ms.
// After the index:  Created 1 relationship, started streaming 1 records after 195 ms and completed after 196 ms.

MATCH (a:Person), (b:Person)
WHERE ID(a) = 159234 AND ID(b) = 326213
CREATE (a)-[f:Friends]->(b) RETURN a, b
// Created 1 relationship, started streaming 1 records after 4 ms and completed after 4 ms.

MATCH (a:Person), (b:Person)
WHERE a.id = 159235 AND b.id = 326213
CREATE (a)-[f:Friends]->(b) RETURN a, b
// Before the index: Created 1 relationship, started streaming 1 records after 815 ms and completed after 815 ms.
// After the index:  Created 1 relationship, started streaming 1 records after 2 ms and completed after 2 ms.
MATCH (a:Person), (b:Person) WHERE a.id = 159235 AND b.id = 3262141 CREATE (a)-[:Friends]->(b)
The tests above lead to an important conclusion: matching nodes by a property value without an index costs hundreds of milliseconds to over a second per statement, while matching through an index on that property, or by a node's internal ID(), takes only a few milliseconds. The relationship insertions that follow therefore also use the index for acceleration. The experiments additionally showed that neo4j needs a fair amount of memory: at least 2 GB to start with, and the more the better.
The core is a CQL-generating function, string get_cql(string name, int count, int max), which produces a statement in the required format for expressing the graph's nodes and relationships. Here name is the relationship type that appears inside the square brackets of the output, count determines the number of nodes, and each id value is random but less than max. The generated relationships form a chain, node1 --edge1--> node2 --edge2--> node3 ... node(n) --edge(n)--> node(n+1), i.e. n+1 nodes are used to generate n relationships.
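As a quick sketch of that generator, assuming the signature described above (the full script, with batching, appears at the end of this note):

```python
import random

# Hypothetical re-implementation matching the description above:
#   name   -> relationship type inside the square brackets
#   count  -> number of nodes in the chain (count nodes yield count-1 relationships)
#   max_id -> node ids are random values no larger than this bound
def get_cql(name, count, max_id):
    nodes = ['n{0}'.format(i) for i in range(count)]
    match = ', '.join('({0}:Person)'.format(n) for n in nodes)
    where = ' AND '.join('{0}.id={1}'.format(n, random.randint(1, max_id)) for n in nodes)
    create = ('-[:{0}]->'.format(name)).join('({0})'.format(n) for n in nodes)
    return 'MATCH {0} WHERE {1} CREATE {2}'.format(match, where, create)

print(get_cql('Friends', 3, 500000))
```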
Here are the results of running it with different parameters:
get_cql('Friends', 2, 500000) yields:
MATCH (n0:Person), (n1:Person) WHERE n0.id=471787 AND n1.id=103795 CREATE (n0)-[:Friends]->(n1)
get_cql('Friends', 3, 500000) yields:
MATCH (n0:Person), (n1:Person), (n2:Person) WHERE n0.id=400454 AND n1.id=228240 AND n2.id=19799 CREATE (n0)-[:Friends]->(n1)-[:Friends]->(n2)
get_cql('Coworkers', 4, 500000) yields:
MATCH (n0:Person), (n1:Person), (n2:Person), (n3:Person) WHERE n0.id=115219 AND n1.id=451118 AND n2.id=402292 AND n3.id=25697 CREATE (n0)-[:Coworkers]->(n1)-[:Coworkers]->(n2)-[:Coworkers]->(n3)
get_cql('Relatives', 11, 500000) yields:
MATCH (n0:Person), (n1:Person), (n2:Person), (n3:Person), (n4:Person), (n5:Person), (n6:Person), (n7:Person), (n8:Person), (n9:Person), (n10:Person) WHERE n0.id=482246 AND n1.id=309116 AND n2.id=173604 AND n3.id=317509 AND n4.id=478207 AND n5.id=233326 AND n6.id=203783 AND n7.id=187576 AND n8.id=265593 AND n9.id=208146 AND n10.id=69812 CREATE (n0)-[:Relatives]->(n1)-[:Relatives]->(n2)-[:Relatives]->(n3)-[:Relatives]->(n4)-[:Relatives]->(n5)-[:Relatives]->(n6)-[:Relatives]->(n7)-[:Relatives]->(n8)-[:Relatives]->(n9)-[:Relatives]->(n10)
This experiment inserted 500,000 Friends relationships. After the memory was increased it went very smoothly and passed on the first run; here is the log file.
The run started at 15:05:19 and finished at 15:19:44, taking only 14 minutes 25 seconds, for an average insertion speed of 578 edges/s. The speed curve is shown below:
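As a sanity check on that figure, the duration and average speed follow directly from the two timestamps:

```python
# 15:05:19 to 15:19:44, converted to seconds since midnight
start = 15 * 3600 + 5 * 60 + 19
end = 15 * 3600 + 19 * 60 + 44
duration = end - start         # 865 s = 14 min 25 s
speed = 500000 / duration      # 500k relationships over the whole run
print(duration, round(speed))  # 865 578
```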
After this experiment was finished, three more relationship types were added: 'Coworkers', 'Relatives' and 'Families', 500,000 of each. The whole process used the same code and completed fairly smoothly, so the details are not repeated here.
# coding: utf-8
# 1. Randomly generate 500k nodes, all labeled Person, with name, sex and age properties.
# 2. Randomly generate 2 million relationships of 4 kinds: Friends, Coworkers, Relatives, Families.
from neo4j import GraphDatabase
import random
import time

# id, name, sex(m/f), age(1-100)
# with open('data.txt', 'w', encoding='utf-8') as f:
#     f.write('{0},{1},{2},{3}\n'.format(id, name, sex, age))

def create_nodes(tx, cql):
    return tx.run(cql)

print('Connecting 121.196.157.14:7687 ...')
driver = GraphDatabase.driver('bolt://121.196.157.14:7687', auth=('neo4j', 'hd7iu2_X@v4u'))

# Example of a generated statement:
# MATCH (n0:Person), (n1:Person), (n2:Person)
# WHERE n0.id = 159235 AND n1.id = 326214 AND n2.id = 13579
# CREATE (n0)-[:Friends]->(n1)-[:Friends]->(n2)
def get_cql(edge_name, batch_count=51):  # 51 chained nodes yield 50 relationships
    cql = 'MATCH '
    cql += ', '.join(['(n{0}:Person)'.format(i) for i in range(batch_count)])
    cql += ' WHERE '
    cql += ' AND '.join(['n{0}.id={1}'.format(i, random.randint(1, maxsize)) for i in range(batch_count)])
    cql += ' CREATE '
    cql += ('-[:' + edge_name + ']->').join(['(n{0})'.format(i) for i in range(batch_count)])
    return cql

# Run 10,000 batches per relationship type, 50 relationships per batch: 500k per type.
maxsize = 500000
total_times = 10000
batch_count = 50
f = open('log.txt', 'w')  # log file
print('Connected.')
try:
    with driver.session() as session:
        print("Begin transactions...")
        for edge_name in ['Friends', 'Coworkers', 'Relatives', 'Families']:
            print('Begin processing ' + edge_name + ' ...')
            for runTimes in range(total_times):
                cql = get_cql(edge_name)
                start_time = time.time()
                session.write_transaction(create_nodes, cql)
                end_time = time.time()
                total_time = end_time - start_time  # time for this batch
                speed = batch_count / total_time    # insertion speed
                info = time.strftime("%Y/%m/%d %H:%M:%S", time.localtime()) + ' {0:>5d}: {1:6.3f}s, {2:6.2f} edges/s.'.format(runTimes, total_time, speed)
                print(info)
                f.write(info + "\n")
except:
    pass
f.close()
print('\ndone')
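The speed curve shown earlier can be reproduced by parsing log.txt. A minimal sketch, assuming the exact line format written by the script above ('YYYY/MM/DD HH:MM:SS  batch: Xs, Y edges/s.'):

```python
import re

# One log line per batch, as written by the script above
LINE_RE = re.compile(r'(\d+):\s+([\d.]+)s,\s+([\d.]+) edges/s\.')

def parse_log(lines):
    """Return (batch_index, seconds, edges_per_second) tuples from log lines."""
    points = []
    for line in lines:
        m = LINE_RE.search(line)
        if m:
            points.append((int(m.group(1)), float(m.group(2)), float(m.group(3))))
    return points

sample = ['2021/02/19 15:05:19     0:  0.095s, 526.32 edges/s.']
print(parse_log(sample))  # → [(0, 0.095, 526.32)]
```

The resulting (batch, speed) pairs can then be fed to any plotting library to redraw the curve.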