一种文件平均分配方法及Python验证
郝伟 2022/06/01
在大数处理中,往往会有海量的文件数据需要处理。通用会将这些文件分配给多台机器进行并行处理。在分配时,往往采用随机的方式进行文件分配,以保证文件容量分配的均匀性。但是这种随机的方式,在文件大小倾斜(即有的部分文件相对其他文件特别大)时难以保证数据能够较好地均匀进行分配。
本发明提出了一种文件大小均衡的分配方式,能够将文件以尽可能平均的方式分配出去。经实践验证,本方法具有分配非常均匀,计算性能高,实时性好等优点,能够对文件分配进行显著地优化。
本算法的核心思路是采取劣势者优先分配的原则,即每次由所有分组中,拥有文件总容量最小的组,从当前剩余文件中选择1个最合适的文件。然后不断循环,直到所有文件都分配完成。
1.初始共m个文件作为待选文件,每个文件记为fm,其大小为size(fm);
2.建立n个文件列表表示n个分组,第i组文件记为groupi,其所有的文件大小之和为sum(group_i);
3.对m个文件进行n个分组后,每个分组的平均容量记为:
4.每次让n个分组中的groupx (x=1,…,n),从待选文件中选择1个合适的文件fy,放到自己的列表中,其中
(1)groupx选择的依据是x的文件列表中的文件大小之和相对其他分组最小,即 sum(groupx) < sum(goup!x);
(2)文件fy选择的依据是满足以下条件:
size(fy) = min({ abs(sum(groupx) + size(fm) - average)})
5.进行m次遍历,直到所有文件都分配完成。
在以下示例中,给定输入 [100, 100, 70, 60, 90, 30, 110, 2, 3, 4, 5, 6, 7, 8, 9, 10],分组数从1到6,可以看出分配后,虽然每组的数量不同,但是总和(即total) 非常接近。
================ data.size = 16 Groups = 1 =================
group 0: total=614, size=16 [110, 100, 100, 90, 70, 60, 30, 10, 9, 8]
================ data.size = 16 Groups = 2 =================
group 0: total=306, size=8 [110, 90, 70, 10, 9, 8, 5, 4]
group 1: total=308, size=8 [100, 100, 60, 30, 7, 6, 3, 2]
================ data.size = 16 Groups = 3 =================
group 0: total=204, size=4 [110, 60, 30, 4]
group 1: total=205, size=7 [100, 70, 10, 9, 8, 5, 3]
group 2: total=205, size=5 [100, 90, 7, 6, 2]
================ data.size = 16 Groups = 4 =================
group 0: total=146, size=6 [110, 10, 9, 8, 5, 4]
group 1: total=148, size=6 [100, 30, 7, 6, 3, 2]
group 2: total=170, size=2 [100, 70]
group 3: total=150, size=2 [90, 60]
================ data.size = 16 Groups = 5 =================
group 0: total=121, size=3 [110, 6, 5]
group 1: total=122, size=5 [100, 9, 8, 3, 2]
group 2: total=121, size=4 [100, 10, 7, 4]
group 3: total=120, size=2 [90, 30]
group 4: total=130, size=2 [70, 60]
================ data.size = 16 Groups = 6 =================
group 0: total=100, size=1 [100]
group 1: total=100, size=1 [100]
group 2: total=110, size=1 [110]
group 3: total=101, size=3 [90, 6, 5]
group 4: total=102, size=6 [70, 10, 9, 8, 3, 2]
group 5: total=101, size=4 [60, 30, 7, 4]
在实际测试中,我们使用了720个总容量为7.18TB的文件分为10组,原始数据见文件尾(仅显示文件大小,单位为字节).
#################################################################### # Date: 2022/06/01 # Author: Wei HAO # Description: this program is to show an algorithm that divides # a number of values into 5 subgroups with their accumulation of # divided values as same as possible, such as: # [Input values] # [110, 100, 100, 90, 70, 60, 30, 10, 9, 8, 7, 6, 5, 4, 3, 2] # [Divided sub groups] # group 0: total=121, size=3 [110, 6, 5] # group 1: total=122, size=5 [100, 9, 8, 3, 2] # group 2: total=121, size=4 [100, 10, 7, 4] # group 3: total=120, size=2 [90, 30] # group 4: total=130, size=2 [70, 60] #################################################################### class BalancedGroup: ''' Represents a group of values ''' average = 0 def __init__(self,group_id): self.id = group_id self.total = 0 self.values = [] def to_str(self): ''' to a string in format: "group 0: total=121, size=3 [110, 6, 5]" ''' return 'group {0}: total={1}, size={2} {3}'.format(self.id, self.total, len(self.values), self.values[:10]) def addValue(self, v): ''' appends v to values and accumulate v to total ''' self.values.append(v) self.total += v def select(self, values): ''' Select a value v from values, where v satisfies the following condition: v = min({ abs(bg.total + vs - BalancedGroup.average) | vs in values, bg in bgs}) ''' min = 1E10 pos = 0 for i in range(len(values)): offset = abs(self.total + values[i] - BalancedGroup.average) if offset < min: pos = i min = offset v = values[pos] self.values.append(values[pos]) self.total += values[pos] del(values[pos]) def divide(vs, group_count): data = [i for i in vs] print(' data.size = {} Groups = {} '.format(len(data), group_count).center(60, "=")) # input values in descending order data.sort(reverse = True) # print(data) # assign 5 groups for demostrating bgs = [BalancedGroup(i) for i in range(group_count)] BalancedGroup.average = sum(data) / len(bgs) for _ in range(len(data)): # select a number for data and add it to the group with the least total bgs.sort(key = lambda g : g.total) bgs[0].select(data) # print the results bgs.sort(key = lambda g : g.id) for bg in bgs: print(bg.to_str()) print() vs = [100, 100, 70, 60, 90, 30, 110, 2, 3, 4, 5, 6, 7, 8, 9, 10] print(vs) for i in range(6): divide(vs, i + 1)
以下是基于真实数据
# Extension: the values below are from real file szies of a project to be divided in 5 subgroups
vs = [59457476038,126743000000,232428000000,105928413,346029690,7703020704,85148347543,104337119,148141000000,11257293691,1792923609,22340965878,100805312,235369862,24278504,6643249135,20509042923,7113846,34232304808,45150982,15848738,4132463819,21814312931,19474768,14117864247,67520309327,241665747,3038093,1869594271,247366146,4695609,2416887,65273575,471965946,14485209,1488335,2933571,5989087,129241768,445109746,574288946,76536145,193497988,1648700010,2368595904,326419,4139814762,29421241,12233918,2607505,6809164011,15713204,296097,2168482,938069008,1829439015,4424343,859787,909709572,2684190749,19746476258,19698184220,19516933340,19702154694,19530986132,19349101599,19638568079,19675363046,19611299207,19557739950,19611610008,4489649362,19731480581,19445158816,38471001577,38098352507,38960387425,38056177350,38676587523,38650486316,39166251600,38546978580,38090632438,38426304223,38098735460,38932826578,38445273647,37802054672,5237488556,4280,3311723,7343,11309906,8673,1320895,7396,18091,7280397,74971,1717952,370073,2436,2370,448022,2657918,2817,169068,54739103,172419,10060135928,92967,1522818624,2271562695,5460067792,1687769590,865673,1052506402,778181598,496372,269879,391899,87294018,411984,161248000000,12430136,7429904,165900568,4305695890,1249989618,1585145233,7408080562,8357496,7747637,18391494143,2251552,17384997,4243868657,1050711389,9671440815,11355613,12771976,5548686984,33707269101,62575925,2059637773,4822710510,111935191,2350923632,134108248,179885278,37627690,509479672,9026542149,639354328,118589215,58998993,24268964,5396438,1216114930,108088000000,8064,141458043,147981,246238717,597368,412213269,57157415,111541015,11607,135857,43937,1760241,8301,7478,371062459,52400571,329666690,51605,3229312674,16522183,24037767,279449413,3896176,5407834956,3806144961,1453672591,115454,11692440536,3076499372,501741,5200550,8682055,1775613301,7632306973,182409,199062035,463755,21313,202992,66568769,273026,2953705,54747708,43229113,244271,31000171,90072,51970860,161623,39508114,42214573,986963190,124391759,380886777,59302031,184634938,422712441,933780,41147,100329,6663819,914664,362735,36621585,206548,64362665,35551,755701,1000111,690009,1312094,215264,4083182,1059075,829034365,2127370,822355,1524211,361015102,67147891,5371594321,940537260,4546258,475885236,585321702,421102803,1990168678,447357,198871,99290914,1196442,450765046,194595694,1233979433,153847,20890634,61971673,803567288,166558,247101,39554046,299015,136321,493601,129937125,13362,5467082,64742,34212,34361,2853,172867,2980958,71550,2750797,5123540,9644,34431041,11277909,28067706,109309,226278,652037892,13054336087,3168363,11040196,18868994824,26753190,6966674510,269314189,46070074225,22012559522,18886827121,24775866865,9371589,1179412,3654882,5140768,9435166,29461994,864575375,54372183094,1235861101,175830,371151844,54744,24027,188944,20754305,41758854,772064218,139539117,254669,40335750,1214926682,39280,25188053,19206,592321,1422727729,8771567,14248575,6471514,5098499601,2626208592,378771664,432294254,6687231,2374705,16609597,875455029,1205392096,4763191958,283167707,9600793,4073299,1935462853,469993458,879760602,155090,1287471,1478240943,2690987274,1263574875,2779365748,225602,167478,639926,5290607,1155550,224854025,77779167,30200754,282612,1386187471,1300737388,176990302,3730996,311313158,7848250674,9019308966,91246929,22214245,33412259,65426176089,3617561513,502245881,8553801336,4374557363,65155169,3592570578,617169720,1052874024,10401803,201846,2605003616,535896,4908793,1268349887,218827374,1112782,503444782,109183,30481603,4234866,274815640,1705705,473148722,65048154472,162484875,10355434590,72802392,101289000000,1300990435,87525294542,3869692011,22825095504,14697394174,76116852040,108423614,5013786444,28381717589,456218438,145131538,24884553543,436197,5135711799,2076141885,1640302,400799,333575798,648068,848349768,4321090310,649646,3134980,914971002,1014394652,144178023,2237858,663415,36156497,15035888939,24007460749,7860017227,2546836708,15479288,10088014796,143611220,46643206,26054967,101215980,54486062344,2190079444,749527705,3057569173,101794796,705375914,15385297157,11706644,31846,10276480,2087,523164969,36391443,168951491,49632,72110250,6186718,7161,757908800,162939142,12559067,108390,59203416,19309705781,14440026,2474897307,24995097,51520198,9514970139,4579896439,37281771,257070733,43843017270,808680174,15375369160,2161326750,27380920203,4256963372,1573372648,16729685,7254670,3592,7272956,68752607,15006,14640393,31875,97561,17444862,330963,145025,5333941,11144259,1309685,33346935,46918,3975,631014,318627,74033473,1588291,4706810,10487,3032,1410909,634673,4534,229983,456414102,42443443,69498895,57526339,11585462,2550472006,5837231923,1688363,16618333,11354962008,12714164378,23230714783,58142005,4152733416,2417873671,145333214,15992352232,55987645509,107507000000,31324697,12162718488,3727371536,388120613,8426960,19375786,67816568,4733388507,3738726372,1111719444,62245198,2599746863,3157832865,1276295276,143555317,10842696,26849490,14940678,542551,9321,2178,3399037,14888424,163329344,88680388,170270923,104640363,191309,308380,40774039,323379,1150,197445,382633548,15727754,5582679,2558670159,10386296014,5889436,38438130259,6777108502,253243400,37914389,24377706,21447607388,3400262,1878516002,37611342060,10072191,2573091538,12009072,72255761265,48338285633,353080402,96351,1211071,213635674,76674352801,22514102,1727857997,10860564507,58617816052,25977684603,467415000000,25982693181,9055069954,232825000000,38247797184,8059298446,8631,73438511656,24204851694,143014000000,10357132416,14852306951,14416752202,158747000000,278031,6270499367,972737,14650131,52802830067,53872904,215328159,12269335737,259757648,511874854,376185,184044,44867552474,28551312,227977578,5329803301,37112415,5325739230,18409873909,483347359,10677723029,26611472663,12044092,48273422300,97296019033,916167885,75513198,1347462138,903708,83594230,1343005101,1375067,31411554062,78072823404,212006813,261119879,3672610207,829285,31554044010,136544887,30862821844,12233840108,17871412779,213942,16078291,464182794,630532,141243571,4857610719,4228304,36597146,459597835,2338200306,6904652437,8184129,32113456,127550511,34025,99035265,287571237,167824,699484167,584412869,3374,2055240,183316304,124768,6292503163,18111406986,33005800,141929000000,148612000000,1130595576,22389131455,67309344511,176037512,2223193749,3255981916,731587,29425254849,40725770294,49816773,102511000000,112173000000,42953115614,1623634719,3200928499,27213807,12305626763,17362971653,215171780,924071604,29255199,16718933149,24905627776,96896292744,537320147,89376078,20710251824,338036,123294766,1440598784,12029,18593925940,29804573273,122967982,1938263,17567833,22475,30923453983,68384344823,509080662,1600740381,5815121417,42242753,24625790687,61616164685,4251646659,10115880255,3305127,334796000000,536879000000,359143886,120281000000]
divide(vs, 10)