I have two files:
I want to compare line[3] (the fourth column) from fileA with line[1] (the second column) from fileB.
fileA format:
1 i713426 0 726912 0 0
1 i713449 0 830731 0 0
1 i707010 0 1183442 0 A
1 i713034 0 1225231 0 G
1 i703639 0 1267327 I D
1 i713057 0 1425512 0 T
1 i713129 0 1501061 0 G
1 i707027 0 1542721 0 C
1 i713163 0 1680617 0 C
1 i707055 0 1884055 0 C
1 i713254 0 2145254 0 C
1 i713324 0 2486696 0 C
1 i6059967 0 2526746 G A
1 i713334 0 2626131 0 0
1 i713335 0 2692373 0 C
1 i713341 0 3043138 0 A
1 i707150 0 3216645 0 0
1 i713347 0 3277176 0 G
fileB format:
chr1 87190 rs1524602 A/G 0.4358974358974359 0.8
chr1 87204 rs866881507 A/G 0.02564102564102564 0.2
chr1 87234 rs533355948 C/T 0.02564102564102564 0.2
chr1 87236 rs879825293 C/T 0.05128205128205128 0.2
chr1 87256 rs373216495 C/T 0.05128205128205128 0.6
chr1 87259 rs570089526 A/G 0.05128205128205128 0.6
chr1 87302 rs529420236 C/T 0.02564102564102564 0.2
chr1 87303 rs2103135 A/G 0.1282051282051282 0.4
chr1 87304 rs550004764 A/G 0.02564102564102564 0.2
chr1 87351 rs549570359 C/T 0.02564102564102564 0.2
chr1 87360 rs180907504 C/T 0.15384615384615385 0.6
chr1 87361 rs535266627 A/G 0.02564102564102564 0.4
chr1 87366 rs558417557 A/G 0.02564102564102564 0.4
chr1 87373 rs963638476 A/G 0.02564102564102564 0.2
chr1 87374 rs974579646 A/C 0.02564102564102564 0.2
Output: if line[3] from fileA is equal to line[1] from fileB, print the two IDs:
i713426 rs567161598
i713449 rs547376081
i707010 rs566056983
i713034 rs568184696
i703639 rs748522325
i713057 rs528436382
i713129 rs560208264
i707027 rs532649680
i713163 rs577119367
i707055 rs566696367
i713254 rs554477909
i713324 rs542280290
My code:
with open('/////fileA','r') as bim:
    with open('////output.isec', 'w') as ic:
        for k in bim:
            l1 = k.split('\t')
            size = len(str(l1[3]))
            with open('/fileB', 'r') as file:
                for m in file:
                    l2 = m.split('\t')
                    if len(l2[1]) != size:
                        continue
                    if l1[3] == l2[1]:
                        if l1[1] != l2[2]:
                            #print(l1[1],l2[2])
                            ic.write('{0}\t{1}\n'.format(l1[1],l2[2]))
                        break
The script (O(k^2)) takes approximately 30 minutes for 900 lines from fileA. How can I modify it to reduce the running time?
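Since both files are sorted by position, you only need to iterate through each one once: at every step, compare the two current positions and read the next line from whichever file has the lower one. When the positions match, write the pair and advance both files. This is a single pass over each file instead of a rescan of fileB for every line of fileA, so it should take only a few minutes: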
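with open('/////fileA', 'r') as bim, \
     open('/fileB', 'r') as file, \
     open('////output.isec', 'w') as ic:
    bim_line = bim.readline()
    line = file.readline()
    while line and bim_line:
        bim_split = bim_line.split("\t")
        split = line.split("\t")
        # Compare positions as integers; comparing the raw strings would
        # order them lexicographically (e.g. "87190" > "726912").
        bim_pos = int(bim_split[3])
        pos = int(split[1])
        if bim_pos < pos:
            # fileA is behind: advance fileA
            bim_line = bim.readline()
        elif pos < bim_pos:
            # fileB is behind: advance fileB
            line = file.readline()
        else:
            # Same position: write the pair of IDs and advance both files
            ic.write(bim_split[1] + "\t" + split[2] + "\n")
            line = file.readline()
            bim_line = bim.readline()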
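If you cannot rely on both files being sorted the same way, a different technique (not the sorted walk above) is to index the smaller file in a dictionary and stream the larger one once. The following is only a minimal sketch under some assumptions: the function name match_positions is hypothetical, the files are tab-separated as in your code, all rows belong to a single chromosome, and each position appears at most once in fileA (a later duplicate would overwrite an earlier one). The demo rows at the bottom are made up for illustration, including the ID rsEXAMPLE.

```python
import io


def match_positions(file_a, file_b, sep="\t"):
    """Yield (fileA ID, fileB ID) pairs for rows that share a position.

    file_a: iterable of lines, ID in column 1, position in column 3.
    file_b: iterable of lines, position in column 1, ID in column 2.
    Hypothetical helper; assumes one chromosome and unique positions in file_a.
    """
    # Build a position -> ID lookup from fileA (the smaller file).
    ids_by_pos = {}
    for line in file_a:
        cols = line.rstrip("\n").split(sep)
        if len(cols) > 3:
            ids_by_pos[cols[3]] = cols[1]

    # Stream fileB once; each dictionary lookup is constant time on average.
    for line in file_b:
        cols = line.rstrip("\n").split(sep)
        if len(cols) > 2 and cols[1] in ids_by_pos:
            yield ids_by_pos[cols[1]], cols[2]


# Made-up demo rows (not taken from the real files).
demo_a = io.StringIO(
    "1\ti713426\t0\t726912\t0\t0\n"
    "1\ti713449\t0\t830731\t0\t0\n"
)
demo_b = io.StringIO(
    "chr1\t87190\trs1524602\tA/G\t0.4\t0.8\n"
    "chr1\t726912\trsEXAMPLE\tA/G\t0.1\t0.2\n"
)
pairs = list(match_positions(demo_a, demo_b))
for a_id, b_id in pairs:
    print(a_id + "\t" + b_id)
```

With real data you would pass the open file objects instead of the StringIO demos and write each pair to the output file. Note that the pairs come out in fileB order, and the memory cost is one dictionary entry per fileA line, which is small for 900 lines.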