
Parse big file for found values using python


I have two files:

  • fileA with 20,000 lines
  • fileB with 16,000,000 lines

I want to compare field line[3] of each fileA line with field line[1] of each fileB line.

fileA format:

1       i713426 0       726912  0       0
1       i713449 0       830731  0       0
1       i707010 0       1183442 0       A
1       i713034 0       1225231 0       G
1       i703639 0       1267327 I       D
1       i713057 0       1425512 0       T
1       i713129 0       1501061 0       G
1       i707027 0       1542721 0       C
1       i713163 0       1680617 0       C
1       i707055 0       1884055 0       C
1       i713254 0       2145254 0       C
1       i713324 0       2486696 0       C
1       i6059967        0       2526746 G       A
1       i713334 0       2626131 0       0
1       i713335 0       2692373 0       C
1       i713341 0       3043138 0       A
1       i707150 0       3216645 0       0
1       i713347 0       3277176 0       G

fileB format:

chr1    87190   rs1524602   A/G 0.4358974358974359  0.8
chr1    87204   rs866881507 A/G 0.02564102564102564 0.2
chr1    87234   rs533355948 C/T 0.02564102564102564 0.2
chr1    87236   rs879825293 C/T 0.05128205128205128 0.2
chr1    87256   rs373216495 C/T 0.05128205128205128 0.6
chr1    87259   rs570089526 A/G 0.05128205128205128 0.6
chr1    87302   rs529420236 C/T 0.02564102564102564 0.2
chr1    87303   rs2103135   A/G 0.1282051282051282  0.4
chr1    87304   rs550004764 A/G 0.02564102564102564 0.2
chr1    87351   rs549570359 C/T 0.02564102564102564 0.2
chr1    87360   rs180907504 C/T 0.15384615384615385 0.6
chr1    87361   rs535266627 A/G 0.02564102564102564 0.4
chr1    87366   rs558417557 A/G 0.02564102564102564 0.4
chr1    87373   rs963638476 A/G 0.02564102564102564 0.2
chr1    87374   rs974579646 A/C 0.02564102564102564 0.2

Desired output: if line[3] from fileA equals line[1] from fileB, print the two IDs (line[1] from fileA and line[2] from fileB):

i713426 rs567161598
i713449 rs547376081
i707010 rs566056983
i713034 rs568184696
i703639 rs748522325
i713057 rs528436382
i713129 rs560208264
i707027 rs532649680
i713163 rs577119367
i707055 rs566696367
i713254 rs554477909
i713324 rs542280290

My code:

with open('/////fileA','r') as bim:
    with open ('////output.isec', 'w') as ic:
        for k in bim:
            l1 = k.split('\t')
            size = len(str(l1[3]))
            with open('/fileB', 'r') as file:
                for m in file:
                    l2 = m.split('\t')
                    if len(l2[1]) != size:
                        continue
                    if l1[3] == l2[1]:
                        if l1[1] != l2[2]:
                            #print(l1[1],l2[2])
                            ic.write('{0}\t{1}\n'.format(l1[1],l2[2]))
                        break 

The script (O(k^2)) takes approximately 30 minutes for 900 lines from fileA. How can I modify my script to reduce the run time?


Solution

  • Since both files are sorted by position, you only need to iterate through each one once: at every step, advance whichever file currently has the lower position, and when the positions match, write the pair and advance both. This should take only a few minutes:

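    # Merge-style pass over two files that are both sorted by position.
    with open('/////fileA', 'r') as bim, \
         open('/fileB', 'r') as file, \
         open('////output.isec', 'w') as ic:
        bim_line = bim.readline()
        line = file.readline()
        while line and bim_line:
            bim_split = bim_line.split("\t")
            split = line.split("\t")
            # Compare as integers: as strings, "726912" < "87190" is True.
            bim_pos = int(bim_split[3])
            pos = int(split[1])
            if bim_pos < pos:
                # fileA is behind: advance it.
                bim_line = bim.readline()
            elif pos < bim_pos:
                # fileB is behind: advance it.
                line = file.readline()
            else:
                # Same position: write both IDs and advance both files.
                ic.write(bim_split[1] + "\t" + split[2] + "\n")
                line = file.readline()
                bim_line = bim.readline()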
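
  • If you would rather not rely on the sort order, another option is to load the positions from the smaller file into a dictionary and stream the large file once. The following is only a sketch under some assumptions: the function name match_positions and the demo rows are made up for illustration, fields are split on any whitespace rather than strictly on tabs, positions are parsed as integers, and if a position occurs more than once in fileA only the last ID is kept. Like the code above, it does not compare the chromosome column.

```python
import io


def match_positions(file_a, file_b):
    """Yield (fileA id, fileB id) pairs whose positions are equal."""
    # Map position -> id for the small file (about 20,000 entries).
    ids_by_pos = {}
    for line in file_a:
        cols = line.split()
        if len(cols) >= 4:
            ids_by_pos[int(cols[3])] = cols[1]
    # Stream the large file once; each dictionary lookup is O(1) on average.
    for line in file_b:
        cols = line.split()
        if len(cols) >= 3:
            a_id = ids_by_pos.get(int(cols[1]))
            if a_id is not None:
                yield a_id, cols[2]


# Tiny in-memory stand-ins for the two files (made-up rows, not real data).
demo_a = io.StringIO("1\ti1\t0\t100\t0\tA\n1\ti2\t0\t250\t0\tC\n")
demo_b = io.StringIO("chr1\t90\trs9\tA/G\t0.1\t0.2\nchr1\t250\trs7\tC/T\t0.1\t0.2\n")
pairs = list(match_positions(demo_a, demo_b))
print(pairs)  # [('i2', 'rs7')]
```

    With the real data, pass the two open file objects instead of the StringIO stand-ins and write each pair to the output file separated by a tab. Only the positions from fileA are held in memory, so the size of fileB does not matter.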