
get non-matching line numbers python


Hi, I wrote a simple script in Python to do the following:

I have two files summarizing genomic data. The first file has the names of loci I want to get rid of, it looks something like this

File_1:

R000002
R000003
R000006

The second file has the names and position of all my loci and looks like this:

File_2:

R000001 1
R000001 2
R000001 3
R000002 10
R000002 2
R000002 3
R000003 20
R000003 3
R000004 1
R000004 20
R000004 4
R000005 2
R000005 3
R000006 10
R000006 11
R000006 123

What I wish to do is get the line numbers of all loci in File_2 that are not in File_1, so the end result should look like this:

Result:

1
2
3
9
10
11
12
13

I wrote the following simple code, and it gets the job done:

#!/usr/bin/env python

import sys

File1 = sys.argv[1]
File2 = sys.argv[2]

F1 = open(File1).readlines()
F2 = open(File2).readlines()
F3 = open(File2 + '.np', 'w')
Loci = []

for line in F1:
        Loci.append(line.strip())

for x, y in enumerate(F2):
        y2 = y.strip().split()
        if y2[0] not in Loci:
                F3.write(str(x+1) + '\n')

However, when I run this on my real data set, where the first file has 58470 lines and the second file has 12881010 lines, it seems to take forever. I am guessing that the bottleneck is in the

if y2[0] not in Loci:

part, where the code has to search through the whole Loci list (built from File_1) once for every line of File_2, but I have not been able to find a speedier solution.

Can anybody help me out and show a more Pythonic way of doing things?

Thanks in advance


Solution

  • Here's some slightly more Pythonic code that doesn't care whether your files are ordered. I'd prefer to just print everything out and redirect it to a file (./myscript.py > outfile.txt), but you could also pass in another filename and write to that.

    #!/usr/bin/env python
    import sys
    
    ignore_f = sys.argv[1]
    loci_f = sys.argv[2]
    
    with open(ignore_f) as f:
        ignore = set(x.strip() for x in f)
    
    with open(loci_f) as f:
        for n, line in enumerate(f, start=1):
            if line.split()[0] not in ignore:
                print(n)
    

    Searching for something in a list is O(n), while it takes only O(1) on average for a set. If order doesn't matter and the items are unique, use a set instead of a list. While this isn't optimal, it should be O(n) instead of O(n × m) like your code.
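    To see that difference concretely, here's a small timing sketch. The locus names and sizes below are made up for illustration, and absolute times will vary by machine; only the relative gap matters.

    ```python
    import timeit

    # Synthetic data roughly shaped like the ignore list: ~50k locus names.
    items = [f"R{n:06d}" for n in range(50000)]
    as_list = items
    as_set = set(items)

    probe = "R049999"  # worst case for the list: the last element

    # Time 100 membership tests against each container.
    t_list = timeit.timeit(lambda: probe in as_list, number=100)
    t_set = timeit.timeit(lambda: probe in as_set, number=100)

    print(t_list > t_set)  # should print True: the set lookup is far faster
    ```

    Now multiply that per-lookup gap by the 12881010 lines in the loci file and the slowdown in the original code is easy to see.
    
    
    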

    You're also not closing your files, which isn't a big deal when reading but does matter when writing. I use context managers (with) so Python does that for me.

    Style-wise, use descriptive variable names, and avoid UpperCase names; those are typically used for classes (see PEP-8).

    If your files are ordered, you can step through them together, ignoring lines where the loci names are the same, then when they differ, take another step in your ignore file and recheck.
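    For the ordered case, here's a rough sketch of that two-pointer walk, using the sample data from the question as in-memory lists. The function name is made up, and this assumes both files are sorted by locus name, which the question doesn't guarantee.

    ```python
    def non_ignored_line_numbers(ignore_lines, loci_lines):
        """Yield 1-based line numbers of loci whose name is not in ignore_lines.

        Both inputs must be sorted by locus name.
        """
        ignore_iter = iter(ignore_lines)
        current = next(ignore_iter, None)
        for n, line in enumerate(loci_lines, start=1):
            name = line.split()[0]
            # Advance the ignore pointer past names that sort before this locus.
            while current is not None and current < name:
                current = next(ignore_iter, None)
            if name != current:
                yield n

    # Sample data from the question.
    ignore = ["R000002", "R000003", "R000006"]
    loci = ["R000001 1", "R000001 2", "R000001 3",
            "R000002 10", "R000002 2", "R000002 3",
            "R000003 20", "R000003 3",
            "R000004 1", "R000004 20", "R000004 4",
            "R000005 2", "R000005 3",
            "R000006 10", "R000006 11", "R000006 123"]

    print(list(non_ignored_line_numbers(ignore, loci)))
    # -> [1, 2, 3, 9, 10, 11, 12, 13]
    ```

    This never holds more than one ignore entry in memory at a time, so it stays O(n + m) in time and O(1) in extra space, though in practice the set-based version above is already fast enough and simpler to get right.
    
    
    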