Python startswith function: printing non partially matched lines


Hi, I want to use the startswith function to print the lines in fileY.txt that are NOT partial matches (prefixes) of any line in fileX.txt.

In the script below I load fileX.txt and fileY.txt into lists. I then search fileX.txt for partial matches with fileY.txt using the startswith function.

Next, I attempt to print the lines in fileY.txt that have no partial match in fileX.txt. However, the script only prints the last line in fileY.txt.

Any help or suggestions will be appreciated (I don't mind if I have to use a helper app such as sed).

Source:

#load lines from file into lists
lines1 = [line1.rstrip('\n') for line1 in open('fileX.txt')]
lines2 = [line2.rstrip('\n') for line2 in open('fileY.txt')]

#set lines
set_of_lines1 = set(lines1)
set_of_lines2 = set(lines2)

#set common
common = set_of_lines1 & set_of_lines2

#return lines which partially match as variable e
[e for e in lines1 if e.startswith(tuple(lines2))]

#minus partially matched lines from fileY.txt
difference = set_of_lines2 - e

#print the non matching lines
for color in difference:
   print 'The color prefix ' + color + ' does not exist in the list'

fileX.txt:

blue
green
red

fileY.txt:

blu
gre
re
whi
oran

What I want:

C:\Users\Foo\Bar\Python\Test\>C:\python27\python Test.py
The color prefix whi does not exist in the list
The color prefix oran does not exist in the list

Press any key to continue . . .

Solution

  • The first problem is with this line:

    [e for e in lines1 if e.startswith(tuple(lines2))]
    

    It constructs a list of partial matches and then throws it away. All you retain is the value of e, which has leaked out of the list comprehension (in Python 3 the loop variable does not leak, so using e afterwards would raise a NameError). You need:

    partial_match = [e for e in lines1 if e.startswith(tuple(lines2))]
    

    which brings us to the second problem. If you print out partial_match, you will see that it contains ['blue', 'green', 'red'], because the comprehension iterates over lines1 and keeps the full colour names. I think you are expecting it to contain ['blu', 'gre', 're'], because you are trying to do a set difference between it and set(['blu', 're', 'gre', 'whi', 'oran']).

    Since your problems revolve around the list comprehension I suggest you unwind it into a loop where you can print out intermediate values so you can see what is going on and get the logic right. If you really want a one-liner you can always rewrite it later.

    Like this:

    matches = []
    for prefix in lines2:
        for colour in lines1:
            if colour.startswith(prefix):
                matches.append(prefix)
    

    matches will now contain ['blu', 'gre', 're']. Now report on the prefixes that are not matches.

    for nomatch in set(lines2) - set(matches):
        print "The color prefix %r does not exist in the list" % nomatch
    

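    This will give you the output below (%r adds the quotes, and because a set has no defined order the two lines may appear in either order):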

    The color prefix 'whi' does not exist in the list
    The color prefix 'oran' does not exist in the list
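
If you do want the one-liner afterwards, here is a minimal sketch of one way to write it, assuming the two files have already been read into lists. The helper name unmatched_prefixes and the hard-coded sample lists are my own illustration, not part of the original script. It uses any() so the search stops at the first colour that matches, and it keeps the prefixes in file order instead of going through a set. The first print also shows why the original comprehension returns full colour names rather than prefixes.

```python
from __future__ import print_function  # same file runs on Python 2.7 and Python 3


def unmatched_prefixes(prefixes, words):
    """Return the prefixes that no word starts with, in their original order.

    Hypothetical helper name, used here for illustration only.
    """
    return [p for p in prefixes if not any(w.startswith(p) for w in words)]


# Sample data standing in for the contents of fileX.txt and fileY.txt.
colours = ['blue', 'green', 'red']
prefixes = ['blu', 'gre', 're', 'whi', 'oran']

# The original comprehension tests each colour against all prefixes,
# so what it collects are the colours, not the prefixes.
partial_match = [c for c in colours if c.startswith(tuple(prefixes))]
print(partial_match)

# Checking each prefix against the colours gives the lines to report.
missing = unmatched_prefixes(prefixes, colours)
for prefix in missing:
    print('The color prefix %s does not exist in the list' % prefix)
```

With the sample data this reports whi and oran, in that order. Note that an empty line in fileY.txt would count as matched, since every string starts with the empty string.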