
How to get the difference between two lists based on substrings within each string in the separate lists


I have two long lists, one from a log file that contains lines formatted like

201001050843 blah blah blah <email@site.com> blah blah

and a second from a file in CSV format. I need to generate a list of all the entries in file2 whose email address does not appear in the log file, while maintaining the CSV format.

Example
Log file contains:

201001050843 blah blah blah <email@site.com> blah blah
201001050843 blah blah blah <email2@site.com> blah blah

File2 contains:

156456,bob,sagget,email@site.com,4564456
156464,bob,otherguy,email@anothersite.com,45644562

The output should be:

156464,bob,otherguy,email@anothersite.com,45644562

Currently I grab the emails from the log and load them into another list with:

sent_emails = []
for line in sent:
    try:
        temp1 = line.index('<')
        temp2 = line.index('>')
        sent_emails.append(line[temp1+1:temp2])
    except ValueError:
        pass

And then compare to file2 with either:

lista = mail_lista.readlines()
for line in lista:
    temp = line.split()
    for thing in temp:
        try:
            if thing.index('@'):
                if thing in sent_emails:
                    lista.remove(temp)
        except ValueError:
            pass
newa.writelines(lista)

or:

for line in mail_listb:
    temp = line.split()
    for thing in temp:
        try:
            if thing.index('@'):
                if thing not in sent_emails:
                    newb.write(line)
        except ValueError:
            pass

However, both return all of file2!

Thanks for any help you can give.

EDIT: Thanks for the recommendations for sets; they made a larger speed difference than I would have thought possible. Way to go, hash tables! I will definitely be using sets more often from now on.


Solution

  • You could create the set of emails as you do and then:

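    import fileinput
    import sys

    # emails is a set of emails
    for line in fileinput.input("csvfile.csv", inplace=True):
        parts = line.split(',')
        # strip() guards against stray whitespace around the field
        if parts[3].strip() not in emails:
            # line already ends with a newline, so write it unchanged
            sys.stdout.write(line)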
    

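    This only works if the email is always the fourth field of each CSV line (index 3 after splitting on commas).

    fileinput enables in-place editing: with inplace set, anything written to standard output inside the loop replaces the contents of the file.

    Also, use a set for the emails instead of a list, as Aaron said, not only for speed but also to eliminate duplicates.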
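For reference, both attempts in the question return all of file2 because line.split() with no argument splits on whitespace, not commas, so each CSV line stays a single token that never equals a bare email address. The following is a minimal end-to-end sketch of the set-based approach, written for Python 3 and using the csv module instead of a manual split. The helper names (sent_addresses, unsent_rows), the regular expression, and the email_column default are assumptions made for illustration, not part of the original answer. It writes to a string instead of editing the file in place.

```python
import csv
import io
import re

# Hypothetical pattern: an address is whatever sits between < and > and contains an @.
EMAIL_IN_BRACKETS = re.compile(r"<([^<>\s]+@[^<>\s]+)>")


def sent_addresses(log_lines):
    """Collect every <address> found in the log lines into a set."""
    found = set()
    for line in log_lines:
        found.update(EMAIL_IN_BRACKETS.findall(line))
    return found


def unsent_rows(csv_lines, sent, email_column=3):
    """Yield CSV rows whose email field is not in the sent set.

    Rows too short to have an email column are skipped.
    """
    for row in csv.reader(csv_lines):
        if len(row) > email_column and row[email_column].strip() not in sent:
            yield row


# Sample data taken from the question.
log = [
    "201001050843 blah blah blah <email@site.com> blah blah",
    "201001050843 blah blah blah <email2@site.com> blah blah",
]
file2 = [
    "156456,bob,sagget,email@site.com,4564456",
    "156464,bob,otherguy,email@anothersite.com,45644562",
]

out = io.StringIO()
csv.writer(out, lineterminator="\n").writerows(unsent_rows(file2, sent_addresses(log)))
result = out.getvalue()
print(result, end="")  # 156464,bob,otherguy,email@anothersite.com,45644562
```

With real files, the two lists could be replaced by open file objects, since both helpers only need an iterable of lines.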