I have two long lists. One comes from a log file that contains lines formatted like
201001050843 blah blah blah <email@site.com> blah blah
and the other comes from a second file in CSV format. I need to generate a list of all the entries in file2 whose email address does not appear in the log file, while maintaining the CSV format.
Example
Log file contains:
201001050843 blah blah blah <email@site.com> blah blah
201001050843 blah blah blah <email2@site.com> blah blah
File2 contains:
156456,bob,sagget,email@site.com,4564456
156464,bob,otherguy,email@anothersite.com,45644562
The output should be:
156464,bob,otherguy,email@anothersite.com,45644562
Currently I grab the emails from the log and load them into another list with:
sent_emails = []
for line in sent:
    try:
        temp1 = line.index('<')
        temp2 = line.index('>')
        sent_emails.append(line[temp1+1:temp2])
    except ValueError:
        pass
And then compare to file2 with either:
lista = mail_lista.readlines()
for line in lista:
    temp = line.split()
    for thing in temp:
        try:
            if thing.index('@'):
                if thing in sent_emails:
                    lista.remove(temp)
        except ValueError:
            pass
newa.writelines(lista)
or:
for line in mail_listb:
    temp = line.split()
    for thing in temp:
        try:
            if thing.index('@'):
                if thing not in sent_emails:
                    newb.write(line)
        except ValueError:
            pass
However, both return all of file2!
Thanks for any help you can give.
EDIT: Thanks for the recommendations to use sets; it made a larger speed difference than I would have thought possible. Way to go, hash tables! I will definitely be using sets more often from now on.
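You could build the set of emails the way you already do, and then filter the CSV file:
import fileinput

# emails is the set of addresses extracted from the log
for line in fileinput.input("csvfile.csv", inplace=True):
    parts = line.split(',')
    if parts[3] not in emails:
        # with inplace=True, printed output is written back to the file;
        # end="" avoids doubling the newline that line already ends with
        print(line, end="")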
This only works if the email address is always the fourth field of each CSV line (index 3).
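If the column position is not guaranteed, one option is to test every field instead of a fixed index. The following is only a minimal sketch, assuming no field contains an embedded comma; the helper name `line_was_sent` is made up for illustration, and the sample data is taken from the question.

```python
def line_was_sent(line, emails):
    """Return True if any comma-separated field of line is in emails."""
    # strip() removes surrounding spaces and the trailing newline
    fields = (field.strip() for field in line.split(","))
    return any(field in emails for field in fields)


# Sample data from the question
emails = {"email@site.com", "email2@site.com"}
rows = [
    "156456,bob,sagget,email@site.com,4564456\n",
    "156464,bob,otherguy,email@anothersite.com,45644562\n",
]

# Keep only the lines whose fields never match a logged address
unsent = [row for row in rows if not line_was_sent(row, emails)]
```

Checking every field costs a few extra set lookups per line, but each lookup is still a constant-time hash check.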
With inplace enabled, fileinput redirects standard output into the file, so whatever you print replaces the file's contents in place.
And use a set for the emails instead of a list, as Aaron said, not only for speed but also to eliminate duplicates.
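As for why both loops in the question return all of file2: `line.split()` with no argument splits on whitespace, so a CSV line without spaces comes back as a single token containing the whole line, and that token never equals a bare address. Putting the pieces together, here is a rough sketch that writes to a separate output instead of editing in place. It assumes the address sits in column index 3 and uses the standard `csv` module; the names `extract_emails` and `unsent_rows` are hypothetical, and in-memory sample data from the question stands in for the real files.

```python
import csv
import io
import re

# An address is taken to be whatever sits between < and > and contains an @
EMAIL_IN_BRACKETS = re.compile(r"<([^<>\s]+@[^<>\s]+)>")


def extract_emails(log_lines):
    """Collect every <address> found in the log lines into a set."""
    emails = set()
    for line in log_lines:
        emails.update(EMAIL_IN_BRACKETS.findall(line))
    return emails


def unsent_rows(csv_lines, emails, email_column=3):
    """Yield CSV rows whose email field is not in emails."""
    for row in csv.reader(csv_lines):
        if not row:
            continue  # skip blank lines
        if row[email_column].strip() not in emails:
            yield row


log = [
    "201001050843 blah blah blah <email@site.com> blah blah\n",
    "201001050843 blah blah blah <email2@site.com> blah blah\n",
]
file2 = [
    "156456,bob,sagget,email@site.com,4564456\n",
    "156464,bob,otherguy,email@anothersite.com,45644562\n",
]

sent_emails = extract_emails(log)
out = io.StringIO()  # stands in for the output file
csv.writer(out, lineterminator="\n").writerows(unsent_rows(file2, sent_emails))
result = out.getvalue()
```

With real files you would pass the open file objects in place of the lists, opening the CSV files with `newline=""` as the `csv` documentation recommends.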