python regex list csv tokenize

Extract lines in CSV file which don't have elements in a list


I have a list of substrings that I need to compare against one column of a CSV file. If any substring from the list appears in that column, the row should be skipped; I would like to write out only the rows whose value in that column contains none of them. The file has many columns, but I am looking at only one.

For example, the my_string column has the values

{ "This is just comparison of likely tokens","what a tough thing?"}

de = ["just","not","really ", "hat"]

I would like to write only the row which has "what a tough thing?"

This works fine when the column contains only a word from the list. For instance, if the my_string column is just "really", the row is not written to the new file. But it does not behave correctly when an item from the list appears alongside other text.

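import csv

# Python 2 file modes; on Python 3 use open(infile, newline='') and open(outfile, 'w', newline='')
with open(infile, 'rb') as inFile, open(outfile, 'wb') as outFile:
    reader = csv.reader(inFile, delimiter=',')
    writer = csv.writer(outFile, delimiter=',')

    for row in reader:
        # plain substring test: "hat" also matches inside "what"
        if any(d in row[1] for d in de):
            pass
        else:
            writer.writerow(row)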

Solution

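  • You can compile the words into a single regex, and even do a case-insensitive match, as follows. The \b word boundaries make it match whole words only, so "hat" no longer matches inside "what":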

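    import re
    # strip() removes stray spaces (e.g. "really "); re.escape() protects regex metacharacters
    r = re.compile(r'\b(' + '|'.join(re.escape(d.strip()) for d in de) + r')\b', re.IGNORECASE)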

    Then your code could simply be:

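    # Python 2 file modes; on Python 3 use open(infile, newline='') and open(outfile, 'w', newline='')
    with open(infile, 'rb') as inFile, open(outfile, 'wb') as outFile:
        reader = csv.reader(inFile, delimiter=',')
        writer = csv.writer(outFile, delimiter=',')

        for row in reader:
            # keep the row only if none of the words occurs in the second column
            if not r.search(row[1]):
                # write the whole row; writerow(row[1]) would split the string into characters
                writer.writerow(row)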
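
For reference, here is a self-contained Python 3 sketch of the approach above. The helper names build_word_pattern and filter_rows, the extra id column, and the in-memory io.StringIO objects standing in for the real files are assumptions made for illustration, not part of the original answer. List entries are stripped and escaped before being joined, and skipping rows that are too short is also just one possible choice.

```python
import csv
import io
import re


def build_word_pattern(words):
    """Compile a case-insensitive pattern matching any of `words` as a whole word."""
    # strip() drops stray spaces such as the one in "really ";
    # re.escape() guards against regex metacharacters in the entries.
    cleaned = [re.escape(w.strip()) for w in words if w.strip()]
    return re.compile(r'\b(?:' + '|'.join(cleaned) + r')\b', re.IGNORECASE)


def filter_rows(in_file, out_file, pattern, column=1):
    """Copy rows whose `column` contains none of the words; return how many were written."""
    reader = csv.reader(in_file)
    writer = csv.writer(out_file)
    kept = 0
    for row in reader:
        if len(row) <= column:
            continue  # assumption: rows without the target column are skipped
        if not pattern.search(row[column]):
            writer.writerow(row)  # write the full row, not just one field
            kept += 1
    return kept


de = ["just", "not", "really ", "hat"]
pattern = build_word_pattern(de)

# Hypothetical two-row input: an id column followed by the my_string column.
sample = io.StringIO(
    "1,This is just comparison of likely tokens\n"
    "2,what a tough thing?\n"
)
result = io.StringIO()
kept = filter_rows(sample, result, pattern)
print(kept)
print(result.getvalue())
```

With real files, open them as open(infile, newline='') and open(outfile, 'w', newline='') and pass the file objects to filter_rows in place of the StringIO objects.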