python regex list punctuation

Remove only the unknown words from a text but leave punctuation and digits


I have a French text in which some words have been split by a space (e.g. répu blique*). I want to remove these split words from the text and append them to a list, while keeping punctuation and digits in the text. My code correctly collects the split words, but it fails to keep the digits in the text.

import nltk
from nltk.tokenize import word_tokenize

import re

with open ('french_text.txt') as tx: 
#opening text containing the separated words
    #stores the text with the separated words
    text = word_tokenize(tx.read().lower()) 


with open ('Fr-dictionary.txt') as fr:  #opens the dictionary
    dic = word_tokenize(fr.read().lower()) #stores the first dictionary

pat=re.compile(r'[.?\-",:]+|\d+')

out_file=open("newtext.txt","w") #defining name of output file
valid_words=[ ] #empty list to append the words checked by the dictionary 
invalid_words=[ ] #empty list to append the errors found

for word in text:
    reg=pat.findall(word)
    if reg is True:
        valid_words.append(word)
    elif word in dic:
        valid_words.append(word)#appending to a list the words checked 
    else:
        invalid_words.append(word) #appending the invalid_words



a=' '.join(valid_words) #converting list into a string

print(a) #print converted list
print(invalid_words) #print errors found

out_file.write(a) #writing the output to a file

out_file.close()

So, with this code, my list of errors also includes the digits:

['ments', 'prési', 'répu', 'blique', 'diri', 'geants', '»', 'grand-est', 'elysée', 'emmanuel', 'macron', 'sncf', 'pepy', 'montparnasse', '1er', '2017.', 'geoffroy', 'hasselt', 'afp', 's', 'empare', 'sncf', 'grand-est', '26', 'elysée', 'emmanuel', 'macron', 'sncf', 'saint-dié', 'epinal', '23', '2018', 'etat', 's', 'vosges', '2018']

I think the problem is with the regular expression. Any suggestions? Thank you!!


Solution

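  • The problem is the if statement where you check reg is True. pat.findall(word) returns a list, and the is operator tests object identity, so a list is never True and the first branch never runs. That is why punctuation and digits fall through to the dictionary check and end up in invalid_words. Do not use is True to test whether the pattern matched a word.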

    You can do this instead:

    for word in text:
        if pat.match(word):
            valid_words.append(word)
        elif word in dic:
            valid_words.append(word)#appending to a list the words checked 
        else:
            invalid_words.append(word) #appending the invalid_words
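
To see the difference in isolation, here is a minimal, self-contained sketch. The token list and the small dictionary set are hypothetical stand-ins for the tokenized french_text.txt and Fr-dictionary.txt from the question; only the pattern is taken from the original code.

```python
import re

# Same pattern as in the question: runs of punctuation, or runs of digits.
pat = re.compile(r'[.?\-",:]+|\d+')

# Hypothetical stand-ins for the tokenized text and the dictionary file.
tokens = ['la', 'répu', 'blique', ',', '2018', 'est', '.']
dictionary = {'la', 'est'}  # assumption: a set, so `in` lookups are fast

# findall returns a list, and a list is never the object True,
# so the original `if reg is True` branch could never be taken.
always_false = pat.findall('2018') is True

valid_words, invalid_words = [], []
for word in tokens:
    # match returns a Match object (truthy) or None (falsy).
    if pat.match(word) or word in dictionary:
        valid_words.append(word)
    else:
        invalid_words.append(word)

print(always_false)
print(valid_words)
print(invalid_words)
```

Note that match only anchors at the start of the token, so a token such as '1er' is accepted because it begins with a digit. If you want to accept only tokens that consist entirely of punctuation or digits, pat.fullmatch(word) is a stricter alternative.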