python, nltk, data-cleaning, lemmatization, normalizing

How to lemmatize a .txt file rather than a sentence with pywsd.utils?


I am quite new to Python, which I am trying to learn for basic text analysis, topic modelling, etc.

I wrote the following code to clean my text file. I prefer pywsd.utils' lemmatize_sentence() function to NLTK's WordNetLemmatizer() because it produces cleaner text. The following code works fine with sentences:

from nltk.corpus import stopwords
from pywsd.utils import lemmatize_sentence
import string

s = "Dew drops fall from the leaves. Mary leaves the room. It's completed. Hello. This is trial. We went home. It was easier. We drank tea. These are Demo Texts. Right?"

lemm = lemmatize_sentence(s)
print (lemm)

stopword = stopwords.words('english') + list(string.punctuation)
removingstopwords = [word for word in lemm if word not in stopword]
print (removingstopwords, file=open("cleaned.txt","a"))

But what I fail to do is lemmatize a raw text file in a directory. I guess lemmatize_sentence() only accepts strings?

I manage to read the contents of a file with

with open ('a.txt',"r+", encoding="utf-8") as fin:
    lemm = lemmatize_sentence(fin.read())
print (lemm)

but this time the code fails to remove some tokens like "n't", "'ll", "'s", or "‘", as well as punctuation, which results in an uncleaned text.

1) What am I doing wrong? Should I tokenize first? (I also failed to feed lemmatize_sentence() with its results.)

2) How do I get the output file content without any formatting (words without single quotes and brackets)?

Any help is greatly appreciated. Thanks in advance.


Solution

  • Simply apply lemmatize_sentence to each line, one by one, and append the result to a string followed by a newline. Essentially this does the same thing, just line by line: each lemmatized line is appended to a temporary string, separated by newlines, and at the end we print that string. You can use the temporary string as the final output.

    my_temp_string = ""
    with open('a.txt', "r+", encoding="utf-8") as fin:
        for line in fin:
            # lemmatize one line at a time instead of the whole file
            lemm = lemmatize_sentence(line)
            my_temp_string += f'{lemm} \n'
    print(my_temp_string)
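On the two numbered questions: tokens like "n't" and "'s" survive because the tokenizer splits contractions into fragments that are not in NLTK's English stopword list, and curly quotes like "‘" are not in string.punctuation. A minimal sketch of a filter that extends the drop list (the EXTRA_TOKENS set and the clean_tokens helper are my own assumptions for illustration, not part of pywsd or NLTK); joining the surviving tokens with spaces also gives plain output without brackets or quotes:

```python
import string

# Assumed extension: contraction fragments and curly quotes that
# NLTK's stopword list and string.punctuation do not cover.
EXTRA_TOKENS = {"n't", "'ll", "'s", "'d", "'re", "'ve", "'m",
                "‘", "’", "“", "”", "``", "''"}

def clean_tokens(tokens, stopword_set):
    """Drop stopwords, punctuation, and contraction fragments."""
    drop = stopword_set | set(string.punctuation) | EXTRA_TOKENS
    return [t for t in tokens if t.lower() not in drop]

# Joining with spaces writes plain words, no list brackets or quotes:
# print(" ".join(clean_tokens(lemm, set(stopwords.words('english')))),
#       file=open("cleaned.txt", "a"))
```

For example, clean_tokens(["It", "'s", "done", "."], {"it"}) keeps only "done".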