Tags: python, nltk, lemmatization

Lemmatizing txt file and replacing only lemmatized words


I'm having trouble figuring out how to lemmatize the words in a txt file. I've gotten as far as listing the words, but I'm not sure how to lemmatize them from there.

Here's what I have:

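import re
import nltk
from nltk.stem.wordnet import WordNetLemmatizer

nltk.download('wordnet')

def lemfile():
    # read the whole file and lowercase it
    with open('1865-Lincoln.txt', 'r') as f:
        text = f.read().lower()
    # replace everything except letters, spaces and apostrophes with a space
    text = re.sub(r"[^a-z ']+", " ", text)
    words = text.split()
    return words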

Solution

  • Initialise a WordNetLemmatizer object and lemmatize each word on every line. You can do the file I/O in place with the fileinput module.

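    # https://stackoverflow.com/a/5463419/4909087
    import fileinput
    from nltk.stem.wordnet import WordNetLemmatizer

    lemmatizer = WordNetLemmatizer()
    for line in fileinput.input('1865-Lincoln.txt', inplace=True, backup='.bak'):
        # lemmatize each whitespace-separated token (treated as a noun by default)
        line = ' '.join(
            lemmatizer.lemmatize(w) for w in line.rstrip().split()
        )
        # stdout is redirected, so this overwrites the current `line` in the file
        print(line)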
    

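    With inplace=True, fileinput.input redirects stdout to the file being processed, so each print call writes into that file. Passing backup='.bak' keeps the original contents as 1865-Lincoln.txt.bak.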
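
To replace only the words and leave punctuation and spacing as they are, one option is to run a regex substitution over each line instead of splitting and rejoining it. The following is a minimal sketch of that idea and is not part of the original answer: rewrite_in_place, WORD, TOY_LEMMAS and toy_lemmatize are hypothetical names, and the toy lookup table only stands in for a real lemmatizer so the snippet runs without the WordNet data.

```python
import fileinput
import re

# A "word" here is a run of letters and apostrophes (an assumption for this sketch).
WORD = re.compile(r"[A-Za-z']+")

# Hypothetical stand-in for a real lemmatizer, for illustration only.
TOY_LEMMAS = {"wounds": "wound", "nations": "nation"}


def toy_lemmatize(word):
    """Return the toy lemma if one is known, otherwise the word unchanged."""
    return TOY_LEMMAS.get(word.lower(), word)


def rewrite_in_place(path, transform):
    """Replace every word in `path` with transform(word), keeping a .bak copy."""
    with fileinput.input(path, inplace=True, backup='.bak') as lines:
        for line in lines:
            # Only the matched words are replaced; punctuation, spacing and
            # the trailing newline are left as they were.
            new_line = WORD.sub(lambda m: transform(m.group()), line)
            # While inplace=True is active, print() writes into the file.
            print(new_line, end='')
```

With NLTK installed and the WordNet data downloaded, calling rewrite_in_place('1865-Lincoln.txt', WordNetLemmatizer().lemmatize) should apply the real lemmatizer in the same way. Keep in mind that lemmatize treats each word as a noun unless a pos argument is supplied.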