python, nltk

How to handle abbreviations when reading an NLTK corpus


I am reading an NLTK corpus using:

def read_corpus(package, category):
    """ Read files from corpus(package)'s category.
        Params:
            package (nltk.corpus): corpus
            category (string): category name
        Return:
            list of lists, with words from each of the processed files assigned with start and end tokens
    """
    files = package.fileids(category)
    # START_TOKEN and END_TOKEN are constants defined elsewhere in the module.
    return [[START_TOKEN] + [w.lower() for w in package.words(f)] + [END_TOKEN] for f in files]

But I find that it splits 'U.S.' into ['U', '.', 'S', '.'] and "I'm" into ['I', "'", 'm'].

How can I keep an abbreviation as a single token, or restore it after it has been split?


Solution

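  • package.words() returns tokens produced by the corpus reader's own word tokenizer. For NLTK's plain-text corpus readers the default is WordPunctTokenizer, which splits on every punctuation character; that is why "U.S." and "I'm" fall apart. To avoid this, read the raw text with package.raw(f) and tokenize it yourself. One option is NLTK's TreebankWordTokenizer, which follows the Penn Treebank conventions and expects text that has already been split into sentences. It keeps an abbreviation such as "U.S." together when it occurs inside a sentence, but it still splits contractions: "I'm" becomes ['I', "'m"]. If contractions must stay whole as well, use a custom pattern, for example with RegexpTokenizer. To restore tokens that have already been split, MWETokenizer can merge known sequences such as ('U', '.', 'S', '.') back into a single token.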
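
If both abbreviations and contractions should survive as single tokens, one possible approach is to tokenize the raw text with a custom regular expression. The sketch below is only an illustration, not the tokenizer NLTK uses: it relies on the standard re module, the pattern TOKEN_RE and the helper tokenize() are made up for this example, and the values of START_TOKEN and END_TOKEN are assumed because the question does not show them. It expects package to offer fileids() and raw(), as NLTK corpus readers do.

```python
import re

# Assumed values; the question does not show how these constants are defined.
START_TOKEN = "<START>"
END_TOKEN = "<END>"

# Hypothetical pattern, tried left to right at each position:
TOKEN_RE = re.compile(r"""
    (?:[A-Za-z]\.){2,}        # dotted abbreviations: U.S.  e.g.  i.e.
  | \w+(?:['’]\w+)*           # words, optionally with inner apostrophes: I'm  don't
  | [^\w\s]                   # any other single punctuation character
""", re.VERBOSE)


def tokenize(text):
    """Split raw text into tokens, keeping abbreviations and contractions whole."""
    return TOKEN_RE.findall(text)


def read_corpus(package, category):
    """Same shape as the original read_corpus, but tokenizes package.raw(f)
    instead of relying on package.words(f)."""
    files = package.fileids(category)
    return [
        [START_TOKEN] + [w.lower() for w in tokenize(package.raw(f))] + [END_TOKEN]
        for f in files
    ]


print(tokenize("I'm in the U.S. now."))
```

Note that when an abbreviation ends a sentence, this pattern keeps the final period as part of the abbreviation, so no separate sentence-ending '.' token is produced there. Whether that is acceptable depends on what the tokens are used for.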