python nlp

How to convert contractions back to their full words in NLP


I need to expand abbreviations back to their full forms using NLP,
for example what's to what is, it's to it is, etc.
I want to use this to preprocess raw sentences.

Actually, I'm also confused about whether I should do this at all, or simply remove the ' and convert what's to whats. Either way, is will be removed as a stop word in a later step.

On the other hand, should we treat whats and what as the same lemma?
Or should we use a stemmer to cut the s off?

BTW, I don't think abbreviation is the right term here, but my English isn't great either. So please tell me the formal NLP or linguistics term for what's, how's, etc.


Solution

  • Normally, NLP libraries such as spaCy and NLTK do a good job at tokenization, transforming something like "It's" into ["It", "'s"]. But transforming something like what's into ["what", "is"] is more problematic, because you can have examples such as "Amy's ballet studio", where the "'s" is not "is".

    You could map all the cases (he's, I'm, what's, etc.) and add new rules to the tokenizer; spaCy allows that:

    import spacy
    from spacy.symbols import ORTH, NORM

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("He's buying that")  # phrase to tokenize
    print([w.text for w in doc])  # ['He', "'s", 'buying', 'that']

    # Add a special-case rule. The ORTH values must join back to the original
    # string, and spaCy v3 only accepts ORTH and NORM here, so the expansion
    # goes into NORM instead of changing the token text.
    special_case = [{ORTH: "He"}, {ORTH: "'s", NORM: "is"}]
    nlp.tokenizer.add_special_case("He's", special_case)

    # Check the new tokenization: the "'s" token should now report "is" as its norm
    print([w.norm_ for w in nlp("He's buying that")])
    

    This gist does an extensive job of setting up those rules. But I'm not sure it is worth doing; it may not have much impact on the task you have at hand.
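
    If you only need plain-text preprocessing and don't want to touch the tokenizer, the "map all the cases" idea can also be sketched with a lookup table and a regular expression. This is a minimal sketch, not a complete solution: the `CONTRACTIONS` table and the `expand_contractions` helper are hypothetical names made up for illustration, the table is deliberately tiny, and it only handles the straight apostrophe ('). The ambiguous "'s" is left out on purpose, for the "Amy's ballet studio" reason above.

```python
import re

# Hypothetical, deliberately small mapping; extend it for your own corpus.
# "'s" is omitted on purpose: it can mean "is", "has", or a possessive.
CONTRACTIONS = {
    "i'm": "i am",
    "can't": "cannot",
    "won't": "will not",
    "don't": "do not",
    "isn't": "is not",
    "they're": "they are",
    "we've": "we have",
    "i'll": "i will",
}

# One alternation over all keys, matched case-insensitively on word boundaries.
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(k) for k in CONTRACTIONS) + r")\b",
    flags=re.IGNORECASE,
)


def expand_contractions(text):
    """Replace each known contraction with its expansion."""

    def _replace(match):
        found = match.group(0)
        expanded = CONTRACTIONS[found.lower()]
        # Preserve a leading capital, e.g. "Don't" -> "Do not".
        if found[0].isupper():
            expanded = expanded[0].upper() + expanded[1:]
        return expanded

    return _PATTERN.sub(_replace, text)


print(expand_contractions("I'm sure they're late, but Amy's studio isn't closed."))
# I am sure they are late, but Amy's studio is not closed.
```

    Anything not in the table, including every "'s" form, passes through unchanged, so you can decide separately whether to resolve those with a POS tagger or just leave them for the stop-word step.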