I need to expand abbreviations back to their full forms using NLP, for example "what's" to "what is", "it's" to "it is", etc.
I want to use this to preprocess raw sentences.
Actually, I'm also unsure whether I should do this at all, or simply remove the apostrophe and convert "what's" to "whats". Either way, "is" will be removed as a stop word in a later step.
On the other hand, should "whats" and "what" be treated as the same lemma? Or should we use a stemmer to cut the "s" off?
BTW, I don't think "abbreviation" is the right term here, but my English isn't great. So please tell me the formal NLP or linguistics term for forms like "what's", "how's", etc.
Normally, NLP libraries such as spaCy and NLTK do a good job at tokenization, transforming "It's" into ["It", "'s"]. But transforming something like "what's" into ["what", "is"] is more problematic, because you can have examples such as "Amy's ballet studio", where the "'s" is not "is".
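To make the ambiguity concrete, here is a minimal sketch in plain Python (no NLP library). The function name `naive_expand` is hypothetical and the rule is deliberately too simple: it rewrites every "'s" as " is", which works for the contraction but mangles the possessive.

```python
import re


def naive_expand(text: str) -> str:
    """Blindly rewrite every 's as ' is' (deliberately too simple)."""
    return re.sub(r"'s\b", " is", text)


print(naive_expand("What's the time?"))     # What is the time?
print(naive_expand("Amy's ballet studio"))  # Amy is ballet studio
```

The second result shows why a blanket rule for "'s" is unsafe without some notion of context.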
You could map all cases (he's, I'm, what's, etc.) and add new rules to the tokenizer; spaCy allows that:
import spacy
from spacy.symbols import ORTH, NORM

nlp = spacy.load("en_core_web_sm")
doc = nlp("He's buying that")  # phrase to tokenize
print([w.text for w in doc])  # ['He', "'s", 'buying', 'that']

# Add a special-case rule. spaCy requires the ORTH values to concatenate back
# to the original string, so the expansion "is" is stored in NORM instead.
# (Recent spaCy versions accept only ORTH and NORM in special cases.)
special_case = [{ORTH: "He"}, {ORTH: "'s", NORM: "is"}]
nlp.tokenizer.add_special_case("He's", special_case)

# Check the normalized form of each token under the new rule
print([w.norm_ for w in nlp("He's buying that")])  # the "'s" token now has norm_ "is"
This gist does an extensive job of setting up those rules. But I'm not sure it is worth doing; it may not have much impact on the task you have at hand.
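If a full tokenizer rule set is more than the task needs, a lighter option is a lookup table applied before tokenization. The sketch below is only an illustration: the `CONTRACTIONS` table and the `expand_contractions` helper are hypothetical names, the table is far from complete, and mapping "it's" or "he's" to "is" is a simplification (they can also mean "it has" or "he has"). Forms that are not listed, such as the possessive in "Amy's", are left untouched.

```python
import re

# Hypothetical, deliberately small lookup table; extend it for your data.
CONTRACTIONS = {
    "what's": "what is",
    "how's": "how is",
    "it's": "it is",    # simplification: could also be "it has"
    "he's": "he is",    # simplification: could also be "he has"
    "i'm": "i am",
    "don't": "do not",
    "can't": "cannot",
}

# One case-insensitive pattern that matches any listed form as a whole word.
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(key) for key in CONTRACTIONS) + r")\b",
    flags=re.IGNORECASE,
)


def expand_contractions(text: str) -> str:
    """Expand only the contractions listed in CONTRACTIONS."""

    def _replace(match: re.Match) -> str:
        original = match.group(0)
        expanded = CONTRACTIONS[original.lower()]
        # Keep a leading capital letter if the original had one.
        if original[0].isupper():
            expanded = expanded[0].upper() + expanded[1:]
        return expanded

    return _PATTERN.sub(_replace, text)


print(expand_contractions("What's Amy's plan? It's simple."))
# What is Amy's plan? It is simple.
```

Since "is" is likely to be dropped as a stop word later anyway, a table like this mostly matters for keeping the first token clean ("what" rather than "whats").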