I am running LDA on a number of texts. When I generated some visualizations of the produced topics, I found that the bigram "machine_learning" had been lemmatized both as "machine_learning" and "machine_learne". Here is as minimal a reproducible example as I can provide:
import en_core_web_sm
tokenized = [
    [
        'artificially_intelligent', 'funds', 'generating', 'excess', 'returns',
        'artificial_intelligence', 'deep_learning', 'compelling', 'reasons',
        'join_us', 'artificially_intelligent', 'fund', 'develop', 'ai',
        'machine_learning', 'capabilities', 'real', 'cases', 'big', 'players',
        'industry', 'discover', 'emerging', 'trends', 'latest_developments',
        'ai', 'machine_learning', 'industry', 'players', 'trading',
        'investing', 'live', 'investment', 'models', 'learn', 'develop',
        'compelling', 'business', 'case', 'clients', 'ceos', 'adopt', 'ai',
        'machine_learning', 'investment', 'approaches', 'rare', 'gathering',
        'talents', 'including', 'quants', 'data_scientists', 'researchers',
        'ai', 'machine_learning', 'experts', 'investment_officers', 'explore',
        'solutions', 'challenges', 'potential', 'risks', 'pitfalls',
        'adopting', 'ai', 'machine_learning'
    ],
    [
        'recent_years', 'topics', 'data_science', 'artificial_intelligence',
        'machine_learning', 'big_data', 'become_increasingly', 'popular',
        'growth', 'fueled', 'collection', 'availability', 'data',
        'continually', 'increasing', 'processing', 'power', 'storage', 'open',
        'source', 'movement', 'making', 'tools', 'widely', 'available',
        'result', 'already', 'witnessed', 'profound', 'changes', 'work',
        'rest', 'play', 'trend', 'increase', 'world', 'finance', 'impacted',
        'investment', 'managers', 'particular', 'join_us', 'explore',
        'data_science', 'means', 'finance_professionals'
    ]
]
nlp = en_core_web_sm.load(disable=['parser', 'ner'])
def lemmatization(descrips, allowed_postags=None):
    if allowed_postags is None:
        allowed_postags = ['NOUN', 'ADJ', 'VERB', 'ADV']
    lemmatized_descrips = []
    for descrip in descrips:
        doc = nlp(" ".join(descrip))
        lemmatized_descrips.append([
            token.lemma_ for token in doc if token.pos_ in allowed_postags
        ])
    return lemmatized_descrips
lemmatized = lemmatization(tokenized)
print(lemmatized)
As you will notice, "machine_learne" is found nowhere in the input tokenized, but both "machine_learning" and "machine_learne" are found in the output lemmatized.
What is the cause of this and can I expect it to cause issues with other bigrams/trigrams?
I think you misunderstood the process of POS Tagging and Lemmatization.
POS tagging is based on more information than the word alone: it also depends on the surrounding words (I don't know what your native language is, but this is common to many languages; for example, one commonly learned rule is that in many statements a verb is usually preceded by a noun, which represents the verb's agent).
When you pass all these 'tokens' to your lemmatizer, spaCy's lemmatizer will try to "guess" the part of speech of each solitary word.
In many cases it will default to a noun and, if the word is not in a lookup table of common and irregular nouns, it will fall back to generic rules (such as stripping a plural 's').
In other cases it will default to a verb based on certain patterns (the "-ing" ending), which is probably what happened here. Since no verb "machine_learning" exists in any dictionary (there is no instance of it in the model), spaCy takes the "else" route and applies generic rules.
Therefore, machine_learning is probably being lemmatized by a generic '"-ing" to "-e"' rule (as in making -> make, baking -> bake) that is common to many regular verbs.
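To make the mechanism concrete, here is a toy sketch of that kind of suffix rule; this is only an illustration, not spaCy's actual rule table:
# Toy sketch of the generic verb suffix rules a rule-based lemmatizer
# falls back on for words it has never seen (NOT spaCy's real rule table).
VERB_SUFFIX_RULES = [
    ('ing', 'e'),  # baking -> bake, making -> make
    ('ed', ''),    # jumped -> jump
    ('s', ''),     # runs -> run
]

def toy_verb_lemma(word):
    # Apply the first matching suffix rule; otherwise return the word unchanged.
    for suffix, replacement in VERB_SUFFIX_RULES:
        if word.endswith(suffix):
            return word[:-len(suffix)] + replacement
    return word

print(toy_verb_lemma('baking'))            # bake
print(toy_verb_lemma('machine_learning'))  # machine_learne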
Look at this test example:
for descrip in tokenized:
    doc = nlp(" ".join(descrip))
    print([
        (token.pos_, token.text) for token in doc
    ])
Output:
[('NOUN', 'artificially_intelligent'), ('NOUN', 'funds'), ('VERB', 'generating'), ('ADJ', 'excess'), ('NOUN', 'returns'), ('NOUN', 'artificial_intelligence'), ('NOUN', 'deep_learning'), ('ADJ', 'compelling'), ('NOUN', 'reasons'), ('PROPN', 'join_us'), ('NOUN', 'artificially_intelligent'), ('NOUN', 'fund'), ('NOUN', 'develop'), ('VERB', 'ai'), ('VERB', 'machine_learning'), ('NOUN', 'capabilities'), ('ADJ', 'real'), ('NOUN', 'cases'), ('ADJ', 'big'), ('NOUN', 'players'), ('NOUN', 'industry'), ('VERB', 'discover'), ('VERB', 'emerging'), ('NOUN', 'trends'), ('NOUN', 'latest_developments'), ('VERB', 'ai'), ('VERB', 'machine_learning'), ('NOUN', 'industry'), ('NOUN', 'players'), ('NOUN', 'trading'), ('VERB', 'investing'), ('ADJ', 'live'), ('NOUN', 'investment'), ('NOUN', 'models'), ('VERB', 'learn'), ('VERB', 'develop'), ('ADJ', 'compelling'), ('NOUN', 'business'), ('NOUN', 'case'), ('NOUN', 'clients'), ('NOUN', 'ceos'), ('VERB', 'adopt'), ('VERB', 'ai'), ('ADJ', 'machine_learning'), ('NOUN', 'investment'), ('NOUN', 'approaches'), ('ADJ', 'rare'), ('VERB', 'gathering'), ('NOUN', 'talents'), ('VERB', 'including'), ('NOUN', 'quants'), ('NOUN', 'data_scientists'), ('NOUN', 'researchers'), ('VERB', 'ai'), ('ADJ', 'machine_learning'), ('NOUN', 'experts'), ('NOUN', 'investment_officers'), ('VERB', 'explore'), ('NOUN', 'solutions'), ('VERB', 'challenges'), ('ADJ', 'potential'), ('NOUN', 'risks'), ('NOUN', 'pitfalls'), ('VERB', 'adopting'), ('VERB', 'ai'), ('NOUN', 'machine_learning')]
You are getting machine_learning tagged as a verb, an adjective, and a noun depending on context. But notice that simply concatenating the tokens produces a mess, because they are no longer ordered the way natural language is expected to be.
Not even a human can understand and correctly POS-tag this text:
artificially_intelligent funds generating excess returns artificial_intelligence deep_learning compelling reasons join_us artificially_intelligent fund develop ai machine_learning capabilities real cases big players industry discover emerging trends latest_developments ai machine_learning industry players trading investing live investment models learn develop compelling business case clients ceos adopt ai machine_learning investment approaches rare gathering talents including quants data_scientists researchers ai machine_learning experts investment_officers explore solutions challenges potential risks pitfalls adopting ai machine_learning
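If you want to see how the lemma follows whatever tag was guessed, you can filter the same loop to the bigram itself and print the tag next to the lemma (the exact tags you get may vary with the model version):
for descrip in tokenized:
    doc = nlp(" ".join(descrip))
    print([
        (token.pos_, token.lemma_) for token in doc
        if token.text == 'machine_learning'
    ])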