machine-learning nlp classification word-embedding

Open source pre-trained models for taxonomy/general word classification

Given any two words, I'd like to find out whether they share a taxonomic or semantic-field relationship. For example, given the words "Dog" and "Cat", I'd like a model that returns the categories both words belong to, such as "Animal", "Mammal", or "Pet".

Is there an open-source, pre-trained model that can do this out of the box, without requiring a training dataset?

Solution

WordNet sounds like a good fit for this task. WordNet is a lexical database that organises words in a hierarchical structure, like a taxonomy, and stores additional semantic information for many words. WordNet's web interface offers a browser-based demo, e.g. for "cat". A word that sits one level above another word in the hierarchy is called a 'hypernym'; for example, a hypernym of 'cat' is 'feline'. With WordNet in NLTK, you can climb the hypernyms of two words until you reach a hypernym they share.

For 'cat' and 'dog', one common hypernym is 'animal'. The example code below walks up the hierarchy by hand:

import nltk
from nltk.corpus import wordnet as wn

nltk.download('wordnet')  # one-time download of the WordNet data

wn.synsets('cat')
# output: [Synset('cat.n.01'), Synset('guy.n.01'), Synset('cat.n.03'), Synset('kat.n.01'),  Synset('cat-o'-nine-tails.n.01'), Synset('caterpillar.n.02'), ...]
wn.synset('cat.n.01').hypernyms()
# output: [Synset('feline.n.01')]
# Keep climbing: each call returns the synset queried on the next line.
wn.synset('feline.n.01').hypernyms()
wn.synset('carnivore.n.01').hypernyms()
wn.synset('placental.n.01').hypernyms()
wn.synset('mammal.n.01').hypernyms()
wn.synset('vertebrate.n.01').hypernyms()
wn.synset('chordate.n.01').hypernyms()
# output: [Synset('animal.n.01')]

wn.synsets('dog')
# output: [Synset('dog.n.01'), Synset('frump.n.01'), Synset('dog.n.03'), Synset('cad.n.01'), Synset('pawl.n.01'), Synset('chase.v.01')]
wn.synset('dog.n.01').hypernyms()
# output: [Synset('canine.n.02'), Synset('domestic_animal.n.01')]
wn.synset('domestic_animal.n.01').hypernyms()
# output: [Synset('animal.n.01')]
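
The manual walk above can be automated. Below is a minimal sketch of the "climb until the hypernyms meet" idea. It runs on a small toy graph hand-copied from the WordNet chains above, so it needs no downloads. The names `TOY_HYPERNYMS`, `ancestor_depths`, `shared_hypernyms` and `toy_parents` are illustrative and not part of NLTK, and the toy graph stops at 'animal' although WordNet continues further up.

```python
from collections import deque

# Toy hypernym graph copied by hand from the WordNet chains above.
# 'canine' is the second hypernym WordNet lists for dog.n.01.
TOY_HYPERNYMS = {
    "cat": ["feline"],
    "feline": ["carnivore"],
    "dog": ["canine", "domestic_animal"],
    "canine": ["carnivore"],
    "domestic_animal": ["animal"],
    "carnivore": ["placental"],
    "placental": ["mammal"],
    "mammal": ["vertebrate"],
    "vertebrate": ["chordate"],
    "chordate": ["animal"],
    "animal": [],
}


def ancestor_depths(node, parents):
    """Map every ancestor of `node` to its shortest distance from `node`."""
    depths = {}
    queue = deque((p, 1) for p in parents(node))
    while queue:
        current, dist = queue.popleft()
        if current in depths:
            continue  # already reached by a path at least as short
        depths[current] = dist
        queue.extend((p, dist + 1) for p in parents(current))
    return depths


def shared_hypernyms(a, b, parents):
    """Return the hypernyms common to `a` and `b`, closest first."""
    da, db = ancestor_depths(a, parents), ancestor_depths(b, parents)
    common = set(da) & set(db)
    return sorted(common, key=lambda n: (da[n] + db[n], n))


def toy_parents(node):
    """Look up the direct hypernyms of `node` in the toy graph."""
    return TOY_HYPERNYMS.get(node, [])


result = shared_hypernyms("cat", "dog", toy_parents)
print(result)
# ['carnivore', 'placental', 'mammal', 'animal', 'vertebrate', 'chordate']
```

With NLTK itself you probably don't need a helper like this: Synset objects provide `common_hypernyms()` and `lowest_common_hypernyms()` methods, e.g. `wn.synset('cat.n.01').lowest_common_hypernyms(wn.synset('dog.n.01'))`.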

Your question asks for a machine learning solution. A classical approach would be word vectors, e.g. via Gensim. However, they will not give you a clear common category backed by an expert-built database like WordNet; they only give you words that tend to appear in contexts similar to those of your target words ("cat", "dog") in the training data. I don't think machine learning is necessarily the best tool here. See the example:

import gensim.downloader as api

model_glove = api.load("glove-wiki-gigaword-100")

model_glove.most_similar(positive=["dog", "cat"], negative=None, topn=10)

# output: [('dogs', 0.7998143434524536),
#  ('pet', 0.7550237774848938),
#  ('puppy', 0.7239114046096802),
#  ('rabbit', 0.7165164351463318),
#  ('cats', 0.7114559412002563),
#  ('monkey', 0.6967265605926514),
#  ('horse', 0.6890867948532104),
#  ('animal', 0.6713783740997314),
#  ('mouse', 0.6644925475120544),
#  ('boy', 0.6607726812362671)]
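
To see why the list mixes categories ('pet', 'animal') with siblings ('rabbit', 'horse'), it helps to look at what such a query roughly computes: the query vectors are averaged and every other word is ranked by cosine similarity to that average. The sketch below is a simplified pure-Python stand-in, not Gensim's actual implementation. The 3-dimensional vectors in `TOY_VECTORS` are made up for illustration and do not come from any trained model.

```python
import math

# Made-up 3-d vectors for illustration only. Real GloVe vectors from
# "glove-wiki-gigaword-100" have 100 dimensions and different values.
TOY_VECTORS = {
    "dog":    [0.9, 0.8, 0.1],
    "cat":    [0.8, 0.9, 0.2],
    "pet":    [0.7, 0.7, 0.3],
    "animal": [0.5, 0.6, 0.6],
    "car":    [0.1, 0.0, 0.9],
}


def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return sum(a * b for a, b in zip(unit(u), unit(v)))


def most_similar(positive, vectors, topn=3):
    """Rank words by cosine similarity to the mean of the query vectors."""
    normed = [unit(vectors[w]) for w in positive]
    mean = [sum(col) / len(normed) for col in zip(*normed)]
    scores = [
        (word, cosine(mean, vec))
        for word, vec in vectors.items()
        if word not in positive  # the query words themselves are skipped
    ]
    return sorted(scores, key=lambda item: item[1], reverse=True)[:topn]


ranking = most_similar(["dog", "cat"], TOY_VECTORS)
print([word for word, _ in ranking])
# ['pet', 'animal', 'car']
```

Nothing in this ranking encodes an "is a" relation: a word scores high simply because its vector points in a similar direction, which is why hypernyms, siblings and loosely related words all end up in the same list.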