I am using WordNet to access synonyms that share a common meaning. Here is an example:
from itertools import chain
from nltk.corpus import wordnet as wn
synsets = wn.synsets("drink")
# synsets = [Synset('drink.n.01'), Synset('drink.n.02'), Synset('beverage.n.01'), ...]
synonyms = set(chain(*(x.lemma_names() for x in synsets)))
# synonyms = {'drinking', 'drinkable', 'crapulence', 'toast', 'drink', 'drunkenness', ...}
Are synsets sorted? And if they are, by what criteria? Are the first synsets in the list the ones most likely to be related to the given word?
I would like to limit the number of synonyms by keeping only the "most important" ones (what "important" means in this context is open to definition, but I wonder whether WordNet has its own notion of importance).
If synsets are not sorted, what could be an alternative way to find the most appropriate synonyms of a word?
The documentation has a relevant section: https://www.nltk.org/howto/wordnet.html#similarity
Various similarity-finding methods are provided: path_similarity, lch_similarity, wup_similarity, res_similarity, etc. For example, from the documentation (for path_similarity):
synset1.path_similarity(synset2): Return a score denoting how similar two word senses are, based on the shortest path that connects the senses in the is-a (hypernym/hyponym) taxonomy. The score is in the range 0 to 1.
You can use the method as follows:
# Compare every synset of "drink" against its 0th synset
syn_to_compare = wn.synsets("drink")[0]
all_synsets = wn.synsets("drink")
corr = [(s, syn_to_compare.path_similarity(s)) for s in all_synsets]
This generates output like:
[(Synset('drink.n.01'), 1.0), (Synset('drink.n.02'), 0.06666666666666667), (Synset('beverage.n.01'), 0.08333333333333333), (Synset('drink.n.04'), 0.09090909090909091), (Synset('swallow.n.02'), 0.07692307692307693), (Synset('drink.v.01'), None), (Synset('drink.v.02'), None), (Synset('toast.v.02'), None), (Synset('drink_in.v.01'), None), (Synset('drink.v.05'), None)]
You can then sort the pairs with the built-in sorted(), using the similarity score as the key; path_similarity() returns None when no connecting path exists (e.g. between a noun sense and a verb sense), so None is mapped to 0 here:
sorted(corr, key=lambda x: x[1] if x[1] is not None else 0, reverse=True)
[(Synset('drink.n.01'), 1.0), (Synset('drink.n.04'), 0.09090909090909091), (Synset('beverage.n.01'), 0.08333333333333333), (Synset('swallow.n.02'), 0.07692307692307693), (Synset('drink.n.02'), 0.06666666666666667), (Synset('drink.v.01'), None), (Synset('drink.v.02'), None), (Synset('toast.v.02'), None), (Synset('drink_in.v.01'), None), (Synset('drink.v.05'), None)]
If you want to deal with proper nouns, I suggest looking into gensim's most_similar() method.
Are synsets sorted? And if they are, by what criteria? Are the first synsets in the list the ones most likely to be related to the given word?
I cannot answer this question decisively; however, I don't think there is such a criterion. You can use the method above to find the most similar words relative to a particular synset.
Edit: As mentioned in the comments below, the author of the question was looking for the ordering of the list returned by WordNet's synsets() method. From the code available on GitHub (https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/wordnet.py#L1563), synsets() is implemented as:
if lang == "eng":
    get_synset = self.synset_from_pos_and_offset
    index = self._lemma_pos_offset_map
    if pos is None:
        pos = POS_LIST
    return [
        get_synset(p, offset)
        for p in pos
        for form in self._morphy(lemma, p, check_exceptions)
        for offset in index[form].get(p, [])
    ]
where POS_LIST has the value POS_LIST = [NOUN, VERB, ADJ, ADV]. Therefore, preference is given to that order. Furthermore, according to their code: NOUN = "n", VERB = "v", ADJ = "a", ADV = "r".
So the order depends primarily on the part-of-speech tag, following POS_LIST; then on what the method _morphy() returns for the lemma and pos tag; and finally on the offsets stored in _lemma_pos_offset_map.
For example:
>>> POS_LIST = ["n", "v", "a", "r"]
>>> syn = list()
>>> lemma = "drink"
>>> for p in POS_LIST:
...     for form in wn._morphy(lemma, p, True):
...         for offset in wn._lemma_pos_offset_map[form].get(p, []):
...             syn.append(wn.synset_from_pos_and_offset(p, offset))
...
>>> syn
[Synset('drink.n.01'), Synset('drink.n.02'), Synset('beverage.n.01'), Synset('drink.n.04'), Synset('swallow.n.02'), Synset('drink.v.01'), Synset('drink.v.02'), Synset('toast.v.02'), Synset('drink_in.v.01'), Synset('drink.v.05')]
>>> # You can verify it with what synsets() is providing
>>> wn.synsets("drink")
[Synset('drink.n.01'), Synset('drink.n.02'), Synset('beverage.n.01'), Synset('drink.n.04'), Synset('swallow.n.02'), Synset('drink.v.01'), Synset('drink.v.02'), Synset('toast.v.02'), Synset('drink_in.v.01'), Synset('drink.v.05')]
Hope the updated answer is helpful!