python nlp nltk wordnet

How to find abstractness of a word using hyper-/hyponyms in wordnet?


I have two words, say computer and tool. Computer is a concrete noun, whereas tool is relatively abstract. I want a measure of abstractness for each word that reflects this. I thought the best way to get it would be to count the number of hypernyms/hyponyms for each word.

  1. Is it possible?
  2. Is there a better way to do it?

Solution

  • The first problem is: which meaning of computer are you referring to?

    In WordNet, a word can have several "concepts", a.k.a. synsets:

    >>> from nltk.corpus import wordnet as wn
    
    >>> wn.synsets('computer')
    [Synset('computer.n.01'), Synset('calculator.n.01')]
    
    >>> wn.synsets('computer')[0].definition()
    'a machine for performing calculations automatically'
    >>> wn.synsets('computer')[1].definition()
    'an expert at calculation (or at operating calculating machines)'
    

    And hyper-/hyponyms are not connected to the word computer.

    Hyper-/hyponyms are relations between concepts, i.e. synsets, so they are not attached to the word form itself but to each of the synsets that the word computer can represent, e.g.

    >>> type(wn.synsets('computer')[0])
    <class 'nltk.corpus.reader.wordnet.Synset'>
    
    >>> wn.synsets('computer')[0].hypernyms()
    [Synset('machine.n.01')]
    
    >>> wn.synsets('computer')[0].hyponyms()
    [Synset('analog_computer.n.01'), Synset('digital_computer.n.01'), Synset('home_computer.n.01'), Synset('node.n.08'), Synset('number_cruncher.n.02'), Synset('pari-mutuel_machine.n.01'), Synset('predictor.n.03'), Synset('server.n.03'), Synset('turing_machine.n.01'), Synset('web_site.n.01')]
    

    Yes, that's a lot of information, but how do I get hyper-/hyponyms for words?

    According to the definition, should words have hyper-/hyponyms? Or should concepts have hyper-/hyponyms?

    Fine, you're bringing me in circles... Just tell me how to use hyper-/hyponyms to see if a word is more abstract than another word!!!

    Okay, then we have to make some assumptions.

    1. Let's treat all synsets of a word in WordNet as one "holistic" concept of that word form.

    2. We consider the sum of all DIRECT hyper-/hyponyms over all synsets of the given word.

    3. Based on the number of hyper-/hyponyms of all synsets that a certain word form can represent, we deduce that word X is more/less abstract than word Y.

    But how do I do (1), (2) and (3) in code?

    >>> hypernym_count = lambda word: sum(len(ss.hypernyms()) for ss in wn.synsets(word)) 
    >>> hyponym_count = lambda word: sum(len(ss.hyponyms()) for ss in wn.synsets(word)) 
    
    >>> hyponym_count('computer')
    14
    >>> hypernym_count('computer')
    2
    
    
    >>> hypernym_count('tool')
    8
    >>> hyponym_count('tool')
    32
    

    Since (3) is the hypothesis you want to test, you should be the one to decide which heuristic to use to deduce whether a word is more or less abstract from the hyponym_count and hypernym_count results.

    Wait a minute, what are DIRECT hyper-/hyponyms?
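
To illustrate why the choice of heuristic matters, here is a minimal sketch that applies two made-up scoring rules to the direct counts printed above. The names `raw_hyponym_score`, `hyponym_ratio_score` and `more_abstract` are hypothetical helpers, not part of NLTK, and neither rule is a validated measure of abstractness.

```python
# Direct hyper-/hyponym counts copied from the session above.
counts = {
    "computer": {"hypernyms": 2, "hyponyms": 14},
    "tool": {"hypernyms": 8, "hyponyms": 32},
}


def raw_hyponym_score(c):
    """Hypothetical heuristic 1: more direct hyponyms means more abstract."""
    return c["hyponyms"]


def hyponym_ratio_score(c):
    """Hypothetical heuristic 2: share of hyponyms among all direct links."""
    total = c["hyponyms"] + c["hypernyms"]
    return c["hyponyms"] / total if total else 0.0


def more_abstract(score, word_a, word_b):
    """Return whichever word scores higher under the given heuristic."""
    return max((word_a, word_b), key=lambda w: score(counts[w]))


print(more_abstract(raw_hyponym_score, "computer", "tool"))    # tool
print(more_abstract(hyponym_ratio_score, "computer", "tool"))  # computer
```

With these counts the two rules disagree (32 vs. 14 hyponyms favours tool, while 14/16 vs. 32/40 favours computer), so the conclusion depends entirely on which heuristic you commit to.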

    We're only accessing the hyper-/hyponyms one level above/below the synset. That's what "direct" means here.

    As for how to get all the hyponyms below a synset, see https://stackoverflow.com/a/42012001/610569

    So should I use the direct hyponyms, all hyponyms below, or all hypernyms above?
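
For reference, one way to collect everything below a synset is a plain breadth-first walk over `hyponyms()`. The sketch below is not the code from the linked answer; `all_related` and `total_hyponym_count` are hypothetical helper names, and the functions only assume objects that are hashable and expose a `hyponyms()` method, as NLTK synsets do.

```python
from collections import deque


def all_related(synset, relation):
    """Collect every synset reachable from `synset` by repeatedly applying
    `relation` (e.g. lambda s: s.hyponyms()), without duplicates."""
    seen = set()
    queue = deque(relation(synset))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue  # already reached through another path
        seen.add(current)
        queue.extend(relation(current))
    return seen


def total_hyponym_count(word, lookup):
    """Sum the number of transitive hyponyms over all synsets of `word`.

    `lookup` maps a word to its synsets (assumption: something like
    nltk.corpus.wordnet.synsets is passed in by the caller).
    """
    return sum(
        len(all_related(ss, lambda s: s.hyponyms())) for ss in lookup(word)
    )
```

With NLTK and the WordNet corpus installed, `total_hyponym_count('computer', wn.synsets)` would apply this to the synsets shown earlier. NLTK's `Synset.closure()` method provides a comparable built-in traversal.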

    That's for you to find out and tell us =) Have fun!
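
If you want to try the "all hypernyms above" option, one possible starting point is the length of the shortest hypernym chain from a synset up to a root, on the untested assumption that synsets closer to the root are more abstract. The sketch below is only an illustration; `min_depth` and `shallowest_sense_depth` are hypothetical helpers that rely solely on a `hypernyms()` method.

```python
def min_depth(synset):
    """Length of the shortest hypernym chain from `synset` up to a root
    (a synset with no hypernyms). A root has depth 0."""
    parents = synset.hypernyms()
    if not parents:
        return 0
    return 1 + min(min_depth(p) for p in parents)


def shallowest_sense_depth(word, lookup):
    """Smallest min_depth over all synsets of `word`, or None if the word
    has no synsets. `lookup` maps a word to its synsets (assumption:
    something like nltk.corpus.wordnet.synsets is passed in)."""
    return min((min_depth(ss) for ss in lookup(word)), default=None)
```

NLTK synsets also ship their own `min_depth()` and `max_depth()` methods, which additionally follow instance hypernyms, so their values may differ slightly from this sketch. Taking the shallowest sense is itself a design choice; averaging over senses or picking one sense explicitly are equally defensible.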