
Merge related words in NLP


I'd like to define a new word which includes count values from two (or more) different words. For example:

    Words      Frequency
0   mom        250
1   2020       151
2   the        124
3   19         82
4   mother     81
... ...        ...
10  London     6
11  life       6
12  something  6

I would like to define mother as mom + mother:

    Words      Frequency
0   mother     331
1   2020       151
2   the        124
3   19         82
... ...        ...
9   London     6
10  life       6
11  something  6

This is an alternative way to define a group of words that share a meaning (at least for my purposes).

Any suggestions would be appreciated.


Solution

  • UPDATE 10-21-2020

    I decided to build a Python module to handle the tasks that I outlined in this answer. The module is called wordhoard and can be downloaded from PyPI.


    I have attempted to use Word2vec and WordNet in projects where I needed to determine the frequency of a keyword (e.g., healthcare) and the keyword's synonyms (e.g., wellness program, preventive medicine). I found that most NLP libraries didn't produce the results that I needed, so I decided to build my own dictionary with custom keywords and synonyms. This approach has worked for both analyzing and classifying text in multiple projects.

    I'm sure that someone who is well versed in NLP technology might have a more robust solution, but the one below is similar to ones that have worked for me time and time again.

    I coded my answer to match the Words Frequency data you had in your question, but it can be modified to use any keyword and synonyms dataset.

    import string
    
    # Python Dictionary
    # I manually created these word relationships - primary_word: synonyms
    word_relationship = {"father": ['dad', 'daddy', 'old man', 'pa', 'pappy', 'papa', 'pop'],
              "mother": ["mamma", "momma", "mama", "mammy", "mummy", "mommy", "mom", "mum"]}
    
    # This input text is from various poems about mothers and fathers
    input_text = 'The hand that rocks the cradle also makes the house a home. It is the prayers of the mother ' \
             'that keeps the family strong. When I think about my mum, I just cannot help but smile; The beauty of ' \
             'her loving heart, the easy grace in her style. I will always need my mom, regardless of my age. She ' \
             'has made me laugh, made me cry. Her love will never fade. If I could write a story, It would be the ' \
             'greatest ever told. I would write about my daddy, For he had a heart of gold. For my father, my friend, ' \
             'This to me you have always been. Through the good times and the bad, Your understanding I have had.'
    
    # converts the input text to lowercase and splits the words based on empty space.
    wordlist = input_text.lower().split()
    
    # remove all punctuation from the wordlist
    remove_punctuation = [''.join(ch for ch in s if ch not in string.punctuation)
                          for s in wordlist]
    
    # count how often each word occurs in the cleaned word list
    word_frequencies = {}
    for w in remove_punctuation:
        word_frequencies[w] = word_frequencies.get(w, 0) + 1
    
    word_matches = []
    
    # loop through the dictionaries
    for word, frequency in word_frequencies.items():
       for keyword, synonym in word_relationship.items():
          match = [x for x in synonym if word == x]
          if word == keyword or match:
            match = ' '.join(map(str, match))
            # append the keywords (mother), synonyms(mom) and frequencies to a list
            word_matches.append([keyword, match, frequency])
    
    # used to hold the final keyword and frequencies
    final_results = {}
    
    # list comprehension to obtain the primary keyword and its frequencies
    synonym_matches = [(keyword[0], keyword[2]) for keyword in word_matches]
    
    # iterate synonym_matches and output total frequency count for a specific keyword
    # a per-keyword running total, so the result does not depend on the order of the matches
    for keyword, frequency in synonym_matches:
      final_results[keyword] = final_results.get(keyword, 0) + frequency
    
     
    print(final_results)
    # output
    {'mother': 3, 'father': 2}
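
    If you already have a frequency table like the one in the question, rather than raw text, the same dictionary can be applied to it directly. The following is a minimal sketch, assuming the table is available as a plain dict of word to count (only the rows shown in the question are included). The names frequencies, synonym_to_keyword and merged are illustrative and not part of the code above.

```python
from collections import Counter

# Frequency table from the question (only the rows that were shown there)
frequencies = {"mom": 250, "2020": 151, "the": 124, "19": 82, "mother": 81,
               "London": 6, "life": 6, "something": 6}

# primary_word: synonyms, same shape as word_relationship above
word_relationship = {"father": ["dad", "daddy", "old man", "pa", "pappy", "papa", "pop"],
                     "mother": ["mamma", "momma", "mama", "mammy", "mummy", "mommy", "mom", "mum"]}

# Invert the mapping once so that each synonym points to its primary word
synonym_to_keyword = {synonym: keyword
                      for keyword, synonyms in word_relationship.items()
                      for synonym in synonyms}

merged = Counter()
for word, count in frequencies.items():
    # words without an entry in the mapping keep their own name
    merged[synonym_to_keyword.get(word.lower(), word)] += count

# print the merged table from the highest to the lowest frequency
for word, count in merged.most_common():
    print(word, count)
```

    With the question's numbers, mom (250) and mother (81) end up in a single mother row with a count of 331, and every other row is unchanged. Inverting the dictionary first means each word needs one lookup instead of a scan over every synonym list.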
    

    Other Methods

    Below are some other methods and their out-of-box output.


    NLTK WORDNET

    In this example, I looked up the synonyms for the word 'mother.' Note that WordNet does not have the synonyms 'mom' or 'mum' linked to the word mother. These two words are within my sample text above. Also note that the word 'father' is listed as a synonym for 'mother.'

    from nltk.corpus import wordnet
    
    synonyms = []
    word = 'mother'
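    for synonym in wordnet.synsets(word):
       for item in synonym.lemmas():
          # synonym.name() returns a synset name such as 'mother.n.01', so the first
          # comparison is always true and 'mother' itself still appears in the output
          if word != synonym.name() and len(synonym.lemma_names()) > 1:
            synonyms.append(item.name())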
    
    print(synonyms)
    # output
    ['mother', 'female_parent', 'mother', 'fuss', 'overprotect', 'beget', 'get', 'engender', 'father', 'mother', 'sire', 'generate', 'bring_forth']
    

    PyDictionary

    In this example, I looked up the synonyms for the word 'mother' using PyDictionary, which queries synonym.com. The synonyms in this example include the words 'mom' and 'mum.' This example also includes additional synonyms that WordNet did not generate.

    However, PyDictionary also produced a synonym list for 'mum' that has nothing to do with the word 'mother.' It seems that PyDictionary pulled this list from the adjective section of the page instead of the noun section. It's hard for a computer to distinguish between the adjective mum and the noun mum.

    from PyDictionary import PyDictionary
    dictionary_mother = PyDictionary('mother')
    
    print(dictionary_mother.getSynonyms())
    # output 
    [{'mother': ['mother-in-law', 'female parent', 'supermom', 'mum', 'parent', 'mom', 'momma', 'para I', 'mama', 'mummy', 'quadripara', 'mommy', 'quintipara', 'ma', 'puerpera', 'surrogate mother', 'mater', 'primipara', 'mammy', 'mamma']}]
    
    dictionary_mum = PyDictionary('mum')
    
    print(dictionary_mum.getSynonyms())
    # output 
    [{'mum': ['incommunicative', 'silent', 'uncommunicative']}]
    

    Other possible approaches include using the Oxford Dictionary API or querying thesaurus.com. Both of these methods also have pitfalls. For instance, the Oxford Dictionary API requires an API key and a paid subscription based on query volume, and thesaurus.com is missing potential synonyms that could be useful in grouping words.

    https://www.thesaurus.com/browse/mother
    synonyms: mom, parent, ancestor, creator, mommy, origin, predecessor, progenitor, source, child-bearer, forebearer, procreator
    

    UPDATE

    Producing a precise synonym list for each potential word in your corpus is hard and will require a multi-pronged approach. The code below uses WordNet and PyDictionary to create a superset of synonyms. Like all the other answers, this combined method also leads to some over-counting of word frequencies. I've been trying to reduce this over-counting by combining key and value pairs within my final dictionary of synonyms. That problem is much harder than I anticipated and might require me to open my own question to solve. In the end, I think you need to determine which approach works best for your use case, and you will likely need to combine several approaches.

    Thanks for posting this question, because it allowed me to look at other methods for solving a complex problem.

    from string import punctuation
    from nltk.corpus import stopwords
    from nltk.corpus import wordnet
    from PyDictionary import PyDictionary
    
    input_text = """The hand that rocks the cradle also makes the house a home. It is the prayers of the mother
             that keeps the family strong. When I think about my mum, I just cannot help but smile; The beauty of
             her loving heart, the easy grace in her style. I will always need my mom, regardless of my age. She
             has made me laugh, made me cry. Her love will never fade. If I could write a story, It would be the
             greatest ever told. I would write about my daddy, For he had a heart of gold. For my father, my friend,
             This to me you have always been. Through the good times and the bad, Your understanding I have had."""
    
    
    def normalize_textual_information(text):
       # split text into tokens by white space
       token = text.split()
    
       # remove punctuation from each token
       table = str.maketrans('', '', punctuation)
       token = [word.translate(table) for word in token]
    
       # remove any tokens that are not alphabetic
       token = [word.lower() for word in token if word.isalpha()]
    
       # filter out English stop words
       stop_words = set(stopwords.words('english'))
    
       # you could add additional stops like this
       stop_words.add('cannot')
       stop_words.add('could')
       stop_words.add('would')
    
       token = [word for word in token if word not in stop_words]
    
       # filter out any short tokens
       token = [word for word in token if len(word) > 1]
       return token
    
    
    def generate_word_frequencies(words):
       # list to hold word frequencies
       word_frequencies = []
    
       # loop through the tokens and generate a word count for each token
       for word in words:
          word_frequencies.append(words.count(word))
    
       # aggregate the words and word_frequencies into tuples and convert them into a dictionary
       word_frequencies = (dict(zip(words, word_frequencies)))
    
       # sort the frequency of the words from low to high
       sorted_frequencies = {key: value for key, value in
                             sorted(word_frequencies.items(), key=lambda item: item[1])}
    
       return sorted_frequencies
    
    
    def get_synonyms_internet(word):
       dictionary = PyDictionary(word)
       synonym = dictionary.getSynonyms()
       return synonym
    
     
    words = normalize_textual_information(input_text)
    
    all_synsets_1 = {}
    for word in words:
      for synonym in wordnet.synsets(word):
        if word != synonym.name() and len(synonym.lemma_names()) > 1:
          for item in synonym.lemmas():
            if word != item.name():
              all_synsets_1.setdefault(word, []).append(str(item.name()).lower())
    
    all_synsets_2 = {}
    for word in words:
      word_synonyms = get_synonyms_internet(word)
      for synonym in word_synonyms:
        if word != synonym and synonym is not None:
          all_synsets_2.update(synonym)
    
    word_relationship = {**all_synsets_1, **all_synsets_2}
    
    frequencies = generate_word_frequencies(words)
    word_matches = []
    word_set = {}
    duplication_check = set()
    
    for word, frequency in frequencies.items():
       for keyword, synonym in word_relationship.items():
          match = [x for x in synonym if word == x]
          if word == keyword or match:
            match = ' '.join(map(str, match))
            if match not in word_set or match not in duplication_check or word not in duplication_check:
               duplication_check.add(word)
               duplication_check.add(match)
               word_matches.append([keyword, match, frequency])
    
    # used to hold the final keyword and frequencies
    final_results = {}
    
    # list comprehension to obtain the primary keyword and its frequencies
    synonym_matches = [(keyword[0], keyword[2]) for keyword in word_matches]
    
    # add each match's frequency to the running total for its keyword
    for keyword, frequency in synonym_matches:
       final_results[keyword] = final_results.get(keyword, 0) + frequency
    
    # do something with the final results
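
    One possible way to attack the over-counting mentioned in this update is to merge every keyword and synonym list that shares a word into a single group, so that each word is counted under exactly one key. The following is only a sketch of that idea using a small union-find structure. It is not what the code above does, and merge_synonym_groups, example and totals are hypothetical names introduced for illustration.

```python
def merge_synonym_groups(word_relationship):
    """Merge keyword/synonym lists that share any word into disjoint groups."""
    parent = {}

    def find(word):
        # follow parent links to the group's representative word
        parent.setdefault(word, word)
        while parent[word] != word:
            parent[word] = parent[parent[word]]  # shorten the path as we go
            word = parent[word]
        return word

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for keyword, synonyms in word_relationship.items():
        find(keyword)
        for synonym in synonyms:
            union(keyword, synonym)

    # collect every word under its representative
    groups = {}
    for word in parent:
        groups.setdefault(find(word), set()).add(word)
    return groups


# Hand-written example: 'mother' and 'mom' are both keys and list each other
example = {
    "mother": ["mom", "mum", "female_parent"],
    "mom": ["mother", "mommy"],
    "father": ["dad", "daddy"],
}
frequencies = {"mother": 1, "mum": 1, "mom": 1, "daddy": 1, "father": 1}

groups = merge_synonym_groups(example)

# each word now contributes to exactly one total
totals = {root: sum(frequencies.get(word, 0) for word in members)
          for root, members in groups.items()}
print(totals)
```

    Be aware that this kind of merging is transitive, so one bad link joins two groups completely. With the WordNet output shown earlier, where 'father' is listed as a synonym for 'mother,' both families would collapse into one group, so the synonym lists would still need to be filtered first.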