python, string, text, count, python-collections

Python - count word frequency of string from list, number of words from list varies


I am trying to create a program that runs through a list of mental health terms, looks in a research abstract, and counts the number of times each word or phrase appears. I can get this to work with single words, but I'm struggling with multi-word terms. I tried using NLTK ngrams too, but since the number of words per term varies (i.e., not all terms from the mental health list are bigrams or trigrams), I couldn't get that to work either.

I want to emphasize that I know splitting on whitespace will only let single words be counted; I'm just stuck on how to deal with a varying number of words per term when counting in the abstract.

Thanks!

from collections import Counter

abstracts = ['This is a mental health abstract about anxiety and bipolar disorder as well as other things.',
             'While this abstract is not about ptsd or any trauma-related illnesses, it does have a mental health focus.']

for x2 in abstracts:

    mh_terms = ['bipolar disorder', 'anxiety', 'substance abuse disorder',
                'ptsd', 'schizophrenia', 'mental health']

    c = Counter(s.lower().replace('.', '') for s in x2.split())
    for term in mh_terms:
        term = term.replace(',', '')
        term = term.replace('.', '')
        xx = (term, c.get(term, 0))

    mh_total_occur = sum(c.get(v, 0) for v in mh_terms)
    print(mh_total_occur)

In my example, both abstracts get a count of 1, but I want a count of 2.


Solution

  • The problem is that you will never match "mental health", because you are only counting occurrences of single words produced by splitting on the " " character.
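
    A quick, minimal check of that claim, using the first abstract from the question and the same split-based Counter:

    from collections import Counter

    text = ('This is a mental health abstract about anxiety '
            'and bipolar disorder as well as other things.')
    c = Counter(w.lower().replace('.', '') for w in text.split())
    print(c['anxiety'])        # 1 -- single tokens are counted
    print(c['mental health'])  # 0 -- no whitespace-split token can contain a space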

    I don't know if a Counter is the right tool here. If you need a highly scalable and indexable solution, n-grams are probably the way to go (there's a sketch at the end of this answer), but for small-to-medium problems regex pattern matching should be quick enough.

    import re
    
    abstracts = [
        'This is a mental health abstract about anxiety and bipolar disorder as well as other things.',
        'While this abstract is not about ptsd or any trauma-related illnesses, it does have a mental health focus.'
    ]
    
    mh_terms = [
        'bipolar disorder', 'anxiety', 'substance abuse disorder',
        'ptsd', 'schizophrenia', 'mental health'
    ]
    
    def _regex_word(text):
        """Wrap text in \\b anchors so it only matches whole words."""
        return r'\b{}\b'.format(text)
    
    def _normalize(text):
        """Remove any character that is not alphanumeric or a space."""
        return re.sub('[^a-z0-9 ]', '', text.lower())
    
    
    normed_terms = [_normalize(term) for term in mh_terms]
    
    
    for raw_abstract in abstracts:
        print('--------')
        normed_abstract = _normalize(raw_abstract)
    
        # Search for all occurrences of chosen terms
        found = {}
        for norm_term in normed_terms:
            pattern = _regex_word(norm_term)
            found[norm_term] = len(re.findall(pattern, normed_abstract))
        print('found = {!r}'.format(found))
        mh_total_occur = sum(found.values())
        print('mh_total_occur = {!r}'.format(mh_total_occur))
    

    I added helper functions and comments to make it clear what is going on.

    Using the \b word-boundary metacharacter is important in general use cases because it prevents a search term like "miss" from matching inside words like "dismiss".
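
    A quick check of that boundary behaviour:

    import re

    print(re.findall(r'\bmiss\b', 'miss dismiss missed'))  # ['miss']

    If you do end up needing the n-gram route the question mentions (for example, to index a large number of abstracts), the trick for terms of varying length is to generate n-grams for every distinct word count that occurs in the term list, rather than a single fixed n. A rough sketch of that idea, reusing abstracts, normed_terms, and _normalize from above (plain Python, no NLTK required):

    from collections import Counter

    # distinct term lengths, e.g. {1, 2, 3} for the question's term list
    term_lengths = {len(term.split()) for term in normed_terms}

    for raw_abstract in abstracts:
        words = _normalize(raw_abstract).split()
        # count every n-gram of every length we might need to match
        grams = Counter(
            ' '.join(words[i:i + n])
            for n in term_lengths
            for i in range(len(words) - n + 1)
        )
        print(sum(grams[term] for term in normed_terms))

    On the question's two abstracts this prints 3 and 2, matching the regex version above: the first abstract contains "mental health", "anxiety", and "bipolar disorder"; the second contains "ptsd" and "mental health".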