Count the number of times a group of words appears in a text

I have four lists of words, each representing a category, and a text that has been tokenised into words.

animals = ["cat", "dog", "fish"]
colours = ["blue", "red", "green"]
food = ["pasta", "chips", "beef"]
sport = ["football", "basketball", "tennis"]

text = ["Once","upon","a","time",.......]

I would like to count how many times the words from these lists occur in a given text, summed per list rather than per word. For example, the result might show 10 animal words, 20 colour words, 6 food words and 13 sport words across the whole text.

The data I'm actually working with is quite large, so I need something that runs quickly.

Thanks for any help!


  • You could change your categories to a dict of set objects (which allows membership tests in O(1) time on average):

    categories = {'animals': {'cat', 'dog', 'fish'},
                  'colours': {'blue', 'green', 'red'},
                  'food': {'beef', 'chips', 'pasta'},
                  'sport': {'basketball', 'football', 'tennis'}}

    Then iterate over the words and perform membership tests for each category set:

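    def count_words(text, categories):
        """Return how many tokens in `text` fall into each category."""
        # Start every category at zero so categories with no matches still appear.
        counts = dict.fromkeys(categories, 0)
        for word in text:
            for cat_name, cat_words in categories.items():
                # `word in cat_words` is a bool; True adds 1, False adds 0.
                counts[cat_name] += word in cat_words
        return counts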


    In [19]: text = "Once upon a time there was a proper minimal reproducible example given by the OP without anybody having to ask for it".split()
    In [20]: count_words(text, categories)
    Out[20]: {'animals': 0, 'colours': 0, 'food': 0, 'sport': 0}
    In [21]: text = ("cat dog fish "*3).split()
    In [22]: count_words(text, categories)
    Out[22]: {'animals': 9, 'colours': 0, 'food': 0, 'sport': 0}
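
Since speed on large data was the concern, one possible variation on the answer above is to invert the categories into a single word-to-category dict, so each token needs one lookup instead of one membership test per category. This is only a sketch, not a benchmarked solution: the names `count_by_category` and `word_to_cat` are made up for illustration, and it assumes each word belongs to at most one category (with overlapping categories, a word would only be counted for the last one that lists it). Matching is case-sensitive, as in the original function.

```python
def count_by_category(tokens, categories):
    """Count tokens per category using a single dict lookup per token."""
    # Invert the mapping once: word -> category name.
    # Assumption: no word appears in more than one category.
    word_to_cat = {
        word: name
        for name, words in categories.items()
        for word in words
    }
    # Start every category at zero so empty categories still appear.
    counts = dict.fromkeys(categories, 0)
    for token in tokens:
        name = word_to_cat.get(token)  # None if the token is in no category
        if name is not None:
            counts[name] += 1
    return counts


categories = {
    'animals': {'cat', 'dog', 'fish'},
    'colours': {'blue', 'green', 'red'},
    'food': {'beef', 'chips', 'pasta'},
    'sport': {'basketball', 'football', 'tennis'},
}

tokens = "the red cat chased a blue fish past the tennis court".split()
result = count_by_category(tokens, categories)
print(result)
# {'animals': 2, 'colours': 2, 'food': 0, 'sport': 1}
```

With only four categories the difference from the nested loop may be small; the inverted dict is more likely to matter as the number of categories grows. Timing both on a sample of the real data would settle it.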