python, nlp, nltk, detection

Determining the most commonly used English word within a set of words


Forgive me if my wording is awful, but I'm trying to figure out how to determine the most used words in the English language from a set of words in a dictionary I've made. I've done some research on NLTK but can't seem to find a function within it (or any other library for that matter) that will help me do what I need to do.

For example: the sentence "I enjoy a cold glass of water on a hot day" would return "water", because of all the words in the sentence, "water" is the one used most in day-to-day conversation. Essentially, I need it to return whichever word from the input is most frequently used in conversation.

I figure I'll likely have to involve AI, but any time I've tried to use AI I wind up copying and pasting code I just don't understand, so I'm trying to avoid going that route.

Any and all help is welcome and appreciated.

For context, I decided to start a project that would essentially guess a predetermined word based on which characters the user says it has and doesn't have from the computer's guesses.


Solution

  • You need an external dataset for this task; you could try something like the Google Ngram dataset.

    Here is a breakdown of the problem:

    1. Input: "I enjoy a cold glass of water on a hot day". Output: "water".
    2. Split the sentence into a list of words.

    Example: ["I", "enjoy", "a", "cold", "glass", "of", "water", "on", "a", "hot", "day"]

    3. Loop through every word of the sentence. Say you are at the first word, "I".
    4. Look up "I" in the external dataset and find its frequency there. Say the word "I" appears 5,000,000 times in that dataset.
    5. Repeat this lookup for every word in the sentence.
    6. You now have a dictionary where each word of the sentence is a key and the value is that word's frequency from the external data. (The frequencies in the example below are made-up values, not exact counts.)
    {
        "I": 5000000,
        "enjoy": 50000,
        "a": 10000000,
        "cold": 30000,
        "glass": 100000,
        "of": 8000000,
        "water": 1200000,
        "on": 6000000,
        "hot": 700000,
        "day": 400000
    }
    
    7. Pick the word with the highest frequency; a sketch implementing these steps follows this list. One caveat: in the example frequencies above, the highest value actually belongs to "a", not "water". Function words like "a" and "of" dominate raw frequency counts, so you will probably want to filter out stopwords (NLTK ships a stopword list) before taking the maximum.
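    Here is a minimal sketch of those steps in Python. It uses NLTK's Brown corpus as a stand-in for the external dataset (Google Ngram counts would slot into word_freq the same way) and NLTK's stopword list for the filtering mentioned above; the exact output depends on the corpus you use.

    # Word frequencies from an external corpus (Brown, as an example stand-in).
    import nltk
    from nltk import FreqDist
    from nltk.corpus import brown, stopwords

    nltk.download("brown")      # one-time corpus downloads
    nltk.download("stopwords")

    # Step 4's "external dataset": frequency of every word in the corpus.
    word_freq = FreqDist(w.lower() for w in brown.words())
    stop = set(stopwords.words("english"))

    def most_common_in_conversation(sentence):
        # Step 2: split the sentence into a list of words.
        words = sentence.lower().split()
        # Steps 3-6: map each non-stopword to its corpus frequency.
        freqs = {w: word_freq[w] for w in words if w not in stop}
        # Step 7: pick the word with the highest frequency.
        return max(freqs, key=freqs.get)

    sentence = "I enjoy a cold glass of water on a hot day"
    print(most_common_in_conversation(sentence))
    # Prints the sentence's most conversationally common content word
    # ("water" or "day", depending on the corpus counts).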

    Note: You can use any big corpus as the external data; a large corpus will contain most of the English words used in conversation. And even if frequencies aren't provided with it, you can compute them yourself, as in the sketch below.
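    If your corpus is just raw text with no counts attached, here is a quick sketch of building the frequency table yourself (the file name corpus.txt is a placeholder for whatever large corpus you use):

    # Build the word-frequency table from a raw text corpus yourself.
    import re
    from collections import Counter

    def build_word_freq(path):
        with open(path, encoding="utf-8") as f:
            text = f.read().lower()
        # Crude tokenization: runs of letters/apostrophes count as words.
        tokens = re.findall(r"[a-z']+", text)
        return Counter(tokens)

    word_freq = build_word_freq("corpus.txt")   # placeholder path
    print(word_freq.most_common(10))            # the ten most frequent words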