
Building a Word Counter for Analysis


I'm trying to build a Python program similar to wordcounter.net (https://wordcounter.net/). I have an Excel file with one column containing the text to be analyzed. Using pandas and other functions, I created a single-word frequency counter.

Now I need to extend it to find multi-word patterns.

For example, suppose the text is "Happy face sad face mellow little baby sweet Happy face face mellow sad face mellow".

So here, it should be able to trace patterns such as:

Two-word density (pattern, count):

  • "Happy face": 2
  • "sad face": 2
  • "face mellow": 3
  • ....

Three-word density (pattern, count):

  • "Happy face sad": 1
  • "face sad face": 1
  • ....

I also tried:

for match in re.finditer(pattern, line):

But this requires specifying each pattern manually, and I want the program to find the patterns automatically.

Can anyone help with how to proceed?


Solution

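    text = 'Happy face sad face mellow little baby sweet Happy face face mellow sad face mellow'

    # Count how many times each single word occurs.
    d = {}
    for s in text.split():
        d[s] = d.get(s, 0) + 1
    # Group the words by their count: {count: [words with that count]}.
    out = {}
    for k, v in d.items():
        out.setdefault(v, []).append(k)
    # Print the groups, most frequent first. The number in each heading
    # is the occurrence count, not the number of words in a pattern.
    for i in sorted(out, reverse=True):
        print(f'{i} word density:')
        print(f'\t{out[i]}')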
    

    Output

    5 word density:
        ['face']
    3 word density:
        ['mellow']
    2 word density:
        ['Happy', 'sad']
    1 word density:
        ['little', 'baby', 'sweet']
    

    Edit 2

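    from collections import Counter


    def freq(lst, n):
        """Print how often each run of n consecutive words occurs in lst."""
        # Slide a window of n words across the list and join each window
        # into one string, e.g. "Happy face" for n == 2.
        lstn = [" ".join(lst[i:i + n]) for i in range(len(lst) - n + 1)]
        # Counter keeps first-seen order, so patterns print in text order.
        out = Counter(lstn)
        print(f'{n} word density:')
        for k, v in out.items():
            print(f'\t"{k}" {v}')


    text = 'Happy face sad face mellow little baby sweet Happy face face mellow sad face mellow'
    lst = text.split()

    freq(lst, 2)  # two-word patterns
    freq(lst, 3)  # three-word patterns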
    

    Output

    2 word density:
        "Happy face" 2
        "face sad" 1
        "sad face" 2
        "face mellow" 3
        "mellow little" 1
        "little baby" 1
        "baby sweet" 1
        "sweet Happy" 1
        "face face" 1
        "mellow sad" 1
    3 word density:
        "Happy face sad" 1
        "face sad face" 1
        "sad face mellow" 2
        "face mellow little" 1
        "mellow little baby" 1
        "little baby sweet" 1
        "baby sweet Happy" 1
        "sweet Happy face" 1
        "Happy face face" 1
        "face face mellow" 1
        "face mellow sad" 1
        "mellow sad face" 1