python pandas numpy counter countvectorizer

Counting word frequency in original file and mapping them


I'm trying to use a modified version of CountVectorizer that I fit on a series. For each cell, I then want the sum of the overall counts of the values in that cell. For example, this is the series I'm fitting the CountVectorizer on:

["dog cat mouse", " cat mouse", "mouse mouse cat"]

The end result should look something like:

[1+3+4, 3+4, 4+4+3]

I've tried using Counter, but it doesn't really work in this case. So far I've only managed to get a sparse matrix, but that only gives the total number of elements in each cell. What I want is to map each word's overall count back onto the entire series.


Solution

  • To keep the form shown in the question (e.g. 1+3+4), each item of the result list has to be stored as a string; that string can later be evaluated with eval() to get the number.

    Code:

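    lst = ["dog cat mouse", " cat mouse", "mouse mouse cat"]
    res = {}   # word -> number of occurrences across the whole list
    res2 = []  # one 'a+b+c' string per item in lst

    # split() with no argument skips the leading space in " cat mouse";
    # split(' ') would yield an empty string that gets counted as a word.
    for i in lst:
        for j in i.split():
            res[j] = res.get(j, 0) + 1

    for i in lst:
        res2.append('+'.join(str(res[j]) for j in i.split()))

    print(res2)

    The result (res2) is ['1+3+4', '3+4', '4+4+3'].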

    I think this is what you want...
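
If the numeric totals are wanted rather than the '1+3+4' strings, the same idea can be sketched with collections.Counter and sum(), which avoids eval() entirely. This is only a minimal sketch on the example list from the question; the names counts and totals are made up for illustration, and it assumes words are separated by whitespace.

```python
from collections import Counter

lst = ["dog cat mouse", " cat mouse", "mouse mouse cat"]

# Count every word across the whole list. split() with no argument
# ignores leading and repeated whitespace.
counts = Counter(word for text in lst for word in text.split())

# For each entry, add up the overall count of each of its words.
totals = [sum(counts[word] for word in text.split()) for text in lst]

print(totals)  # [8, 7, 11]
```

Here 8 is 1+3+4, 7 is 3+4 and 11 is 4+4+3, matching the expected result in the question.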