
Extract text count from a list of elements


I have a list containing text elements.

text = ['a for=apple','b for=ball', 'd for=dog', 'e for=elephant', 'a for=apple', 'd for=dog', '1.=one', '2.=two'] 

I need to count how often each piece of text before "=" occurs. I used CountVectorizer with a token pattern, as below, but it does not give the expected results:

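from sklearn.feature_extraction.text import CountVectorizer

print(text)
# The token pattern keeps everything before the first "=" as one token
vectorizer = CountVectorizer(token_pattern="^[^=]+")
vectorizer.fit(text)
print(vectorizer.vocabulary_)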

This gives the output below:

{'a for': 2, 'b for': 3, 'd for': 4, 'e for': 5, '1.': 0, '2.': 1}

But the expected output is

{'a for': 2, 'b for': 1, 'd for': 2, 'e for': 1, '1.': 1, '2.': 1}

I also need to remove the "." from "1." so that my output would be

 {'a for': 2, 'b for': 1, 'd for': 2, 'e for': 1, '1': 1, '2': 1}

Is there any way I can do that?


Solution

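  • vocabulary_ maps each token to its feature index rather than to its count, which is why the numbers do not match. An easy way to get the counts is collections.Counter():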

    >>> from collections import Counter
    >>> text = ['a for=apple','b for=ball', 'd for=dog', 'e for=elephant', 'a for=apple', 'd for=dog', '1.=one', '2.=two']
    >>> Counter(x.split('=')[0].replace('.', '') for x in text)
    Counter({'a for': 2, 'd for': 2, 'b for': 1, 'e for': 1, '1': 1, '2': 1})
    

    This first splits each string in text on "=" into a list and takes the first element. Then replace() is called to replace any instances of "." with "". Finally, Counter() returns an object holding the counts.
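To make the three steps easier to follow, here is a rough sketch that unrolls the one-liner into separate statements. The intermediate names prefixes, keys and counts are only illustrative and do not appear in the original answer.

```python
from collections import Counter

text = ['a for=apple', 'b for=ball', 'd for=dog', 'e for=elephant',
        'a for=apple', 'd for=dog', '1.=one', '2.=two']

# Step 1: keep only the part before the first "="
prefixes = [x.split('=')[0] for x in text]

# Step 2: drop any "." so that "1." becomes "1"
keys = [p.replace('.', '') for p in prefixes]

# Step 3: count how often each key occurs
counts = Counter(keys)
print(counts)
```

The generator expression in the one-liner does steps 1 and 2 in a single pass without building the intermediate lists, so it is the better choice once the logic is clear.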

    Note: If you want a plain dictionary at the end, you can wrap the last line in dict().
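As a small sketch of that note, wrapping the Counter in dict() could look like this. The variable name counts is just a placeholder chosen for the example.

```python
from collections import Counter

text = ['a for=apple', 'b for=ball', 'd for=dog', 'e for=elephant',
        'a for=apple', 'd for=dog', '1.=one', '2.=two']

# dict() turns the Counter into a plain dictionary; keys stay in
# first-seen order (Python 3.7+)
counts = dict(Counter(x.split('=')[0].replace('.', '') for x in text))
print(counts)
# {'a for': 2, 'b for': 1, 'd for': 2, 'e for': 1, '1': 1, '2': 1}
```

Unlike the Counter repr, which sorts entries by count, the plain dict keeps the order in which each key was first seen, so it matches the layout asked for in the question.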