I have a list containing text elements.
text = ['a for=apple','b for=ball', 'd for=dog', 'e for=elephant', 'a for=apple', 'd for=dog', '1.=one', '2.=two']
I need to count the text that appears before "=" in each element. I tried CountVectorizer with a token pattern, as below, but it does not give the expected result:
from sklearn.feature_extraction.text import CountVectorizer

print(text)
vectorizer = CountVectorizer(token_pattern="^[^=]+")
vectorizer.fit(text)
print(vectorizer.vocabulary_)
This gives the following output:
{'a for': 2, 'b for': 3, 'd for': 4, 'e for': 5, '1.': 0, '2.': 1}
But the expected output is:
{'a for': 2, 'b for': 1, 'd for': 2, 'e for': 1, '1.': 1, '2.': 1}
I also need to remove the "." from "1." and "2." so that the output becomes:
{'a for': 2, 'b for': 1, 'd for': 2, 'e for': 1, '1': 1, '2': 1}
Is there any way I can do that?
An easy way would be to use collections.Counter():
>>> from collections import Counter
>>> text = ['a for=apple','b for=ball', 'd for=dog', 'e for=elephant', 'a for=apple', 'd for=dog', '1.=one', '2.=two']
>>> Counter(x.split('=')[0].replace('.', '') for x in text)
Counter({'a for': 2, 'd for': 2, 'b for': 1, 'e for': 1, '1': 1, '2': 1})
This first splits each string in text on "=" into a list and takes the first element of that list. Then replace() is called to replace any instances of "." with "". Finally, it returns a Counter() object holding the counts.
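To make those steps easier to follow, here is a rough sketch that unrolls the one-liner into an explicit loop. The variable names `counts`, `item`, and `prefix` are only illustrative and are not part of the original snippet.

```python
from collections import Counter

text = ['a for=apple', 'b for=ball', 'd for=dog', 'e for=elephant',
        'a for=apple', 'd for=dog', '1.=one', '2.=two']

counts = Counter()
for item in text:
    prefix = item.split('=')[0]        # keep the part before "=", e.g. 'a for'
    prefix = prefix.replace('.', '')   # drop any ".", so '1.' becomes '1'
    counts[prefix] += 1                # missing keys start at 0 in a Counter

print(counts)
```

This should produce the same counts as the generator-expression version; the loop form is just more convenient if you need to add further cleanup per element.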
Note: If you want a plain dictionary at the end, you can wrap the last line in dict().
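For example, a minimal sketch of the dict() variant, assuming Python 3.7+ so that the dictionary keeps the order in which keys were first seen (the name `result` is just for illustration):

```python
from collections import Counter

text = ['a for=apple', 'b for=ball', 'd for=dog', 'e for=elephant',
        'a for=apple', 'd for=dog', '1.=one', '2.=two']

# Counter is already a dict subclass; dict() just converts it to a plain dict.
result = dict(Counter(x.split('=')[0].replace('.', '') for x in text))

print(result)
# {'a for': 2, 'b for': 1, 'd for': 2, 'e for': 1, '1': 1, '2': 1}
```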