Suppose I have a pandas series like this:
0 "sun moon earth moon"
1 "sun saturn mercury saturn"
2 "sun earth mars"
3 "sun earth saturn sun saturn"
I want to get the top 3 words with the highest row ("document") frequency irrespective of the frequency within a single row ("document").
For overall frequency I can just collect all the words from all rows into a single string, split it, convert back to a series and use value_counts, as sketched after the list below. In that case, the top 3 frequencies would be:
1. sun: 5
2. saturn: 4
3. earth: 3
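For reference, a rough sketch of that approach (assuming the series is constructed as shown here) might be:

import pandas as pd

s = pd.Series(["sun moon earth moon",
               "sun saturn mercury saturn",
               "sun earth mars",
               "sun earth saturn sun saturn"])

# join all rows into one string, split into words, count occurrences
print(pd.Series(" ".join(s).split()).value_counts().head(3))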
But the document frequencies, i.e. the number of rows in which a word occurs, would be
1. sun: 4
2. earth: 3
3. saturn: 2
A way I can think of off the top of my head is to apply a lambda function to the series, splitting each string and making a set out of it, then chaining all the per-row sets into a single sequence, making a series out of that and then using value_counts. Is there a more efficient way of achieving the same thing?
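That idea, sketched with a list comprehension instead of a lambda (reusing the s defined above), would be something like:

# one pass over the rows; set() keeps each word at most once per row
print(pd.Series([w for row in s for w in set(row.split())]).value_counts().head(3))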
Because performance is important, use Counter:
from collections import Counter

# overall frequency: count every occurrence of every word
a = Counter([y for x in s for y in x.split()]).most_common(3)
print(a)
[('sun', 5), ('saturn', 4), ('earth', 3)]
# document frequency: set() counts each word at most once per row
b = Counter([y for x in s for y in set(x.split())]).most_common(3)
print(b)
[('sun', 4), ('earth', 3), ('saturn', 2)]
# convert the (word, count) pairs to a DataFrame
df1 = pd.DataFrame(a, columns=['val','count'])
#df2 = pd.DataFrame(b, columns=['val','count'])
print(df1)
      val  count
0     sun      5
1  saturn      4
2   earth      3
Pandas alternatives:
# split into a wide frame, stack to long format, count all words
a = s.str.split(expand=True).stack().value_counts().head(3)
print(a)
sun       5
saturn    4
earth     3
dtype: int64
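On newer pandas (0.25+), Series.explode offers what should be a leaner route than the wide expand=True frame; a sketch, not benchmarked:

a = s.str.split().explode().value_counts().head(3)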
b = (s.str.split(expand=True)
      .stack()
      .reset_index(name='val')
      .drop_duplicates(['val', 'level_0'])['val']  # keep each word once per row
      .value_counts()
      .head(3))
print(b)
sun       4
earth     3
saturn    2
Name: val, dtype: int64
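The same explode idea covers document frequency, again assuming pandas 0.25+ (a sketch):

b = (s.str.split()
      .apply(set)      # keep each word at most once per row
      .explode()
      .value_counts()
      .head(3))
print(b)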