Tags: python, pandas, dataframe, tokenize, word-frequency

Word Count Distribution Pandas Dataframe


I need to count how often each word occurs in a DataFrame column. Does anyone know how to fix the error below?

raw data:

word
apple pear
pear
best apple pear

desired output:

word    count
apple   2
pear    3
best    1

running this code:

rawData = pd.concat([rawData.groupby(rawData.word.str.split().str[0]).sum(),rawData.groupby(rawData.word.str.split().str[-1]).sum()]).reset_index()

getting this error:

ValueError: cannot insert keyword, already exists

Solution
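
Before the fix, a note on where the error probably comes from. This is a minimal sketch that assumes the sample data from the question, where the column is called word (the message in the question says keyword, so the real column presumably has that name). Grouping by a Series derived from the word column gives a result whose index is named word and which still contains a word column, so reset_index has nowhere to put the index:

```python
import pandas as pd

raw = pd.DataFrame({"word": ["apple pear", "pear", "best apple pear"]})

# Group by the first word of each row; the key Series is still named "word".
first_word = raw["word"].str.split().str[0]
grouped = raw.groupby(first_word).sum()

print(grouped.index.name)      # name of the index after grouping
print(list(grouped.columns))   # columns of the grouped result

# reset_index tries to turn the index into a column with the same name
# as a column that is already there.
try:
    grouped.reset_index()
    message = None
except ValueError as exc:
    message = str(exc)
print(message)
```

Even without the name clash, this approach only looks at the first and last word of each row, so it would miss words in the middle of longer strings.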

  • Use str.split to turn each string into a list of words, explode to put each word on its own row, and finally value_counts to count the occurrences of each word:

    out = df['word'].str.split().explode().value_counts()
    print(out)
    
    # Output:
    pear     3
    apple    2
    best     1
    Name: word, dtype: int64
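
For completeness, here is the same one-liner as a self-contained script. It is only a sketch that rebuilds the sample frame from the question; the name df is the one used in this answer.

```python
import pandas as pd

# Rebuild the sample data from the question.
df = pd.DataFrame({"word": ["apple pear", "pear", "best apple pear"]})

# Split on whitespace, put each word on its own row, then count.
out = df["word"].str.split().explode().value_counts()

print(out.to_dict())
```

Note that the printed Series shown in this answer comes from an older pandas. As far as I know, from pandas 2.0 on the result of value_counts is named count and its index carries the original name word, so the last line of the output reads Name: count instead.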
    

    Step by step:

    >>> df['word'].str.split()
    0          [apple, pear]
    1                 [pear]
    2    [best, apple, pear]
    Name: word, dtype: object
    
    >>> df['word'].str.split().explode()
    0    apple
    0     pear
    1     pear
    2     best
    2    apple
    2     pear
    Name: word, dtype: object
    
    >>> df['word'].str.split().explode().value_counts()
    pear     3
    apple    2
    best     1
    Name: word, dtype: int64
    

    Update

    To get exactly your expected outcome:

    >>> df['word'].str.split().explode().value_counts(sort=False) \
                  .rename('count').rename_axis('word').reset_index()
    
        word  count
    0  apple      2
    1   pear      3
    2   best      1
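
If the first-appearance row order matters and you would rather not depend on how value_counts(sort=False) orders its result in your pandas version, one alternative is to count with collections.Counter from the standard library instead. This is a different technique from the one above, shown only as a sketch, and it assumes the column contains no missing values (calling split on NaN would fail).

```python
from collections import Counter

import pandas as pd

df = pd.DataFrame({"word": ["apple pear", "pear", "best apple pear"]})

# Counter is a dict subclass, so on Python 3.7+ it keeps keys in the
# order in which they were first seen.
counts = Counter(word for text in df["word"] for word in text.split())

# One row per distinct word, in first-appearance order.
out = pd.DataFrame(list(counts.items()), columns=["word", "count"])
print(out)
```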
    

    Update 2

    To count the words per country:

    data = {'country': [' US', ' US', ' US', ' UK', ' UK', ' UK', ' UK'], 
            'word': ['best pear', 'apple', 'apple pear',
                     'apple', 'apple', 'pear', 'apple pear ']}
    df = pd.DataFrame(data)
    
    out = df.assign(word=df['word'].str.split()) \
            .explode('word').value_counts() \
            .rename('count').reset_index()
    print(out)
    
    # Output:
       country   word  count
    0       UK  apple      3
    1       UK   pear      2
    2       US  apple      2
    3       US   pear      2
    4       US   best      1
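
DataFrame.value_counts was only added in pandas 1.1, so on an older version the same counts can be obtained with groupby and size. The sketch below reuses the sample data above and additionally strips the leading spaces from the country values, which is my assumption about what you want. The rows come out sorted by country and word rather than by count.

```python
import pandas as pd

data = {"country": [" US", " US", " US", " UK", " UK", " UK", " UK"],
        "word": ["best pear", "apple", "apple pear",
                 "apple", "apple", "pear", "apple pear "]}
df = pd.DataFrame(data)

out = (
    df.assign(country=df["country"].str.strip(),  # drop the leading spaces
              word=df["word"].str.split())        # one list of words per row
      .explode("word")                            # one word per row
      .groupby(["country", "word"])
      .size()                                     # rows per (country, word) pair
      .reset_index(name="count")
)
print(out)
```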