Search code examples
pythonpandasdataframefunctiondata-manipulation

Is there a Python function for counting the number of strings in a cell and reporting these in a new dataframe?


Let's stay I have a grocery list with one column titled "Groceries". In each row there is a list of strings, for example.

Groceries
apples, bananas, oranges
apples, bananas, bananas, pears
oranges, pears, bananas

Is there a way to count each string and add a "tally" is a new dataframe or similar thing with the appropriately labeled item? The dataframe would then look like:

apples oranges bananas pears
1 1 1 0
1 0 2 1
1 1 0 1

I can't find a function that will recognize strings and count them in the appropriate row/column with the string name. I am also pretty new to Python and am not sure what would go into creating a function that would do this.


Solution

  • You can split the string on commas, explode to multiple rows, get_dummies to transform to 0/1, and groupby.sum to aggregate:

    out = (pd
     .get_dummies(df['Groceries'].str.split(',\s*').explode())
     .groupby(level=0).sum()
    )
    

    Or similar with crosstab:

    s = df['Groceries'].str.split(',\s*').explode()
    out = pd.crosstab(s.index, s)
    

    output:

       apples  bananas  oranges  pears
    0       1        1        1      0
    1       1        2        0      1
    2       0        1        1      1