I have the following data in multiple columns:
col1 col2 col3
123456 ['mary','ralph', ''bob'] ['bob','sam']
456789 ['george','fred', susie'] ['ralph','mary', 'bob']
789123 ['mary', bob'] ['bob']
I eventually need a value_counts on each column. To get everything out of the lists, I am trying explode. I can get the values into their columns post-explode, no problem. But, those of you who know about explode, know that my value_counts would then be inflated because of the repetition of values that explode causes when applying it to multiple columns
Explode yields this for example:
col1 col2 col3
123456 mary bob
123456 mary sam
123456 mary george
123456 ralph bob
123456 ralph sam
123456 ralph george...etc.
Obviously, this throws off an accurate value_counts per column which is what I need. I have tried looping the explode over each column and then after each column explode matching the first col and the exploded column and remove duplicates, doesn't work. Always enjoying not being the smartest guy in the room (more to learn) so I sent this question to you pandas guru exploding with ideas. (see what I did there?). Thanks.
Expected output so that I can value_counts all columns except the col1 would be this:
123456 mary bob
123456 ralph sam
123456 bob george
456789 george ralph
456789 fred mary
456789 susie bob
789123 mary bob
789123 bob george
I you want the value_counts of the elements inside the lists, first you need to flatten the column and then take the value_counts, for example:
import pandas as pd
from itertools import chain
df = pd.DataFrame(data=[
[123456, ['mary', 'ralph', 'bob'], ['bob', 'sam', 'george']],
[456789, ['george', 'fred', 'susie'], ['ralph', 'mary', 'bob']],
[789123, ['mary', 'bob'], ['bob', 'george']]
], columns=['col1', 'col2', 'col3'])
print(pd.Series(chain.from_iterable(df['col2'])).value_counts())
Output
mary 2
bob 2
susie 1
george 1
fred 1
ralph 1
dtype: int64
The above result is the value_counts for col2
of your example.