Search code examples
pythonpandasexplode

pandas explode avoid duplication of values


I have the following data in multiple columns:

    col1             col2                       col3
123456     ['mary','ralph', ''bob']     ['bob','sam']
456789     ['george','fred', susie']  ['ralph','mary', 'bob']
789123     ['mary', bob']             ['bob']

I eventually need a value_counts on each column. To get everything out of the lists, I am trying explode. I can get the values into their columns post-explode, no problem. But, those of you who know about explode, know that my value_counts would then be inflated because of the repetition of values that explode causes when applying it to multiple columns

Explode yields this for example:

  col1     col2     col3
123456     mary     bob
123456     mary     sam     
123456     mary     george
123456     ralph    bob
123456     ralph    sam     
123456     ralph    george...etc.

Obviously, this throws off an accurate value_counts per column which is what I need. I have tried looping the explode over each column and then after each column explode matching the first col and the exploded column and remove duplicates, doesn't work. Always enjoying not being the smartest guy in the room (more to learn) so I sent this question to you pandas guru exploding with ideas. (see what I did there?). Thanks.

Expected output so that I can value_counts all columns except the col1 would be this:

123456    mary     bob
123456   ralph     sam
123456     bob  george
456789  george   ralph
456789    fred    mary
456789   susie     bob
789123    mary     bob
789123     bob  george

Solution

  • I you want the value_counts of the elements inside the lists, first you need to flatten the column and then take the value_counts, for example:

    import pandas as pd
    from itertools import chain
    
    df = pd.DataFrame(data=[
        [123456, ['mary', 'ralph', 'bob'], ['bob', 'sam', 'george']],
        [456789, ['george', 'fred', 'susie'], ['ralph', 'mary', 'bob']],
        [789123, ['mary', 'bob'], ['bob', 'george']]
    ], columns=['col1', 'col2', 'col3'])
    
    print(pd.Series(chain.from_iterable(df['col2'])).value_counts())
    

    Output

    mary      2
    bob       2
    susie     1
    george    1
    fred      1
    ralph     1
    dtype: int64
    

    The above result is the value_counts for col2 of your example.