Search code examples
pythonpandasfor-loopseriesmulti-index

using pandas find out repeated string elements from multi-index series dataframe


In Pandas dataframe, one of the column am having is series datatype i.e food_column, from this column I have to extract the output column

   Input : food_column
 [ 'bread','bread','bread'] ,
 [ 'meat','butter','butter'] ,
 [ 'meat', 'butter','bread','meat']
        ['butter']
 ['bread','meat','bread','meat']

Output : main_column
       ['bread']
       ['butter']
        ['meat']
       ['butter']
       ['bread']

condition:

  1. if any string element is repeated more than once that should be picked as an output element,
  2. if any two or three elements count are same it should be picked as np.random.choice from that two or three elements
  3. if only one element is there in any of the rows, assign/map that element to the output column
  4. else mark it as "unknown" to output column

Solution

  • import pandas as pd
    import random
    from collections import Counter
    import numpy as np
    
    food_list = [[ 'bread','bread','bread'] ,
     ['meat','butter','butter'] ,
     ['meat', 'butter','bread','meat'],
     ['butter'],
     ['bread','meat','bread','meat'],
     ['']]
    
    food_series = pd.Series(food_list)
    
    df = pd.DataFrame({'food_column': food_series})
    
    # randomize list item order, since dict item order is constant in Python 3.6+ 
    df['random_food_list'] = [random.sample(z, len(z)) for z in df['food_column'].to_list()]
    # get counts
    df['food_counts'] = df['random_food_list'].apply(lambda x: Counter(x))
    # get key with max value
    df['main_column'] = df['food_counts'].apply(lambda x:  max(x, key=x.get))
    # replace empty strings with 'unknown'
    df['main_column'] = np.where(df['main_column'] == '', 'unknown', df['main_column'])