Search code examples
pythonpandasdataframefunctionapply

Find shapes of dataframes inside lambda functions


I have the following dataframe with pairs of strings in tuples

d = {'value': [['Red', 'Blue'],
               ['Blue', 'Yellow'],
               ['Blue', 'Yellow'],
               ['Yellow', 'Orange'],
               ['Green', 'Purple'],
               ['Purple', 'Yellow'],
               ['Yellow', 'Red'],
               ['Violet', 'Blue'],
               ['Blue', 'Green'],
               ['Green', 'Red'],
               ['Red', 'Brown'],
               ['Blue', 'Green']]}

df = pd.DataFrame(data = d)

And I want to find for each row probability, which can be calculated based on number of rows with same values

def find_prob(df, tup):
    
    d = df[df.new.apply(lambda x: x[0] == tup[0] and x[1] == tup[1])].shape[0]
    p = df[df.new.apply(lambda x: x[0] == tup[0])].shape[0]
    
    return d / p

df['probs'] = df.new.apply(lambda x: find_prob(df, x))

I know it's dumb to pass DataFrame in apply function so I want to known if there's a way to improve this logic

Desired output is:

enter image description here

P.S. I want to divide number of rows on number of rows, that start with first value of a tuple


Solution

  • You can use groupby().transform('size') to count each of the types:

    tuple_counts = df.groupby(df['value'].apply(tuple))['value'].transform('size')
    first_counts = df.groupby(df['value'].str[0])['value'].transform('size')
    df['prob'] = tuple_counts/first_counts
    

    Output:

                   value  prob
    0        [Red, Blue]   0.5
    1     [Blue, Yellow]   0.5
    2     [Blue, Yellow]   0.5
    3   [Yellow, Orange]   0.5
    4    [Green, Purple]   0.5
    5   [Purple, Yellow]   1.0
    6      [Yellow, Red]   0.5
    7     [Violet, Blue]   1.0
    8      [Blue, Green]   0.5
    9       [Green, Red]   0.5
    10      [Red, Brown]   0.5
    11     [Blue, Green]   0.5
    
    ​