Search code examples
pythonpandasdataframefeature-extractionfeature-engineering

complicated list column to column string matching and deriving another column


Dataframes:

df1:
ind_lst
[agriculture_dairy, analytics]
[architecture_planning, advertising_pr_events, analytics]

df2:
ind    score
advertising_pr_events   3.672947168
agriculture_dairy   3.368266582
airlines_aviation_aerospace 3.60798955
analytics   3.67752552
architecture_planning   3.853236675

I have 2 data frames as shown above.

Column ind of df2 is of type string & has industry names. df2.score has a unique score for each industry.

Column ind_lst of df1 is a list of industry names formed by different combinations of unique elements of df2.ind

I want to create a column in df1 avg_score such that the final output is as below:

avg_score in row1 of df1 is avg of the scores from df2 corresponding to agriculture_dairy, analytics

avg_score in row2 of df1 is avg of scores from df2 corresponding to architecture_planning, advertising_pr_events, analytics.

df1
ind_lst    avg_score
[agriculture_dairy, analytics]    3.522896051
[architecture_planning, advertising_pr_events, analytics]    3.734569787666667

How to implement this in python? I tried this:

avg=[]
avgs=[]
for i in df1.ind_lst:
    for j in i:
        a= df2[df2['ind'] == j].index
        avg.append(df2.loc[a]['score'])
    avgs.append(mean(avg))
    

df1['avg_score']= avgs

the code doesn't work well. I get an empty list as output.

NOTE: I have sampled the data frames & shared only a subset of the problem statement. the real dataframes df1 has 90k entries & df2 has 860 rows.


Solution

  • Option 1: Create a map of ind and score from df2. Use list comprehension to map the score to the industry and calculate mean.

    mapper = df2.set_index('ind')['score'].to_dict()
    df1['avg_score'] = df1['ind_lst'].apply(lambda x: np.mean([mapper[i] for i in x]))
    
        ind_lst                                             avg_score
    0   [agriculture_dairy, analytics]                      3.522896
    1   [architecture_planning, advertising_pr_events,...   3.734570
    

    Option 2: Use DataFrame.explode to unnest the list, groupby index and apply mean

    df1['avg_score'] = df1['ind_lst'].explode().map(mapper).groupby(level = 0).mean()