Search code examples
pythonpandasdictionarycategories

Extract value associated with column name on non-zero rows


I have two dfs(500x100 & 1300x2) and want to create a new column in the first one with which categories that occur on each row. To achieve this I need to fetch the category associated with the column name from second df. There might be several categories on same row.

df = pd.DataFrame({'apple': [0, 0, 1, 0], 
'strawberries': [0, 1, 1, 0], 
'cucumber': [1, 1, 0, 0], 
'hawthorn': [0, 1, 0, 1]
})

df2 = pd.DataFrame({'storage': ['apple', 'strawberries', 'cucumber', 'hawthorn'],
'category': ['fruits', 'berries', 'vegetables', 'berries']
})

I've found two potential solutions which both aims to fetch value from dict when value of row is != 0:

df2_dict = dict(zip(df2['storage'], df2['category']))
df['categories'] = pd.Series(df.columns[np.where(df!=0)[1]]).map(df2_dict)
|
df['categories'] = df.apply(lambda s: ', '.join(s.index[s.eq(1)]), axis = 1).map(df2_dict)

These works to some extent but for some reason only give me results on about 1/10 of the rows. Desired output would be:

df = pd.DataFrame({'apple': [0, 0, 1, 0], 
'strawberries': [0, 1, 1, 0], 
'cucumber': [1, 1, 0, 0], 
'hawthorn': [0, 1, 0, 1],
'categories': ['vegetables', 'berries, vegetables, berries',
'fruits, berries', 'berries' ]})

As of now column names are keys in dict. FYI the columns are dummies so only 0|1 in them.

Appreciate any smart solutions to this. xoxo


Solution

  • there might be easier ways of doing this but this works i think :)

    df = pd.DataFrame({'apple': [0, 0, 1, 0], 
    'strawberries': [0, 1, 1, 0], 
    'cucumber': [1, 1, 0, 0], 
    'hawthorn': [0, 1, 0, 1]})
    
    df2 = pd.DataFrame({'storage': ['apple', 'strawberries', 'cucumber', 'hawthorn'],
    'category': ['fruits', 'berries', 'vegetables', 'berries']})
    
    def cateogory (row):
        result = []
        for column in list(df.columns) :
            if row[column] == 1 :
                result.append (df2.loc[df2['storage'] == column]["category"])
        return [item for sublist in result for item in sublist]
    
    df['category']  = df.apply(lambda row :  cateogory(row) , axis=1 )
    

    Result :

       apple  strawberries  cucumber  hawthorn                        category
    0      0             0         1         0                    [vegetables]
    1      0             1         1         1  [berries, vegetables, berries]
    2      1             1         0         0               [fruits, berries]
    3      0             0         0         1                       [berries]
    
    

    btw edited your example, there were some mistakes in it

    Edit : corrected i think !