Search code examples
pythonpandasdataframenumpypandas-groupby

GroupBy in a python-Pandas dataframe with score corresponding to occurence in columns


I have a dataframe of my city that contains the people's evaluation of neighbourhood problems and looks something like that:

personID hood problems                               problemTop1 problemTop2 problemTop3 other columns
0001     A    ['cars', 'air', 'trash', 'parks', ...] 'trash'     'cars'      'air'       ...
0002     D    ['cars', 'air', 'trash', 'parks', ...] 'cars'      'air'       'trash'     ...
0003     A    ['cars', 'air', 'trash', 'parks', ...] 'cars'      'parks'     'air'       ...
0004     M    ['cars', 'air', 'trash', 'parks', ...] 'parks'     'cars'      'air'       ...
...

Here is a short version of the dataframe with only the important columns. I would like to group the dataframe by neighbourhoods (hood) and give each of the possible problems a score. The score should be a weighted sum of how problematic the people perceived the problem and as an example should look like that - if in neighbourhood E 10 people perceived trash as problemTop1, 7 people as problemTop2 and 3 as problemTop3, then

E_score_trash = 10*3 + 7*2 + 3*1 = 47

My final output should show for each neighbourhood each problem's score, like this:

>>> df_scores_by_hood
hood 'air' 'cars' 'parks' 'trash'
A     386   871    321     984
B     436   719    504     620
C     383   230    848     601
D     881   524    231     783
...

Is there an easy way to do this with pandas, numpy or other functions? Tnx


Solution

  • I found a solution, but I am not happy with it - it's not really pandasic, it's not vectorized (it does loops) and I have the feeling there should be something simpler for my problem. Do you know how to improve it?

    # list of all the problems
    problems = ['cars', 'air', 'trash',...]
    
    for problem in problems:
        df[problem] = 0
        df[problem] = df.apply(lambda row: 3 if row['problemTop1']==problem
                               else 2 if row['problemTop2']==problem
                               else 1 if row['problemTop3']==problem
                               else 0, axis=1)