Search code examples
pythonpandasdataframefeature-engineering

How to filter a column by greater than considering an index


I have a data frame representing the customers ratings of restaurants. star_rating is rating of the customer in this data frame.

  • What i want to do is to add a column nb_fave_rating in the same data frame that represents the total number of favorable reviews for a restaurant. I consider a "favorable" opinion if its number of stars is > = 3.
data = {'rating_id': ['1', '2','3','4','5','6','7','8','9'],
        'user_id': ['56', '13','56','99','99','13','12','88','45'],
        'restaurant_id':  ['xxx', 'xxx','yyy','yyy','xxx','zzz','zzz','eee','eee'],
        'star_rating': ['2.3', '3.7','1.2','5.0','1.0','3.2','1.0','2.2','0.2'],
        'rating_year': ['2012','2012','2020','2001','2020','2015','2000','2003','2004'],
        'first_year': ['2012', '2012','2001','2001','2012','2000','2000','2001','2001'],
        'last_year': ['2020', '2020','2020','2020','2020','2015','2015','2020','2020'],
        }


df = pd.DataFrame (data, columns = ['rating_id','user_id','restaurant_id','star_rating','rating_year','first_year','last_year'])
df['star_rating'] = df['star_rating'].astype(float)

positive_reviews = df[df.star_rating >= 3.0 ].groupby('restaurant_id')
positive_reviews.head()

From here, i dont know to to count the number of positive reviews for a restaurant and add it to the new column of my initial dataframe df.

The output expected would be something like this.

data = {'rating_id': ['1', '2','3','4','5','6','7','8','9'],
        'user_id': ['56', '13','56','99','99','13','12','88','45'],
        'restaurant_id':  ['xxx', 'xxx','yyy','yyy','xxx','zzz','zzz','eee','eee'],
        'star_rating': ['2.3', '3.7','1.2','5.0','1.0','3.2','1.0','2.2','0.2'],
        'rating_year': ['2012','2012','2020','2001','2020','2015','2000','2003','2004'],
        'first_year': ['2012', '2012','2001','2001','2012','2000','2000','2001','2001'],
        'last_year': ['2020', '2020','2020','2020','2020','2015','2015','2020','2020'],
        'nb_fave_rating': ['1', '1','1','1','1','1','1','0','0'],
        }

So i tried this and had a bunch of NaNs

df['nb_fave_rating']=df[df.star_rating >= 3.0 ].groupby('restaurant_id').agg({'star_rating': 'count'})
df.head()

enter image description here


Solution

  • Here is a potential solution with groupby and map:

    #filtering the data with >=3 ratings 
    filtered_data = df[df['star_rating'] >= 3]
    
    #creating a dict containing the counts of the all the favorable reviews
    d = filtered_data.groupby('restaurant_id')['star_rating'].count().to_dict()
    
    #mapping the dictionary to the restaurant_id to generate 'nb_fave_rating'
    df['nb_fave_rating'] = df['restaurant_id'].map(d)
    
    #taking care of `NaN` values 
    df.fillna(0,inplace=True)
    
    #making the column integer (just to match the requirements)
    df['nb_fave_rating'] = df['nb_fave_rating'].astype(int)
    
    print(df)
    

    Output:

      rating_id user_id restaurant_id  star_rating rating_year first_year last_year  nb_fave_rating
    0         1      56           xxx          2.3        2012       2012      2020               1
    1         2      13           xxx          3.7        2012       2012      2020               1
    2         3      56           yyy          1.2        2020       2001      2020               1
    3         4      99           yyy          5.0        2001       2001      2020               1
    4         5      99           xxx          1.0        2020       2012      2020               1
    5         6      13           zzz          3.2        2015       2000      2015               1
    6         7      12           zzz          1.0        2000       2000      2015               1
    7         8      88           eee          2.2        2003       2001      2020               0
    8         9      45           eee          0.2        2004       2001      2020               0