lets say i have 2 csv files (very large files),
restaurant_id
, name
,star_rating
,city
,zone
,closed
restaurant_id
and category
So, what i want to do is basically add a column called zone_categories_intersection
to my features that tells me the number of restaurants in the same area (zone) that share at least one category with the restaurant in question.
Since it's the first time i use the pandas librairy, i have a little trouble getting fluent when manipulating tables. I did something like this to figure out the number of restaurants in the area associated with the restaurant in question and add it to my features column.
restaurants['nb_restaurants_zone'] = restaurants.groupby('zone')['zone'].transform('size')
restaurants.head()
features = restaurants[['restaurant_id', 'moyenne_etoiles', 'ville', 'zone', 'ferme', 'nb_restaurants_zone']].copy()
features.head()
#edit
merged = restaurants.merge(categories, on='restaurant_id')
merged.head()
I thought about adding the category.csv
file and merge it with restaurant and map the categorys with the corresponding id's and then figure out a way to apply the second condition (that share at least one category with the restaurant in question)... but i dont really know how to do any of those things
Thank you
Try this
# sample data
# (it's not exactly your provided data
# but it is better to show how the code works)
# please always provide a callable line of code
# you could get it with `df.head().to_dict('split')`
rest = pd.DataFrame({
'id': [1, 2, 3, 4, 5],
'name': ['Denny\'s', 'Ike\'s Love & Sandwiches', 'Midori Japanese',
'Pho U', 'John & Sons Oysters'],
'avg_stars': [2.5, 4, 3.5, 3.5, 4],
'city': ['Las Vegas', 'Phoenix', 'Calgary', 'Toronto', 'Toronto'],
'zone': ['a', 'a', 'b', 'b', 'a']
})
cats = pd.DataFrame([
[1, ['Breakfast', 'Dinners', 'American']],
[2, ['Sandwiches', 'American']],
[3, ['Japanese']],
[4, ['Japanese']],
[5, ['American', 'Seafood']]
], columns=['id', 'category']).explode('category')
The code
# add zone to categories dataframe
cats2 = cats.merge(rest[['id', 'zone']], on='id')
# add count for zone & category
cats2['zone_cat_count'] = (
cats2.groupby(['zone', 'category'])
.transform('count')
)
# merge with rest dataframe
rest = rest.merge(
cats2.groupby('id')['zone_cat_count'].max()
, on='id'
)
Output
id name avg_stars city zone zone_cat_count
0 1 Denny's 2.5 Las Vegas a 3
1 2 Ike's Love & Sandwiches 4.0 Phoenix a 3
2 3 Midori Japanese 3.5 Calgary b 2
3 4 Pho U 3.5 Toronto b 2
4 5 John & Sons Oysters 4.0 Toronto a 3