python pandas feature-extraction feature-engineering

Pandas: how to add column representing the intersection of 2 attributes in a Dataframe

lets say i have 2 csv files (very large files),

the first file represents restaurants and have 6 attributes restaurant_id, name,star_rating,city,zone,closed

the second file represents the categories of the restaurants and have 2 attributes restaurant_id and category

So, what i want to do is basically add a column called zone_categories_intersection to my features that tells me the number of restaurants in the same area (zone) that share at least one category with the restaurant in question.

Since it's the first time i use the pandas librairy, i have a little trouble getting fluent when manipulating tables. I did something like this to figure out the number of restaurants in the area associated with the restaurant in question and add it to my features column.


restaurants['nb_restaurants_zone'] = restaurants.groupby('zone')['zone'].transform('size')
restaurants.head()

features = restaurants[['restaurant_id', 'moyenne_etoiles', 'ville', 'zone', 'ferme', 'nb_restaurants_zone']].copy()
features.head()

#edit
merged = restaurants.merge(categories, on='restaurant_id')
merged.head()

I thought about adding the category.csv file and merge it with restaurant and map the categorys with the corresponding id's and then figure out a way to apply the second condition (that share at least one category with the restaurant in question)... but i dont really know how to do any of those things

Thank you

Solution

Try this

# sample data
# (it's not exactly your provided data
# but it is better to show how the code works)
# please always provide a callable line of code
# you could get it with `df.head().to_dict('split')`
rest = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'name': ['Denny\'s', 'Ike\'s Love & Sandwiches', 'Midori Japanese',
        'Pho U', 'John & Sons Oysters'],
    'avg_stars': [2.5, 4, 3.5, 3.5, 4],
    'city': ['Las Vegas', 'Phoenix', 'Calgary', 'Toronto', 'Toronto'],
    'zone': ['a', 'a', 'b', 'b', 'a']
})
cats = pd.DataFrame([
    [1, ['Breakfast', 'Dinners', 'American']],
    [2, ['Sandwiches', 'American']],
    [3, ['Japanese']],
    [4, ['Japanese']],
    [5, ['American', 'Seafood']]
], columns=['id', 'category']).explode('category')

The code

# add zone to categories dataframe
cats2 = cats.merge(rest[['id', 'zone']], on='id')

# add count for zone & category
cats2['zone_cat_count'] = (
    cats2.groupby(['zone', 'category'])
    .transform('count')
)

# merge with rest dataframe
rest = rest.merge(
    cats2.groupby('id')['zone_cat_count'].max()
    , on='id'
)

Output

   id                     name  avg_stars       city zone  zone_cat_count
0   1                  Denny's        2.5  Las Vegas    a               3
1   2  Ike's Love & Sandwiches        4.0    Phoenix    a               3
2   3          Midori Japanese        3.5    Calgary    b               2
3   4                    Pho U        3.5    Toronto    b               2
4   5      John & Sons Oysters        4.0    Toronto    a               3