Search code examples
pythonpandasfeature-extractionfeature-engineering

Pandas: how to add column representing the intersection of 2 attributes in a Dataframe


lets say i have 2 csv files (very large files),

  • the first file represents restaurants and have 6 attributes restaurant_id, name,star_rating,city,zone,closed

enter image description here

  • the second file represents the categories of the restaurants and have 2 attributes restaurant_id and category enter image description here

So, what i want to do is basically add a column called zone_categories_intersection to my features that tells me the number of restaurants in the same area (zone) that share at least one category with the restaurant in question.

Since it's the first time i use the pandas librairy, i have a little trouble getting fluent when manipulating tables. I did something like this to figure out the number of restaurants in the area associated with the restaurant in question and add it to my features column.


restaurants['nb_restaurants_zone'] = restaurants.groupby('zone')['zone'].transform('size')
restaurants.head()

features = restaurants[['restaurant_id', 'moyenne_etoiles', 'ville', 'zone', 'ferme', 'nb_restaurants_zone']].copy()
features.head()

#edit
merged = restaurants.merge(categories, on='restaurant_id')
merged.head()

enter image description here

I thought about adding the category.csv file and merge it with restaurant and map the categorys with the corresponding id's and then figure out a way to apply the second condition (that share at least one category with the restaurant in question)... but i dont really know how to do any of those things

Thank you


Solution

  • Try this

    # sample data
    # (it's not exactly your provided data
    # but it is better to show how the code works)
    # please always provide a callable line of code
    # you could get it with `df.head().to_dict('split')`
    rest = pd.DataFrame({
        'id': [1, 2, 3, 4, 5],
        'name': ['Denny\'s', 'Ike\'s Love & Sandwiches', 'Midori Japanese',
            'Pho U', 'John & Sons Oysters'],
        'avg_stars': [2.5, 4, 3.5, 3.5, 4],
        'city': ['Las Vegas', 'Phoenix', 'Calgary', 'Toronto', 'Toronto'],
        'zone': ['a', 'a', 'b', 'b', 'a']
    })
    cats = pd.DataFrame([
        [1, ['Breakfast', 'Dinners', 'American']],
        [2, ['Sandwiches', 'American']],
        [3, ['Japanese']],
        [4, ['Japanese']],
        [5, ['American', 'Seafood']]
    ], columns=['id', 'category']).explode('category')
    

    The code

    # add zone to categories dataframe
    cats2 = cats.merge(rest[['id', 'zone']], on='id')
    
    # add count for zone & category
    cats2['zone_cat_count'] = (
        cats2.groupby(['zone', 'category'])
        .transform('count')
    )
    
    # merge with rest dataframe
    rest = rest.merge(
        cats2.groupby('id')['zone_cat_count'].max()
        , on='id'
    )
    

    Output

       id                     name  avg_stars       city zone  zone_cat_count
    0   1                  Denny's        2.5  Las Vegas    a               3
    1   2  Ike's Love & Sandwiches        4.0    Phoenix    a               3
    2   3          Midori Japanese        3.5    Calgary    b               2
    3   4                    Pho U        3.5    Toronto    b               2
    4   5      John & Sons Oysters        4.0    Toronto    a               3