Search code examples
pandasdataframegraphnetworkx

find the largest count of ingredients


I have two df. In the first df, all the recipes, in the second, the ingredients of the recipes

Recipes

       RecipeId     Name
0      789         Pizza
1      790         Soup
2      791         Sushi
3      792         Cake
4      793        Sandwich
5      794         Bread
6      795        Sausage

Ingredients

     id  RecipeId         Name
0     1       789        Flour
1     2       789        Water
2   795       789      Sausage
3     3       789       Olives
4     4       789         Salt
5   794       793        Bread
6   795       793      Sausage
7     1       794        Flour
8     2       794        Water
9     4       794         Salt
10    5       795  Ground meat
11    4       795         Salt

1.How can I find the recipe with the most ingredients, including all ingredients, such as a sandwich, which has bread (flour, water, salt) and sausages (ground meat and salt) = 5 ingredients? 2.How can I find the recipe with the most nested recipes (not single ingredients) 3.And the ingredient most often used in recipes?


Solution

  • You cannot solve this easily with pandas only. This is a graph problem.

    Here is your (directed) graph:

    graph

    You can use network to build the graph from your DataFrame, the count the number of terminal descendants per node:

    import networkx as nx
    
    df = (recipes.merge(ingredients, on='RecipeId',
                        suffixes=('_recipe', '_ingredient'))
          )
    
    
    G = nx.from_pandas_edgelist(df, source='Name_recipe', target='Name_ingredient',
                                create_using=nx.DiGraph)
    
    counts = {n: sum(1 for x in nx.descendants(G, n) if G.out_degree(x)==0)
              for n in G.nodes}
    

    output:

    {'Pizza': 5,
     'Flour': 0,
     'Water': 0,
     'Sausage': 2,
     'Olives': 0,
     'Salt': 0,
     'Sandwich': 4,
     'Bread': 3,
     'Ground meat': 0}
    

    left merge

    If you want to count soup/sushi/cake you can use a left merge and skip the NaN:

    enter image description here

    import networkx as nx
    
    df = (recipes.merge(ingredients, on='RecipeId',
                        suffixes=('_recipe', '_ingredient'),
                        how='left')
                 .fillna('drop')
          )
    
    G = nx.from_pandas_edgelist(df, source='Name_recipe', target='Name_ingredient',
                                create_using=nx.DiGraph)
    
    G.remove_node('drop')
    
    out = {n: sum(1 for x in nx.descendants(G, n) if G.out_degree(x)==0)
           for n in G.nodes}
    

    output:

    {'Pizza': 5,
     'Flour': 0,
     'Water': 0,
     'Sausage': 2,
     'Olives': 0,
     'Salt': 0,
     'Soup': 0,
     'Sushi': 0,
     'Cake': 0,
     'Sandwich': 4,
     'Bread': 3,
     'Ground meat': 0}
    

    output as pandas Series

    pd.Series(out).reindex(recipes['Name'])
    

    output:

    Name
    Pizza       5
    Soup        0
    Sushi       0
    Cake        0
    Sandwich    4
    Bread       3
    Sausage     2
    dtype: int64