find the largest count of ingredients

I have two df. In the first df, all the recipes, in the second, the ingredients of the recipes

Recipes

       RecipeId     Name
0      789         Pizza
1      790         Soup
2      791         Sushi
3      792         Cake
4      793        Sandwich
5      794         Bread
6      795        Sausage

Ingredients

     id  RecipeId         Name
0     1       789        Flour
1     2       789        Water
2   795       789      Sausage
3     3       789       Olives
4     4       789         Salt
5   794       793        Bread
6   795       793      Sausage
7     1       794        Flour
8     2       794        Water
9     4       794         Salt
10    5       795  Ground meat
11    4       795         Salt

1.How can I find the recipe with the most ingredients, including all ingredients, such as a sandwich, which has bread (flour, water, salt) and sausages (ground meat and salt) = 5 ingredients? 2.How can I find the recipe with the most nested recipes (not single ingredients) 3.And the ingredient most often used in recipes?

Solution

You cannot solve this easily with pandas only. This is a graph problem.

Here is your (directed) graph:

You can use network to build the graph from your DataFrame, the count the number of terminal descendants per node:

import networkx as nx

df = (recipes.merge(ingredients, on='RecipeId',
                    suffixes=('_recipe', '_ingredient'))
      )


G = nx.from_pandas_edgelist(df, source='Name_recipe', target='Name_ingredient',
                            create_using=nx.DiGraph)

counts = {n: sum(1 for x in nx.descendants(G, n) if G.out_degree(x)==0)
          for n in G.nodes}

output:

{'Pizza': 5,
 'Flour': 0,
 'Water': 0,
 'Sausage': 2,
 'Olives': 0,
 'Salt': 0,
 'Sandwich': 4,
 'Bread': 3,
 'Ground meat': 0}

left merge

If you want to count soup/sushi/cake you can use a left merge and skip the NaN:

import networkx as nx

df = (recipes.merge(ingredients, on='RecipeId',
                    suffixes=('_recipe', '_ingredient'),
                    how='left')
             .fillna('drop')
      )

G = nx.from_pandas_edgelist(df, source='Name_recipe', target='Name_ingredient',
                            create_using=nx.DiGraph)

G.remove_node('drop')

out = {n: sum(1 for x in nx.descendants(G, n) if G.out_degree(x)==0)
       for n in G.nodes}

output:

{'Pizza': 5,
 'Flour': 0,
 'Water': 0,
 'Sausage': 2,
 'Olives': 0,
 'Salt': 0,
 'Soup': 0,
 'Sushi': 0,
 'Cake': 0,
 'Sandwich': 4,
 'Bread': 3,
 'Ground meat': 0}

output as pandas Series

pd.Series(out).reindex(recipes['Name'])

output:

Name
Pizza       5
Soup        0
Sushi       0
Cake        0
Sandwich    4
Bread       3
Sausage     2
dtype: int64