Search code examples
neo4jdata-modelingkeyword-search

Neo4j data-model of documents, keywords, and word stems for searching


My goal is to do two different kinds of searches for documents using neo4j. I'll use recipes(documents) for my example. Say I have a list of ingredients(key-words) on-hand (milk, butter, flour, salt, sugar, eggs...) and I have some recipes in my database with ingredients attached to each recipe. I'd like to input my list and get two different results. One would be the recipes that most closely include ALL the ingredients I entered. The second would be combinations of recipes that together include all of my ingredients.

Given: milk, butter, flour, salt, sugar, eggs

a search result for the first case might be:

1.)Sugar cookies

2.)Butter cookies

A result for the second might be:

1.)Flat bread and Gogel-Mogel

I'm reading in recipes to insert into neo4j, and pulling out the ingredients from the ingredients list at the top of each recipe, but then also from the recipe instructions. I want to weigh these differently, maybe 60/40 in favor of the ingredients list.

I would also like to stem each ingredient in case people enter similar words.

I'm struggling to come up with a good data model in neo4j. I plan for a user to enter English ingredients, and I will stem them in the background, and use that for searching on.

My first thought was: neo4j data model 1 This is intuitive to me but is a lot of hops to find all the recipes.

Next maybe this: neo4j data model 2

Which gets to the recipes directly from the stems, but I would need to pass recipe ids in the relationships(right?) to get to the actual ingredients.

Third, maybe combine them like this? neo4j data model 3 but there's lots of duplication.

Here are also some CYPHER statements to create the first idea:

//Create 4 recipes
create (r1:Recipe {rid:'1', title:'Sugar cookies'}), (r2:Recipe {rid:'2', title:'Butter cookies'}), 
(r3:Recipe {rid:'3', title:'Flat bread'}), (r4:Recipe {rid:'4', title:'Gogel-Mogel'}) 

//Adding some ingredients
merge (i1:Ingredient {ingredient:"salted butter"})
merge (i2:Ingredient {ingredient:"white sugar"})
merge (i3:Ingredient {ingredient:"brown sugar"})
merge (i4:Ingredient {ingredient:"all purpose flour"})
merge (i5:Ingredient {ingredient:"iodized salt"})
merge (i6:Ingredient {ingredient:"eggs"})
merge (i7:Ingredient {ingredient:"milk"})
merge (i8:Ingredient {ingredient:"powdered sugar"})
merge (i9:Ingredient {ingredient:"wheat flour"})
merge (i10:Ingredient {ingredient:"bananas"})
merge (i11:Ingredient {ingredient:"chocolate chips"})
merge (i12:Ingredient {ingredient:"raisins"})
merge (i13:Ingredient {ingredient:"unsalted butter"})
merge (i14:Ingredient {ingredient:"wheat flour"})
merge (i15:Ingredient {ingredient:"himalayan salt"})
merge (i16:Ingredient {ingredient:"chocolate bars"})
merge (i17:Ingredient {ingredient:"vanilla flavoring"})
merge (i18:Ingredient {ingredient:"vanilla"})

//Stems added to each ingredient
merge (i1)<-[:STEM_OF]-(s1:Stem {stem:"butter"})
merge (i2)<-[:STEM_OF]-(s2:Stem {stem:"sugar"})
merge (i3)<-[:STEM_OF]-(s2)
merge (i4)<-[:STEM_OF]-(s4:Stem {stem:"flour"})
merge (i5)<-[:STEM_OF]-(s5:Stem {stem:"salt"})
merge (i6)<-[:STEM_OF]-(s6:Stem {stem:"egg"})
merge (i7)<-[:STEM_OF]-(s7:Stem {stem:"milk"})
merge (i8)<-[:STEM_OF]-(s2)
merge (i9)<-[:STEM_OF]-(s4)
merge (i10)<-[:STEM_OF]-(s10:Stem {stem:"banana"})

merge (i11)<-[:STEM_OF]-(s11:Stem {stem:"chocolate"})
merge (i12)<-[:STEM_OF]-(s12:Stem {stem:"raisin"})
merge (i13)<-[:STEM_OF]-(s1)
merge (i14)<-[:STEM_OF]-(s4)
merge (i15)<-[:STEM_OF]-(s5)
merge (i16)<-[:STEM_OF]-(s11)
merge (i17)<-[:STEM_OF]-(s13:Stem {stem:"vanilla"})
merge (i18)<-[:STEM_OF]-(s13)


create (r1)<-[:INGREDIENTS_LIST{weight:.7}]-(i1)
create (r1)<-[:INGREDIENTS_LIST{weight:.6}]-(i2)    
create (r1)<-[:INGREDIENTS_LIST{weight:.5}]-(i4)
create (r1)<-[:INGREDIENTS_LIST{weight:.4}]-(i5)
create (r1)<-[:INGREDIENTS_LIST{weight:.4}]-(i6)
create (r1)<-[:INGREDIENTS_LIST{weight:.2}]-(i7)
create (r1)<-[:INGREDIENTS_LIST{weight:.1}]-(i18)

create (r2)<-[:INGREDIENTS_LIST{weight:.7}]-(i1)
create (r2)<-[:INGREDIENTS_LIST{weight:.6}]-(i3)    
create (r2)<-[:INGREDIENTS_LIST{weight:.5}]-(i4)
create (r2)<-[:INGREDIENTS_LIST{weight:.4}]-(i5)
create (r2)<-[:INGREDIENTS_LIST{weight:.3}]-(i6)
create (r2)<-[:INGREDIENTS_LIST{weight:.2}]-(i7)
create (r2)<-[:INGREDIENTS_LIST{weight:.1}]-(i18)

create (r3)<-[:INGREDIENTS_LIST{weight:.7}]-(i1)
create (r3)<-[:INGREDIENTS_LIST{weight:.6}]-(i5)
create (r3)<-[:INGREDIENTS_LIST{weight:.5}]-(i7)
create (r3)<-[:INGREDIENTS_LIST{weight:.4}]-(i9)

create (r4)<-[:INGREDIENTS_LIST{weight:.6}]-(i2)
create (r4)<-[:INGREDIENTS_LIST{weight:.5}]-(i6)



create (r1)<-[:INGREDIENTS_INSTR{weight:.2}]-(i1)
create (r1)<-[:INGREDIENTS_INSTR{weight:.2}]-(i2)   
create (r1)<-[:INGREDIENTS_INSTR{weight:.2}]-(i4)
create (r1)<-[:INGREDIENTS_INSTR{weight:.2}]-(i5)
create (r1)<-[:INGREDIENTS_INSTR{weight:.1}]-(i6)
create (r1)<-[:INGREDIENTS_INSTR{weight:.1}]-(i7)


create (r2)<-[:INGREDIENTS_INSTR{weight:.3}]-(i1)
create (r2)<-[:INGREDIENTS_INSTR{weight:.2}]-(i3)   
create (r2)<-[:INGREDIENTS_INSTR{weight:.2}]-(i4)
create (r2)<-[:INGREDIENTS_INSTR{weight:.2}]-(i5)
create (r2)<-[:INGREDIENTS_INSTR{weight:.2}]-(i6)
create (r2)<-[:INGREDIENTS_INSTR{weight:.1}]-(i7)


create (r3)<-[:INGREDIENTS_INSTR{weight:.3}]-(i1)
create (r3)<-[:INGREDIENTS_INSTR{weight:.3}]-(i5)
create (r3)<-[:INGREDIENTS_INSTR{weight:.1}]-(i7)
create (r3)<-[:INGREDIENTS_INSTR{weight:.1}]-(i9)

create (r4)<-[:INGREDIENTS_INSTR{weight:.3}]-(i2)
create (r4)<-[:INGREDIENTS_INSTR{weight:.3}]-(i6)

and a link to a neo4j console with the above statements: http://console.neo4j.org/?id=3o8y44

How much does neo4j care about multiple relationships? Also, I can do a single ingredient, but how would I put together a query to get recipes given more than one ingredient?

Edit: Thank you Michael! That got me further. I was able to expand your answer to this:

WITH split("egg, sugar, chocolate, milk, flour, salt",", ") as terms  UNWIND 
terms as term  MATCH (stem:Stem {stem:term})-[:STEM_OF]->
(ingredient:Ingredient)-[lst:INGREDIENTS_LIST]->(r:Recipe)  WITH r, 
size(terms) - count(distinct stem) as notCovered,  sum(lst.weight) as weight, 
collect(distinct stem.stem) as matched  RETURN r , notCovered,matched, weight 
ORDER BY notCovered ASC, weight DESC

and got a list of ingredients matched and a weight. How would I change the query to also show the weights of the :INGREDIENTS_INSTR relationship so I could use both weights at the same time for calculations? [lst:INGREDIENTS_LIST|INGREDIENTS_INSTR] isn't what I'd like.

EDIT:

This seems to work, is it correct?

WITH split("egg, sugar, chocolate, milk, flour, salt",", ") as terms   UNWIND 
terms as term   MATCH (stem:Stem {stem:term})-[:STEM_OF]->
(ingredient:Ingredient)-[lstl:INGREDIENTS_LIST]->(r:Recipe)<-
[lsti:INGREDIENTS_INSTR]-(ingredient:Ingredient) WITH r, size(terms) - 
count(distinct stem) as notCovered,  sum(lsti.weight) as wi, sum(lstl.weight) 
as wl, collect(distinct stem.stem) as matched   RETURN r , 
notCovered,matched, wl+wi ORDER BY notCovered ASC, wl+wi DESC

Also, could you help with the second query? Where given a list of ingredients, combinations of recipes would be returned that together include the given ingredients. Thanks again!


Solution

  • I would go with your version 1).

    Don't worry about the additional hops. You would put the information about the amount / weight on the relationship between recipe and actual ingredient.

    You can have multiple relationships.

    Here is an example query, that doesn't work with your dataset as you have no recipe which has all ingredients:

    WITH split("milk, butter, flour, salt, sugar, eggs",", ") as terms 
    UNWIND terms as term 
    MATCH (stem:Stem {stem:term})-[:STEM_OF]->(ingredient:Ingredient)-->(r:Recipe) 
    WITH r, size(terms) - count(distinct stem) as notCovered 
    RETURN r ORDER BY notCovered ASC LIMIT 2
    
    +-----------------------------------------+
    | r                                       |
    +-----------------------------------------+
    | Node[0]{rid:"1",title:"Sugar cookies"}  |
    | Node[1]{rid:"2",title:"Butter cookies"} |
    +-----------------------------------------+
    2 rows
    

    The following would be an optimization for large datasets:

    And for querying you would first find all ingredients, and then the recipes attached the most selective one (with the lowest degree).

    And then check the remaining ingredients against each recipe.

    WITH split("milk, butter, flour, salt, sugar, eggs",", ") as terms 
    MATCH (stem:Stem) WHERE stem.stem IN terms
    // highest selective stem first
    WITH stem, terms ORDER BY size((stem)-[:STEM_OF]->()) ASC
    WITH terms, collect(stem) as stems
    WITH head(stems) first, tail(stems) as rest, terms
    MATCH (first)-[:STEM_OF]->(ingredient:Ingredient)-->(r:Recipe) 
    WHERE size[other IN rest WHERE (other)-[:STEM_OF]->(:Ingredient)-->(r)] as covered
    WITH r, size(terms) - 1 - covered as notCovered 
    RETURN r ORDER BY notCovered ASC LIMIT 2