This is my neo4j db schema:
There is around 2.5 millions of Article nodes, 0.5 million of NamedEntityNodes and few thousand of Trend nodes. Articles have publication datetime and they are from about last two years. As an input from a user I get list of NamedEntitiesIds. And I want to have query which will find articles with best connections between those input NamedEntitiesIds and Trends from my database. Query is quite long because I am doing some scoring but I want to put it whole here.
MATCH (t:Trend)--(x:NamedEntity)-[xv:OCCUR]-(a:Article)-[v:OCCUR]-(n:NamedEntity)
WHERE n.id IN ["polski związek narciarski_orgName", "Polska_placeName_country", "Kamila Stoch_persName", "Kamila_persName_surname", "Stoch_persName_surname", "Innsbruck_placeName_settlement", "Bischofshofen_placeName_settlement", "niemiecki_placeName_country", "Oberstdorfie_placeName_settlement", "47_placeName_settlement", "Garmisch_placeName_settlement", "Partenkirchen.nTo_placeName_settlement", "Stoch_persName", "katowicki_placeName_settlement", "AWF.nTCS_orgName", "polski_placeName_country", "Polak_placeName_country", "Adam Małysz_persName", "Adam_persName_forename", "Małysz_persName_surname", "Kamil Stoch_persName", "Kamil_persName_forename", "Piotr Żyła_persName", "Piotr_persName_forename", "żyć_persName_surname", "Stoch_persName_addName", "Kaczmarski_persName", "Kaczmarski_persName_surname"] and t.date > date(datetime($currentDay) - duration({days: $daysDelta})) and a.publication_datetime > datetime($currentDay) - duration({days: $daysDelta})
WITH a,t, collect(distinct n) as distinctLinkNes, collect(distinct x) as distinctTrendNes,
sum(
CASE
WHEN n.category in ['persName', 'orgName'] THEN 2*v.amount
WHEN n.category in ['persName_surname', 'persName_addName'] THEN 1.5*v.amount
WHEN n.category in ['date', 'time', 'persName_forename'] THEN 0.5*v.amount
ELSE 1.0*v.amount
END) as linkSum,
sum(
CASE
WHEN x.category in ['persName', 'orgName'] THEN 2*xv.amount
WHEN x.category in ['persName_surname', 'persName_addName'] THEN 1.5*xv.amount
WHEN x.category in ['date', 'time', 'persName_forename'] THEN 0.5*xv.amount
ELSE 1.0*xv.amount
END) as trendSum
WITH a,t, linkSum, trendSum,
reduce(total=0, ne in distinctLinkNes |
total +
CASE
WHEN ne.category in ['persName', 'orgName'] THEN 2
WHEN ne.category in ['persName_surname', 'persName_addName'] THEN 1.5
WHEN ne.category in ['date', 'time', 'persName_forename'] THEN 0.5
ELSE 1.0
END) as distinctLinkNesAmount,
reduce(total=0, ne in distinctTrendNes |
total +
CASE
WHEN ne.category in ['persName', 'orgName'] THEN 2
WHEN ne.category in ['persName_surname', 'persName_addName'] THEN 1.5
WHEN ne.category in ['date', 'time', 'persName_forename'] THEN 0.5
ELSE 1.0
END) as distinctTrendNesAmount
WITH a, t, distinctTrendNesAmount, trendSum, distinctLinkNesAmount, linkSum,
(3*distinctTrendNesAmount + trendSum) * t.hits / 1000 as trendScore,
(3*distinctLinkNesAmount + linkSum) * ($daysDelta - duration.between(a.publication_datetime, date($currentDay)).days) as articleScore
WITH a, t, distinctTrendNesAmount, trendSum, trendScore, distinctLinkNesAmount, linkSum, articleScore,
(articleScore + trendScore) as score
ORDER BY score DESC
RETURN a as article, t, distinctTrendNesAmount, trendSum, trendScore, distinctLinkNesAmount, linkSum, articleScore, score
LIMIT 30
daysDelta
will be usually latest 7-14 days. This query is pretty slow. Depending on the input NamedEntitiesId i takes from few seconds up to few minutes. I tried to debug this using PROFILE here it is the result:
From what I had read I should decrease cardinality. (or maybe something else I am happy for suggestions). But I have no idea how to do it in my query. When I have to do all this scoring. And in my query I will never have to get more then 30 results.
I would suggest a minor tweak in your schema. In the NamedEntity
node, store an additional property named multiplicationFactor
, which will store the values 2, 1.5, 1.0, 0.5
that you are using in your case statements. After analyzing the profile graph, I have noticed that the aggregation operations are more expensive than the graph traversals. So this one change should help a lot. Set the new property using this query:
MATCH (n:NamedEntity)
WITH n, CASE
WHEN n.category in ['persName', 'orgName'] THEN 2
WHEN n.category in ['persName_surname', 'persName_addName'] THEN 1.5
WHEN n.category in ['date', 'time', 'persName_forename'] THEN 0.5
ELSE 1.0
END AS multiplicationFactor
SET n.multiplicationFactor = multiplicationFactor
Your recommendation query will now become this:
MATCH (t:Trend)--(x:NamedEntity)-[xv:OCCUR]-(a:Article)-[v:OCCUR]-(n:NamedEntity)
WHERE n.id IN ["polski związek narciarski_orgName", "Polska_placeName_country", "Kamila Stoch_persName", "Kamila_persName_surname", "Stoch_persName_surname", "Innsbruck_placeName_settlement", "Bischofshofen_placeName_settlement", "niemiecki_placeName_country", "Oberstdorfie_placeName_settlement", "47_placeName_settlement", "Garmisch_placeName_settlement", "Partenkirchen.nTo_placeName_settlement", "Stoch_persName", "katowicki_placeName_settlement", "AWF.nTCS_orgName", "polski_placeName_country", "Polak_placeName_country", "Adam Małysz_persName", "Adam_persName_forename", "Małysz_persName_surname", "Kamil Stoch_persName", "Kamil_persName_forename", "Piotr Żyła_persName", "Piotr_persName_forename", "żyć_persName_surname", "Stoch_persName_addName", "Kaczmarski_persName", "Kaczmarski_persName_surname"] and t.date > date(datetime($currentDay) - duration({days: $daysDelta})) and a.publication_datetime > datetime($currentDay) - duration({days: $daysDelta})
WITH a,t, collect(distinct n) as distinctLinkNes, collect(distinct x) as distinctTrendNes,
sum(n.multiplicationFactor * v.amount) as linkSum,
sum(x.multiplicationFactor * xv.amount) as trendSum
WITH a,t, linkSum, trendSum,
reduce(total=0, ne in distinctLinkNes |
total + ne.multiplicationFactor) as distinctLinkNesAmount,
reduce(total=0, ne in distinctTrendNes |
total + ne.multiplicationFactor) as distinctTrendNesAmount
WITH a, t, distinctTrendNesAmount, trendSum, distinctLinkNesAmount, linkSum,
(3*distinctTrendNesAmount + trendSum) * t.hits / 1000 as trendScore,
(3*distinctLinkNesAmount + linkSum) * ($daysDelta - duration.between(a.publication_datetime, date($currentDay)).days) as articleScore
WITH a, t, distinctTrendNesAmount, trendSum, trendScore, distinctLinkNesAmount, linkSum, articleScore,
(articleScore + trendScore) as score
ORDER BY score DESC
RETURN a as article, t, distinctTrendNesAmount, trendSum, trendScore, distinctLinkNesAmount, linkSum, articleScore, score
LIMIT 30