Search code examples
neo4jcypher

Optimazing neo4j cypher query for recommendation


This is my neo4j db schema:

enter image description here

There is around 2.5 millions of Article nodes, 0.5 million of NamedEntityNodes and few thousand of Trend nodes. Articles have publication datetime and they are from about last two years. As an input from a user I get list of NamedEntitiesIds. And I want to have query which will find articles with best connections between those input NamedEntitiesIds and Trends from my database. Query is quite long because I am doing some scoring but I want to put it whole here.

MATCH (t:Trend)--(x:NamedEntity)-[xv:OCCUR]-(a:Article)-[v:OCCUR]-(n:NamedEntity)
        WHERE n.id IN ["polski związek narciarski_orgName", "Polska_placeName_country", "Kamila Stoch_persName", "Kamila_persName_surname", "Stoch_persName_surname", "Innsbruck_placeName_settlement", "Bischofshofen_placeName_settlement", "niemiecki_placeName_country", "Oberstdorfie_placeName_settlement", "47_placeName_settlement", "Garmisch_placeName_settlement", "Partenkirchen.nTo_placeName_settlement", "Stoch_persName", "katowicki_placeName_settlement", "AWF.nTCS_orgName", "polski_placeName_country", "Polak_placeName_country", "Adam Małysz_persName", "Adam_persName_forename", "Małysz_persName_surname", "Kamil Stoch_persName", "Kamil_persName_forename", "Piotr Żyła_persName", "Piotr_persName_forename", "żyć_persName_surname", "Stoch_persName_addName", "Kaczmarski_persName", "Kaczmarski_persName_surname"] and t.date > date(datetime($currentDay) - duration({days: $daysDelta})) and a.publication_datetime > datetime($currentDay) - duration({days: $daysDelta})
        WITH a,t, collect(distinct n) as distinctLinkNes, collect(distinct x) as distinctTrendNes,
            sum(
                CASE
                    WHEN n.category in ['persName', 'orgName'] THEN 2*v.amount
                    WHEN n.category in ['persName_surname', 'persName_addName'] THEN 1.5*v.amount
                    WHEN n.category in ['date', 'time', 'persName_forename'] THEN 0.5*v.amount
                    ELSE 1.0*v.amount
                END) as linkSum,
            sum(
                CASE
                    WHEN x.category in ['persName', 'orgName'] THEN 2*xv.amount
                    WHEN x.category in ['persName_surname', 'persName_addName'] THEN 1.5*xv.amount
                    WHEN x.category in ['date', 'time', 'persName_forename'] THEN 0.5*xv.amount
                    ELSE 1.0*xv.amount
                END) as trendSum
        WITH a,t, linkSum, trendSum,
            reduce(total=0, ne in distinctLinkNes |
            total + 
            CASE
                WHEN ne.category in ['persName', 'orgName'] THEN 2
                WHEN ne.category in ['persName_surname', 'persName_addName'] THEN 1.5
                WHEN ne.category in ['date', 'time', 'persName_forename'] THEN 0.5
                ELSE 1.0
            END) as distinctLinkNesAmount,
            reduce(total=0, ne in distinctTrendNes |
            total + 
            CASE
                WHEN ne.category in ['persName', 'orgName'] THEN 2
                WHEN ne.category in ['persName_surname', 'persName_addName'] THEN 1.5
                WHEN ne.category in ['date', 'time', 'persName_forename'] THEN 0.5
                ELSE 1.0
            END) as distinctTrendNesAmount
        WITH a, t, distinctTrendNesAmount, trendSum, distinctLinkNesAmount, linkSum,
            (3*distinctTrendNesAmount + trendSum) * t.hits / 1000 as trendScore, 
            (3*distinctLinkNesAmount + linkSum) * ($daysDelta - duration.between(a.publication_datetime, date($currentDay)).days) as articleScore
        WITH a, t, distinctTrendNesAmount, trendSum, trendScore, distinctLinkNesAmount, linkSum, articleScore,
            (articleScore + trendScore) as score
        ORDER BY score DESC
        RETURN a as article, t, distinctTrendNesAmount, trendSum, trendScore, distinctLinkNesAmount, linkSum, articleScore, score
        LIMIT 30

daysDelta will be usually latest 7-14 days. This query is pretty slow. Depending on the input NamedEntitiesId i takes from few seconds up to few minutes. I tried to debug this using PROFILE here it is the result: enter image description here

From what I had read I should decrease cardinality. (or maybe something else I am happy for suggestions). But I have no idea how to do it in my query. When I have to do all this scoring. And in my query I will never have to get more then 30 results.


Solution

  • I would suggest a minor tweak in your schema. In the NamedEntity node, store an additional property named multiplicationFactor, which will store the values 2, 1.5, 1.0, 0.5 that you are using in your case statements. After analyzing the profile graph, I have noticed that the aggregation operations are more expensive than the graph traversals. So this one change should help a lot. Set the new property using this query:

    MATCH (n:NamedEntity)
    WITH n, CASE
                WHEN n.category in ['persName', 'orgName'] THEN 2
                WHEN n.category in ['persName_surname', 'persName_addName'] THEN 1.5
                WHEN n.category in ['date', 'time', 'persName_forename'] THEN 0.5
                ELSE 1.0
            END AS multiplicationFactor
    SET n.multiplicationFactor = multiplicationFactor
    

    Your recommendation query will now become this:

    MATCH (t:Trend)--(x:NamedEntity)-[xv:OCCUR]-(a:Article)-[v:OCCUR]-(n:NamedEntity)
    WHERE n.id IN ["polski związek narciarski_orgName", "Polska_placeName_country", "Kamila Stoch_persName", "Kamila_persName_surname", "Stoch_persName_surname", "Innsbruck_placeName_settlement", "Bischofshofen_placeName_settlement", "niemiecki_placeName_country", "Oberstdorfie_placeName_settlement", "47_placeName_settlement", "Garmisch_placeName_settlement", "Partenkirchen.nTo_placeName_settlement", "Stoch_persName", "katowicki_placeName_settlement", "AWF.nTCS_orgName", "polski_placeName_country", "Polak_placeName_country", "Adam Małysz_persName", "Adam_persName_forename", "Małysz_persName_surname", "Kamil Stoch_persName", "Kamil_persName_forename", "Piotr Żyła_persName", "Piotr_persName_forename", "żyć_persName_surname", "Stoch_persName_addName", "Kaczmarski_persName", "Kaczmarski_persName_surname"] and t.date > date(datetime($currentDay) - duration({days: $daysDelta})) and a.publication_datetime > datetime($currentDay) - duration({days: $daysDelta})
    WITH a,t, collect(distinct n) as distinctLinkNes, collect(distinct x) as distinctTrendNes,
                sum(n.multiplicationFactor * v.amount) as linkSum,
                sum(x.multiplicationFactor * xv.amount) as trendSum
    WITH a,t, linkSum, trendSum,
                reduce(total=0, ne in distinctLinkNes |
                total + ne.multiplicationFactor) as distinctLinkNesAmount,
                reduce(total=0, ne in distinctTrendNes |
                total + ne.multiplicationFactor) as distinctTrendNesAmount
    WITH a, t, distinctTrendNesAmount, trendSum, distinctLinkNesAmount, linkSum,
                (3*distinctTrendNesAmount + trendSum) * t.hits / 1000 as trendScore, 
                (3*distinctLinkNesAmount + linkSum) * ($daysDelta - duration.between(a.publication_datetime, date($currentDay)).days) as articleScore
    WITH a, t, distinctTrendNesAmount, trendSum, trendScore, distinctLinkNesAmount, linkSum, articleScore,
                (articleScore + trendScore) as score
    ORDER BY score DESC
    RETURN a as article, t, distinctTrendNesAmount, trendSum, trendScore, distinctLinkNesAmount, linkSum, articleScore, score
    LIMIT 30