Search code examples
graphneo4jcypher

Neo4j graph creation query


Neo4j: Issues with WHERE Scope and Query Optimization in Fraud Detection Graph

I'm new to Neo4j and trying to build a fraud detection graph using a dataset found online.

Environment:

  • Neo4j Desktop: 1.6.1
  • Neo4j DB Version: 5.24.0

Issues

I am facing two main problems with my Cypher query:

  1. Scope of WHERE Clauses:

    • The WHERE statements apply to the entire following part of the query instead of only the relevant section.
    • I tried using CALL statements but couldn't make them work correctly.
  2. Memory Issue & Query Performance:

    • When I try to run the entire query (with fixes for the first issue), I run into memory problems.
    • I need a way to optimize the query or split it into smaller subqueries.

Indexing

Before running the query, I created the following indexes:

CREATE INDEX person_index FOR (p:PERSONNE) ON (p.Numero_client);
CREATE INDEX phone_index FOR (t:TELEPHONE) ON (t.Numero);
CREATE INDEX car_index FOR (v:VOITURE) ON (v.Immatriculation);
CREATE INDEX email_index FOR (e:EMAIL) ON (e.Adresse_mail);
CREATE INDEX address_index FOR (a:ADRESSE) ON (a.Rue, a.Code_postal, a.Commune);

Full Query

LOAD CSV WITH HEADERS FROM 'file:///data.csv' AS row  

// 1⃣ - Create PERSONNE nodes  
MERGE (p:PERSONNE {Numero_client: row.anc_ref})  
SET p.Chainage = row.chainage,  
    p.Sexe = row.sexe,  
    p.cres_soc = row.cres_soc,  
    p.Age = row.age_cli,  
    p.Nom = row.nom_cli,  
    p.Prenom = row.pnom_cli,  
    p.Nom_conjoint = row.nom_cjt,  
    p.Prenom_conjoint = row.pnom_cjt  
WITH p, row  
WHERE row.cres_soc <> "00" AND row.dres_soc IS NOT NULL  
SET p.dres_soc = row.dres_soc  

// 2⃣ - Create TELEPHONE nodes and relationships  
WITH p, row  
UNWIND [row.num_tel1, row.num_tel2] AS tel  
WITH p, row, tel  
WHERE tel IS NOT NULL AND tel <> "" AND tel <> " "  
MERGE (t:TELEPHONE {Numero: tel})  
MERGE (p)-[:A_POUR_TELEPHONE {type: CASE WHEN tel = row.num_tel1 THEN "PRINCIPAL" ELSE "SECONDAIRE" END}]->(t)  

// 3⃣ - Create VOITURE nodes  
WITH p, row  
WHERE row.num_mnrl1 IS NOT NULL AND row.num_mnrl1 <> "" AND row.num_mnrl1 <> " "  
MERGE (v:VOITURE {Immatriculation: row.num_mnrl1})  
MERGE (p)-[:POSSEDE_VOITURE]->(v)  

// 4⃣ - Create EMAIL nodes  
WITH p, row  
WHERE row.mail IS NOT NULL AND row.mail <> "" AND row.mail <> " "  
MERGE (e:EMAIL {Adresse_mail: row.mail})  
MERGE (p)-[:A_POUR_MAIL]->(e)  

// 5⃣ - Create ADRESSE nodes  
WITH p, row  
WHERE row.rue IS NOT NULL AND row.cd_post IS NOT NULL AND row.commune IS NOT NULL  
MERGE (a:ADRESSE {Rue: row.rue, Code_postal: row.cd_post, Commune: row.commune})  
MERGE (p)-[:RESIDE_A]->(a);

Attempted Solution: Splitting Query

I tried breaking the query into smaller parts:

Step 1: Creating PERSONNE nodes

LOAD CSV WITH HEADERS FROM 'file:///data.csv' AS row  
MERGE (p:PERSONNE {Numero_client: row.anc_ref})  
SET p.Chainage = row.chainage,  
    p.Sexe = row.sexe,  
    p.cres_soc = row.cres_soc,  
    p.Age = row.age_cli,  
    p.Nom = row.nom_cli,  
    p.Prenom = row.pnom_cli,  
    p.Nom_conjoint = row.nom_cjt,  
    p.Prenom_conjoint = row.pnom_cjt  
WITH p, row  
WHERE row.cres_soc <> "00" AND row.dres_soc IS NOT NULL  
SET p.dres_soc = row.dres_soc  

Step 2: Adding TELEPHONE nodes and relationships

LOAD CSV WITH HEADERS FROM 'file:///data.csv' AS row  
MATCH (p:PERSONNE {Numero_client: row.anc_ref})  
WITH p, row  
UNWIND [row.num_tel1, row.num_tel2] AS tel  
WITH p, row, tel   
WHERE tel IS NOT NULL AND tel <> "" AND tel <> " "  
MERGE (t:TELEPHONE {Numero: tel})  
MERGE (p)-[:A_POUR_TELEPHONE {type: CASE WHEN tel = row.num_tel1 THEN "PRINCIPAL" ELSE "SECONDAIRE" END}]->(t)

This approach sometimes works, but most of the time, I get one of the following errors:

Error Message

ServiceUnavailable  
WebSocket connection failure. Due to security constraints in your web browser, the reason for the failure is not available to this Neo4j Driver. Please use your browser's development console to determine the root cause of the failure.  
Common reasons include:
- The database being unavailable  
- Using the wrong connection URL  
- Temporary network problems  
- Browser not trusting the Neo4j certificate  
WebSocket `readyState` is: 3

Or:

Neo.TransientError.General.OutOfMemoryError
There is not enough memory to perform the current task. 
Please try increasing 'server.memory.heap.max_size' in the neo4j configuration (normally in 'conf/neo4j.conf' or, if you are using Neo4j Desktop, found through the user interface)
or if you are running an embedded installation increase the heap by using '-Xmx' command line flag, and then restart the database.

Questions

  1. How can I ensure that WHERE statements only apply to the relevant section of my query?
  2. What are the best ways to optimize this query to prevent memory issues?
  3. Is there a correct way to split the query without running into the indefinite execution issue with MATCH after LOAD CSV?
  4. How do I troubleshoot the WebSocket connection failure in Neo4j Desktop?

Any help would be greatly appreciated!


Solution

  • This answer attempts to address questions 1, 2, and 3.

    You can use CALL subqueries to perform updates to the DB, and that can be done in ways that do not affect the number of rows in the outer query. For example, most of the subqueries in the example below are Unit subqueries that do not RETURN anything), which never affect the number of rows in the outer query. The first subquery does RETURN something (p), but it returns a value for every outer row (because it does not filter out any rows), so it also does not affect the number of outer rows.

    Also, you can have those subqueries execute multiple transactions, to process subquery rows in batches (of 1000, by default), to avoid running out of memory.

    Potentially, something like this may work for you:

    :auto // Only needed if running in Neo4j Browser
    
    LOAD CSV WITH HEADERS FROM 'file:///data.csv' AS row  
    
    // 1 - Create PERSONNE nodes  
    CALL (row) {
      MERGE (p:PERSONNE {Numero_client: row.anc_ref})  
      SET
        p.Chainage = row.chainage,
        p.Sexe = row.sexe,  
        p.cres_soc = row.cres_soc,
        p.dres_soc = CASE WHEN row.cres_soc <> "00" THEN row.dres_soc END,
        p.Age = row.age_cli,  
        p.Nom = row.nom_cli,  
        p.Prenom = row.pnom_cli,  
        p.Nom_conjoint = row.nom_cjt,  
        p.Prenom_conjoint = row.pnom_cjt
      RETURN p
    } IN TRANSACTIONS
    
    // 2 - Create TELEPHONE nodes and relationships  
    CALL (p, row) {
      WITH p, row  
      UNWIND [row.num_tel1, row.num_tel2] AS tel  
      WITH p, row, tel  
      WHERE tel IS NOT NULL AND tel <> "" AND tel <> " "  
      MERGE (t:TELEPHONE {Numero: tel})  
      MERGE (p)-[:A_POUR_TELEPHONE {type: CASE WHEN tel = row.num_tel1 THEN "PRINCIPAL" ELSE "SECONDAIRE" END}]->(t)
    } IN TRANSACTIONS
    
    // 3 - Create VOITURE nodes  
    CALL (p, row) {
      WITH p, row  
      WHERE row.num_mnrl1 IS NOT NULL AND row.num_mnrl1 <> "" AND row.num_mnrl1 <> " "  
      MERGE (v:VOITURE {Immatriculation: row.num_mnrl1})  
      MERGE (p)-[:POSSEDE_VOITURE]->(v)  
    } IN TRANSACTIONS
    
    // 4 - Create EMAIL nodes
    CALL (p, row) {
      WITH p, row  
      WHERE row.mail IS NOT NULL AND row.mail <> "" AND row.mail <> " "  
      MERGE (e:EMAIL {Adresse_mail: row.mail})  
      MERGE (p)-[:A_POUR_MAIL]->(e)  
    } IN TRANSACTIONS
    
    // 5 - Create ADRESSE nodes
    CALL (p, row) {
      WITH p, row  
      WHERE row.rue IS NOT NULL AND row.cd_post IS NOT NULL AND row.commune IS NOT NULL  
      MERGE (a:ADRESSE {Rue: row.rue, Code_postal: row.cd_post, Commune: row.commune})  
      MERGE (p)-[:RESIDE_A]->(a)
    } IN TRANSACTIONS
    

    Notes:

    • The :auto command should only be used if running in the Neo4j Browser (it is needed for the IN TRANSACTIONS functionality).
    • You may want to specify larger batch sizes for some or all CALL subqueries (see the docs for the syntax).
    • This query assumes you are using at least neo4j version 5.23. If you are not, you need to replace the variable-scope-clause syntax with prepended WITH clauses in each subquery.
    • This query can still run out of memory if the data file has a large number of rows (since their data, plus data for the new p nodes, have to held in memory), in which case you many want to split the file into smaller ones to be processed separately.