Search code examples
neo4jgraphenedb

Neo4j LOAD CSV Application Threads Blocked


I am trying to import about 500,000 lines of data from a CSV file using the LOAD CSV command in neo4j.

The data in the csv is organized like this:

Artists    |    Feature1    |    Feature2    |    Feature3

Each of the columns is filled with names of music artists, most of who show up more than once in a column. If an artist name shows up in any column, I want there to be a node for that artist. If that artists' name shows up more than once in a column an/or in more than one column, I would like for there to be one and only one node for that artist.

For each artist, I want to keep track of who they feature and who they feature for. Each row of the csv represents a song. For each song an artist makes (each row of the csv), I'd like to add a FEATURES relationship from the artist in the Artist column to the artist in the Features1/2/3 columns.

Here is the code I am using:

CREATE CONSTRAINT ON (a:Artist) ASSERT a.artistName IS UNIQUE;

USING PERIODIC COMMIT 50
LOAD CSV WITH HEADERS from 'https://aws.bigfile.csv' as line
MERGE (artist:Artist {artistName: line.Artist})
MERGE (feature1:Artist {artistName: line.Feature1})
MERGE (feature2:Artist {artistName: line.Feature2})
MERGE (feature3:Artist {artistName: line.Feature3})

MERGE (artist)-[f1:FEATURES]->(feature1) 
ON CREATE SET f1.strength = 1
ON MATCH SET f1.strength = f1.strength + 1

MERGE (artist)-[f2:FEATURES]->(feature2) 
ON CREATE SET f2.strength = 1
ON MATCH SET f2.strength = f2.strength + 1

MERGE (artist)-[f3:FEATURES]->(feature3) 
ON CREATE SET f3.strength = 1
ON MATCH SET f3.strength = f3.strength + 1

Desired behavior: the first occurrence of someone featuring another artist creates the FEATURES relationship and should set the strength property of the FEATURES relationship equal to 1. For every subsequent occurrence, the strength property goes up by 1. Therefore, an artist A who frequently features an artist B should have a relationship like (a)-[:FEATURES {strength: AHighNumber]->(b)

Relationships are directional and direction matters in this case (A featuring B is different than B featuring A).

There should be over 10,000 distinct artists and therefore nodes, but around 2,000 nodes I start getting issues with system timeout.

I am getting a bunch of the following message in the logs:

2017-12-30 10:54:04.268+0000 WARN [o.n.k.i.c.MonitorGc] GC Monitor: Application threads blocked for 467ms.

Is there any other information that may be useful in determining the issue? Any idea how I could restructure my code to avoid this issue? All help is greatly appreciated. Thanks!


Solution

  • If you paste the query (don't run it, just paste it) into the browser, you'll get a warning on the side, which, when expended, reads:

    The execution plan for this query contains the Eager operator, which forces all dependent data to be materialized in main memory before proceeding. Using LOAD CSV with a large data set in a query where the execution plan contains the Eager operator could potentially consume a lot of memory and is likely to not perform well. See the Neo4j Manual entry on the Eager operator for more information and hints on how problems could be avoided.

    That explains why you're seeing performance issues...it effectively cancels out any use of periodic commit.

    You can get eager operations in the query plan when you have multiple MERGEs on the same node label. The general approach to getting around this kind of issue is to spend a pass merging in just the nodes from a single variable in the CSV:

    USING PERIODIC COMMIT 50
    LOAD CSV WITH HEADERS from 'https://aws.bigfile.csv' as line
    MERGE (artist:Artist {artistName: line.Artist})
    

    If line.Artist in your CSV has all possible artists, then a single pass should work. However if there are other artists in line.Feature1 (and the others) that aren't in line.Artist, then you'll need to do another pass for each one in turn until all your nodes are loaded (you can also push PERIODIC COMMIT up to 5000 to 10000 or so).

    Note that your multiple MERGEs of the same relationship type to multiple nodes is also contributing to the EAGER operation. You should really see if you can get your CSV in a different format. You don't really need all those extra columns, line.Artist and line.Feature should be enough, and then you'll have multiple rows with the same artist and a different Feature, and you won't get caught up with eager operations.

    Your alternative, to avoid the eager operations and allow usage of periodic commit, is to to execute the query (after you've imported all your nodes) in 3 passes:

    USING PERIODIC COMMIT 10000
    LOAD CSV WITH HEADERS from 'https://aws.bigfile.csv' as line
    match (artist:Artist {artistName: line.Artist})
    match (feature:Artist {artistName: line.Feature1})
    
    MERGE (artist)-[f:FEATURES]->(feature1) 
    ON CREATE SET f.strength = 1
    ON MATCH SET f.strength = f.strength + 1
    

    Then repeat but for line.Feature2 and line.Feature3.