
Neo4j database exploding due to Lucene logs when properties are added to nodes


I am experiencing a problem with Neo4j, where the directory graph.db/index/ increases dramatically in size due to many large lucene.log.vXXX files being produced. This happens for a computation which does not use indexing at all, but simply adds numerical properties to some nodes in the network.

The problem is reproducible for versions 2.1.3, 2.1.7, and 2.2.0 on two different 64-bit computers running Ubuntu Linux (14.04.1 and 14.04.2).

My database:

  • 16’636’351 nodes with 4 properties: id (string), name (string), country code (string), and type (string).
  • 14’724’489 weighted links.

This results in a graph.db directory of 11 GB, of which graph.db/index/ takes up 2.4 GB.

I use Neo4j embedded in Java and always instantiate as follows:

        // Property keys that the node auto-index should cover
        String i1 = "id";
        String i2 = "name";
        String i3 = "country";
        String i4 = "type";
        String myIndexables = i1 + "," + i2 + "," + i3 + "," + i4;
        // Open the embedded database with node auto-indexing enabled
        GraphDatabaseService gdbs = new GraphDatabaseFactory().newEmbeddedDatabaseBuilder(cfg.dbPath).
                setConfig(GraphDatabaseSettings.node_keys_indexable, myIndexables).
                setConfig(GraphDatabaseSettings.node_auto_indexing, "true").
                setConfig(GraphDatabaseSettings.relationshipstore_mapped_memory_size, "12G").
                ...
                newGraphDatabase();

This way was also used to create (i.e., import) the original 11 GB database.

So far so good.

Now I perform a computation on the database. Ignoring the details, an algorithm calculates a kind of centrality measure for all the nodes in the largest connected component of the network (6’118’740 nodes).

The problem:
Simply adding these newly computed numbers as a property to the 6’118’740 nodes (out of the total of 16’636’351) results in the database exploding to 249 GB with a 243 GB graph.db/index/ directory (due to the lucene.log.vXXX files)!!!

However, if I instantiate as follows without indexing...

        gdbs = new GraphDatabaseFactory().newEmbeddedDatabaseBuilder(cfg.dbPath).
                setConfig(GraphDatabaseSettings.relationshipstore_mapped_memory_size, "12G").
                ...
                newGraphDatabase();

...the result is a database size of 6.9 GB (recall the original was 11 GB!), of which now only 2.2 GB are used for graph.db/index/!!!

What is happening here?


PS
Additional information:

  • Java versions: Java(TM) SE Runtime Environment (build 1.7.0_76-b13) and OpenJDK Runtime Environment (IcedTea 2.5.4) (7u75-2.5.4-1~trusty1)
  • The jar file was exported from Eclipse.
  • The logs don't give any clues when going from the 11 GB database to the 249 GB version.

Solution

  • By default, Neo4j keeps the logical logs for 7 days (older versions have a different value). Since you have auto-indexing enabled, any update to a node might cause an index update, which might be empty if you only change non-indexed properties. Each such update is still written to the index's logical log, which is why the lucene.log.vXXX files pile up.

    To prevent this, shut down the database, make a backup copy, and delete the lucene.log.vXXX files. Then add keep_logical_logs=false as a config option in your startup code.
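
The cleanup step can be sketched with plain Java NIO. This is only an illustration, not part of Neo4j: the class `LuceneLogCleanup` and its methods are hypothetical helpers, and it assumes the versioned logs sit directly in graph.db/index/ as described in the question. It moves the files into a backup directory rather than deleting them outright, and it should only be run while the database is shut down. For the config side, the option would presumably be passed in the builder chain as `.setConfig(GraphDatabaseSettings.keep_logical_logs, "false")`, assuming that constant is available in your Neo4j version.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not a Neo4j API): finds lucene.log.v* files in the
// index directory and moves them to a backup location.
// Run this ONLY while the database is shut down.
class LuceneLogCleanup {

    // Collect regular files named lucene.log.v* directly inside indexDir.
    static List<Path> findVersionedLogs(Path indexDir) throws IOException {
        List<Path> logs = new ArrayList<>();
        try (DirectoryStream<Path> stream =
                     Files.newDirectoryStream(indexDir, "lucene.log.v*")) {
            for (Path p : stream) {
                if (Files.isRegularFile(p)) {
                    logs.add(p);
                }
            }
        }
        return logs;
    }

    // Sum the sizes of the given files in bytes.
    static long totalBytes(List<Path> files) throws IOException {
        long sum = 0;
        for (Path p : files) {
            sum += Files.size(p);
        }
        return sum;
    }

    // Move each file into backupDir, creating the directory if needed.
    static void moveTo(List<Path> files, Path backupDir) throws IOException {
        Files.createDirectories(backupDir);
        for (Path p : files) {
            Files.move(p, backupDir.resolve(p.getFileName()));
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.err.println("usage: LuceneLogCleanup <indexDir> <backupDir>");
            return;
        }
        Path indexDir = Paths.get(args[0]);   // e.g. the graph.db/index directory
        Path backupDir = Paths.get(args[1]);  // any directory outside graph.db
        List<Path> logs = findVersionedLogs(indexDir);
        System.out.println(logs.size() + " log files, "
                + totalBytes(logs) + " bytes");
        moveTo(logs, backupDir);
    }
}
```

Once the database starts and behaves as expected with the new setting, the backup directory can be removed by hand. Moving instead of deleting keeps the step reversible in case the logs turn out to be needed.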