
Neo4j performance tuning for small database


I'm working with a Neo4j 1.9.7 database that holds a fairly small graph:

Nodes               19,806
Properties          230,175
Relationships       83,853
Relationship types  3

I imported this data via Sail so that I can run queries with the SPARQL Plugin. My graph has 3 entity types, BusinessProcess, ApplicationProcess and Database, linked as shown below:

BUS2 ----> APP4 -----> DB10
|            |-------> DB9
|
|------> APP5 -----> DB11
|                      ^
|------> APP6 ---------|

I'm running Neo4j as a server and sending queries to localhost via Advanced Rest Client for Chrome, since I only read from the database. The first 3 queries I ran all worked fine. For example, this SPARQL query

SELECT ?O 
WHERE{ ?S ?P ?O. ?O ?P2 ?OO.} 
GROUP BY ?O HAVING((COUNT(?P2)=1))

returns

[ {
  "O" : "http://neo4j.org#ApplicationProcessI"
} ]

in 6911 ms

which is acceptable for me (I don't need a real-world scenario, just a comparison between a Neo4j solution and the same database in MySQL, so I don't need perfect tuning for either, just response times a human can tolerate =)

I have a problem with this query:

PREFIX bus:<http://neo4j.org#BusinessProcess> 
SELECT ?bus ?db 
WHERE{ bus:9 ?p1 ?app. ?app ?p2 ?db. ?bus ?p1 ?app2. ?app2 ?p2 ?db. FILTER(?bus != bus:2) }
GROUP BY ?bus ?db

It should tell me which business processes share a database with a particular one (BusinessProcess9 in the example). Here my database doesn't seem to reach an answer in an acceptable time (it runs for over an hour without finishing). After reading the manual and other similar questions, I don't think this query should be so problematic. I tried to tune the database a bit, but things didn't improve, so I decided to ask for help.

Store details:

-rw-rw-r-- 1 ivan ivan  14M lug  1 16:12 data/EASample2.db/neostore.propertystore.db.strings
-rw-rw-r-- 1 ivan ivan 9,0M lug  1 16:12 data/EASample2.db/neostore.propertystore.db
-rw-rw-r-- 1 ivan ivan 3,0M lug  1 16:24 data/EASample2.db/neostore.relationshipstore.db
-rw-rw-r-- 1 ivan ivan 340K lug  1 16:12 data/EASample2.db/neostore.propertystore.db.arrays
-rw-rw-r-- 1 ivan ivan 175K lug  1 16:12 data/EASample2.db/neostore.nodestore.db
-rw-rw-r-- 1 ivan ivan  380 lug  1 16:12 data/EASample2.db/neostore.propertystore.db.index.keys
-rw-rw-r-- 1 ivan ivan  152 lug  1 16:12 data/EASample2.db/neostore.relationshiptypestore.db.names
-rw-rw-r-- 1 ivan ivan   81 lug  1 16:12 data/EASample2.db/neostore.propertystore.db.index
-rw-rw-r-- 1 ivan ivan   54 lug  1 16:12 data/EASample2.db/neostore
-rw-rw-r-- 1 ivan ivan   15 lug  1 16:12 data/EASample2.db/neostore.relationshiptypestore.db
-rw-r--r-- 1 root root    9 lug  1 16:12 data/EASample2.db/neostore.id
-rw-r--r-- 1 root root    9 lug  1 16:12 data/EASample2.db/neostore.nodestore.db.id
-rw-r--r-- 1 root root    9 lug  1 16:12 data/EASample2.db/neostore.propertystore.db.arrays.id
-rw-r--r-- 1 root root    9 lug  1 16:12 data/EASample2.db/neostore.propertystore.db.id
-rw-r--r-- 1 root root    9 lug  1 16:12 data/EASample2.db/neostore.propertystore.db.index.id
-rw-r--r-- 1 root root    9 lug  1 16:12 data/EASample2.db/neostore.propertystore.db.index.keys.id
-rw-r--r-- 1 root root    9 lug  1 16:12 data/EASample2.db/neostore.propertystore.db.strings.id
-rw-r--r-- 1 root root    9 lug  1 16:12 data/EASample2.db/neostore.relationshipstore.db.id
-rw-r--r-- 1 root root    9 lug  1 16:12 data/EASample2.db/neostore.relationshiptypestore.db.id
-rw-r--r-- 1 root root    9 lug  1 16:12 data/EASample2.db/neostore.relationshiptypestore.db.names.id

Using

Neo4j 1.9.7
Java 1.7.0_55
CPU Intel Core i7-2630QM 2 GHz / Turbo Boost up to 2.9 GHz
4 GB DDR3 RAM
Ubuntu 14.04 LTS

neo4j.properties

# Default values for the low-level graph engine
#neostore.nodestore.db.mapped_memory=25M
#neostore.relationshipstore.db.mapped_memory=50M
#neostore.propertystore.db.mapped_memory=90M
#neostore.propertystore.db.strings.mapped_memory=130M
#neostore.propertystore.db.arrays.mapped_memory=130M

# added by me
use_memory_mapped_buffers=true

# Enable this to be able to upgrade a store from 1.4 -> 1.5 or 1.4 -> 1.6
#allow_store_upgrade=true

# Enable this to specify a parser other than the default one. 1.5, 1.6, 1.7 are available
#cypher_parser_version=1.6

# Keep logical logs, helps debugging but uses more disk space, enabled for
# legacy reasons To limit space needed to store historical logs use values such
# as: "7 days" or "100M size" instead of "true"
keep_logical_logs=true

# Autoindexing

# Enable auto-indexing for nodes, default is false
#node_auto_indexing=true

# The node property keys to be auto-indexed, if enabled
#node_keys_indexable=name,age

# Enable auto-indexing for relationships, default is false
#relationship_auto_indexing=true

# The relationship property keys to be auto-indexed, if enabled
#relationship_keys_indexable=name,age

neo4j-wrapper.conf

wrapper.java.additional=-Dorg.neo4j.server.properties=conf/neo4j-server.properties
wrapper.java.additional=-Djava.util.logging.config.file=conf/logging.properties
wrapper.java.additional=-Dlog4j.configuration=file:conf/log4j.properties

#********************************************************************
# JVM Parameters
#********************************************************************

wrapper.java.additional=-XX:+UseConcMarkSweepGC
wrapper.java.additional=-XX:+CMSClassUnloadingEnabled

# Uncomment the following lines to enable garbage collection logging
#wrapper.java.additional=-Xloggc:data/log/neo4j-gc.log
#wrapper.java.additional=-XX:+PrintGCDetails
#wrapper.java.additional=-XX:+PrintGCDateStamps
#wrapper.java.additional=-XX:+PrintGCApplicationStoppedTime
#wrapper.java.additional=-XX:+PrintPromotionFailure
#wrapper.java.additional=-XX:+PrintTenuringDistribution

# Uncomment the following lines to enable JVM startup diagnostics
#wrapper.java.additional=-XX:+PrintFlagsFinal
#wrapper.java.additional=-XX:+PrintFlagsInitial

# Java Heap Size: by default the Java heap size is dynamically
# calculated based on available system resources.
# Uncomment these lines to set specific initial and maximum
# heap size in MB.
wrapper.java.initmemory=128
wrapper.java.maxmemory=512

#********************************************************************
# Wrapper settings
#********************************************************************
# path is relative to the bin dir
wrapper.pidfile=../data/neo4j-server.pid

#********************************************************************
# Wrapper Windows NT/2000/XP Service Properties
#********************************************************************
# WARNING - Do not modify any of these properties when an application
#  using this configuration file has been installed as a service.
#  Please uninstall the service before modifying this section.  The
#  service can then be reinstalled.

# Name of the service
wrapper.name=neo4j

# User account to be used for linux installs. Will default to current
# user if not set.
wrapper.user=

As you can see, I left the default configuration for the memory mapping (since my graph seems to fit within it) and I changed the heap to 128M - 512M (which, according to the manual, should work). I noticed that just one CPU core is fully used during the query, but from what I understand that's normal, since a traversal can only run on one core at a time. Also, with jvisualvm I noticed that the process never reaches full heap usage (in the hour and more that I left it running). GC doesn't seem to be an issue, since it stays around 1.4-1.6% activity. CPU usage is around 20%.
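As a quick sanity check of the "my graph seems to fit" claim, here is a minimal Python sketch that compares the store file sizes from the listing above with the values shown as defaults in the commented-out lines of neo4j.properties. The dictionary keys and the helper name `fits_in_mapping` are made up for this illustration; the numbers are only those quoted in this post.

```python
# Store file sizes in MB, read off the "Store details" listing above.
store_mb = {
    "nodestore.db": 175 / 1024,             # listed as 175K
    "relationshipstore.db": 3.0,
    "propertystore.db": 9.0,
    "propertystore.db.strings": 14.0,
    "propertystore.db.arrays": 340 / 1024,  # listed as 340K
}

# mapped_memory values in MB, as shown in the commented-out
# "Default values" lines of neo4j.properties.
mapped_mb = {
    "nodestore.db": 25,
    "relationshipstore.db": 50,
    "propertystore.db": 90,
    "propertystore.db.strings": 130,
    "propertystore.db.arrays": 130,
}


def fits_in_mapping(store, mapped):
    """Return, per store file, whether it fits in its mapped-memory budget."""
    return {name: size <= mapped[name] for name, size in store.items()}


fits = fits_in_mapping(store_mb, mapped_mb)
total_store_mb = sum(store_mb.values())
total_mapped_mb = sum(mapped_mb.values())

for name, ok in fits.items():
    print(f"{name}: {store_mb[name]:.1f} MB of {mapped_mb[name]} MB mapped, fits={ok}")
print(f"total: {total_store_mb:.1f} MB of {total_mapped_mb} MB mapped")
```

With these numbers every store file is well below its mapping budget (roughly 26.5 MB of store against 425 MB of mapping in total), which suggests the memory mapping is unlikely to be the bottleneck.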

I get the same problem if I run the query in Cypher:

start bus1=node(9)
match bus1-->app1-->db<--app2<--bus2
where bus1 <> bus2
return db.value, bus2.value

Is it a hardware problem (my laptop isn't the best out there, I know), so that I should try a machine with more RAM and also increase the heap (I can raise it to 1 GB on my laptop before it starts swapping)? Is there more tuning to do beyond this? Or is it maybe a query problem?

EDIT

After writing this, I tried adding a distinct clause to the Cypher query and executed it via the shell (reading this answer, I thought I could be having a similar problem). The query finishes after 83118 ms with 374K+ results. So I think it isn't a database tuning issue; rather, the SPARQL query is not well written.

The edited Cypher query:

start bus1=node(9)
match bus1-->app1-->db<--app2<--bus2
where bus1 <> bus2
return distinct db.value, bus2.value

result (extract):

| "http://neo4j.org#Database214"  | "http://neo4j.org#BusinessProcess87"  |
| "http://neo4j.org#Database214"  | "http://neo4j.org#BusinessProcess37"  |
| "http://neo4j.org#Database214"  | "http://neo4j.org#BusinessProcess118" |
| "http://neo4j.org#Database214"  | "http://neo4j.org#BusinessProcess79"  |
| "http://neo4j.org#Database214"  | "http://neo4j.org#BusinessProcess39"  |
| "http://neo4j.org#Database214"  | "http://neo4j.org#BusinessProcess63"  |
| "http://neo4j.org#Database214"  | "http://neo4j.org#BusinessProcess112" |
| "http://neo4j.org#Database214"  | "http://neo4j.org#BusinessProcess82"  |
| "http://neo4j.org#Database214"  | "http://neo4j.org#BusinessProcess89"  |
| "http://neo4j.org#Database214"  | "http://neo4j.org#BusinessProcess40"  |
| "http://neo4j.org#Database214"  | "http://neo4j.org#BusinessProcess60"  |
+-------------------------------------------------------------------------+
374501 rows

13251 ms
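To illustrate why deduplicating matters for this query shape, here is a minimal Python sketch on the toy graph from the diagram at the top. It enumerates every bus1 -> app1 -> db <- app2 <- bus2 combination as a naive join, with no uniqueness rules, so the counts are not what Neo4j or the SPARQL Plugin would report exactly. BUS1 is a hypothetical extra business process added only so that something shares a database with BUS2; the function name `shared_db_rows` is also made up.

```python
# Toy graph from the diagram; BUS1 is a hypothetical extra process
# that uses the same two applications as BUS2.
uses_app = {
    "BUS2": ["APP4", "APP5", "APP6"],
    "BUS1": ["APP5", "APP6"],
}
uses_db = {
    "APP4": ["DB10", "DB9"],
    "APP5": ["DB11"],
    "APP6": ["DB11"],
}


def shared_db_rows(start):
    """Enumerate (db, bus2) rows for start -> app1 -> db <- app2 <- bus2."""
    rows = []
    for app1 in uses_app[start]:
        for db in uses_db[app1]:
            # Walk back from the database to every application using it...
            for app2, dbs in uses_db.items():
                if db not in dbs:
                    continue
                # ...and from there to every other business process.
                for bus2, apps in uses_app.items():
                    if app2 in apps and bus2 != start:
                        rows.append((db, bus2))
    return rows


rows = shared_db_rows("BUS2")
distinct_rows = set(rows)
print(len(rows), "rows,", len(distinct_rows), "distinct (db, bus) pairs")
```

Even in this tiny graph the same (db, bus) pair is produced once per combination of intermediate applications, so the number of raw rows grows with the product of the fan-outs along the path while the number of distinct answers stays small.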

Solution

  • It turned out to be a SPARQL query issue. Changing it to:

    PREFIX bus:<http://neo4j.org#BusinessProcess> 
    SELECT ?bus ?db 
    WHERE{ bus:9 ?p1 ?app. 
           ?app ?p2 ?db. 
           ?db ?p2 ?app2. 
           ?app2 ?p1 ?bus. 
           FILTER(?bus != bus:2) }
    GROUP BY ?bus ?db
    

    With this change the query returns an answer in ~50 s on the first run after startup, with 374k+ results. Removing the db from the select and group by clauses, since knowing which database was shared wasn't essential in my case, was a second change that made the query perform better and return an answer in ~5 s.