
Stuck with neo4j with python


Hi, I will try to stay on track, but I've done a lot of research and now I'm just lost. I could really use some expertise here. Below is the situation:

Preface

This is a follow-up to my question here. The issue there was that my Cypher queries were taking at least 1 second to return a response. Even a query like RETURN 123 took 1 second. That led me to conclude that the Neo4j Bolt driver for Python is slower than a plain HTTP call to Neo4j.

I can back this up with reports from GitHub issues and from Stack Overflow.

The problem statement

Each time my code runs, it generates up to 10 Cypher queries. All of them have to be fired, and then operations need to be performed based on the results.

The issue is that with Bolt each query takes 1 second to execute, and with HTTP I am stuck: now that I'm no longer on Bolt, I want to use query parameters to make the queries faster. Each HTTP call currently takes 30 ms; multiply that by 10 (since I have 10 queries) and you have a very poorly performing Python API for fetching user relations.

Where am I stuck

  1. I need confirmation that the Bolt driver really is slow and that I am not doing anything wrong, since all the posts I've seen are about a year old.
  2. My query has OR and AND conditions. How can I write those using parameters in Neo4j REST calls?
  3. Is there some other graph database I should look at?
  4. Is there any way I can fire up to 10 queries and get a total response time below 200 ms?

Other reasons to think I am missing something:

  1. Legend has it that Neo4j is the most popular graph database. How is that possible with drivers like these?
  2. There is over a year of reported issues with the Bolt drivers, and they still haven't been fixed.

Sample Request

curl -X POST \
  http://localhost:7474/db/data/cypher \
  -H 'Authorization: Basic bmVvNGo6Y29kZQ==' \
  -H 'Cache-Control: no-cache' \
  -H 'Content-Type: application/json' \
  -d '{
  "query" : "MATCH (ct:city)-[:CHILD_OF]->(st:state) WHERE (st.name_wr = {st}) AND (ct.name_wr= {ct}) RETURN st, ct",
  "params": 
  {
    "st" : "california",
    "ct" : "san francisco"
  }
}'

But what if I want to add a clause saying that st can be either California OR Alaska, AND ct must be San Francisco? How do I do that with parameters in REST?

EDIT:

I replicated the script and below is the verdict:

58 transactions, tps 0.97 maxdelay 1.08

The curl sample request is the one that I fire from Postman. The code that I am using can be found in the linked question (in the preface).


Solution

  • EDIT

    To be honest, the issue was with the host I was using: I was connecting to localhost, and resolving localhost was taking time. As soon as I switched to 127.0.0.1, it started working perfectly fine.

    Marking this as the answer, as it helped me benchmark the two approaches, which led to the discovery of the host-resolution issue.
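
To check whether name resolution is the slow step, the lookup can be timed on its own, separately from any Neo4j call. The following is a minimal sketch, assuming Python 3; the helper name `time_resolution` and the default port 7687 (the Bolt port used in the answer's script) are illustrative choices, not part of the original code.

```python
import socket
import time


def time_resolution(host, port=7687, repeats=5):
    """Return the slowest observed getaddrinfo() time, in seconds, for host.

    Hypothetical helper: it only measures the address lookup, not the
    connection or the query itself.
    """
    worst = 0.0
    for _ in range(repeats):
        start = time.perf_counter()
        socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)
        worst = max(worst, time.perf_counter() - start)
    return worst
```

Calling `time_resolution("localhost")` and `time_resolution("127.0.0.1")` and comparing the two numbers should show whether the lookup alone accounts for a delay on the order of a second.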


    I think there must be something wrong with your setup. I've been using the Python Bolt driver for a while now, and for simple queries I don't think I've ever seen a 1 second delay. I don't know what your code looks like, or what your network delay is, but I wrote a quick example to look at the delays I see in my local network (which has very low latency). (Using Neo4j 3.2.9 and Python driver 1.5.3.)

    #!/usr/bin/python
    from __future__ import print_function
    import sys
    import time
    from neo4j.v1 import GraphDatabase, basic_auth
    
    ip = '10.10.10.10'
    runtime = 60.0
    
    querystr = 'RETURN 123'
    runstart = time.time()
    maxdelay = 0
    cnt = 0
    #driver = GraphDatabase.driver("bolt+routing://%s:7687" % ip,
    driver = GraphDatabase.driver("bolt://%s:7687" % ip,
                                  auth=basic_auth("neo4j", "password"))
    while time.time() - runstart < runtime:
        start = time.time()
        session = driver.session(access_mode='READ')
        ret = session.run(querystr)
        # Read the records before closing the session that produced them.
        result = ret.data()
        session.close()
        cnt += 1
        delay = time.time() - start
        if delay > maxdelay:
            maxdelay = delay
        if delay > 0.1:
            print('Large delay seen cnt %s delay %0.2f' % (cnt, delay))
    print('%d transactions, tps %0.2f maxdelay %0.2f' % (cnt, cnt/runtime, maxdelay))
    

    I get the output:

    117360 transactions, tps 1956.00 maxdelay 0.06

    This means the average read took about half a millisecond, and the max was 60ms.
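
The arithmetic behind that figure can be checked directly. This is a small sketch that assumes both runs used the script's `runtime` of 60 seconds and that each loop iteration is one sequential query, so the mean time per query is roughly the runtime divided by the count; `mean_latency_ms` is a name introduced here for illustration.

```python
def mean_latency_ms(cnt, runtime_s=60.0):
    """Mean time per query in milliseconds for a sequential benchmark loop."""
    return runtime_s / cnt * 1000.0


# Counts taken from the two outputs quoted in this thread.
fast_run = mean_latency_ms(117360)  # the answer's run
slow_run = mean_latency_ms(58)      # the question's run

print("fast run: %.2f ms per query" % fast_run)
print("slow run: %.0f ms per query" % slow_run)
```

Under those assumptions, the 58-transaction run from the question works out to roughly one second per query, which matches the delay described there.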

    I would look at network latency and issues with resources on both your client and server side.
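
On the OR/AND part of the question, one possible approach is to pass the alternative states as a list parameter and test membership with Cypher's `IN` operator, which behaves like a chain of OR'd equality checks. The sketch below only builds the JSON body for the same endpoint and `{param}` placeholder style as the question's curl request; the function name `build_city_query` and the parameter name `states` are made up for illustration, and newer Neo4j versions use `$param` placeholders instead.

```python
import json


def build_city_query(states, city):
    """Build a request body matching the question's curl payload shape.

    Hypothetical helper: `states` is a list, so the WHERE clause reads
    "state is any of these AND city equals this one".
    """
    query = (
        "MATCH (ct:city)-[:CHILD_OF]->(st:state) "
        "WHERE st.name_wr IN {states} AND ct.name_wr = {ct} "
        "RETURN st, ct"
    )
    return {"query": query, "params": {"states": list(states), "ct": city}}


# JSON string to send as the POST body (for example with curl -d).
body = json.dumps(build_city_query(["california", "alaska"], "san francisco"))
```

Keeping the values in `params` rather than formatting them into the query string means the query text stays the same across calls, which is the point of using parameters in the first place.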