Tags: python-3.x, neo4j, cypher, gremlin, gremlin-server

Gremlin paging for big dataset query


I am using Gremlin Server with a large data set, and I am paginating my query results. Here is a sample of the query:

query = """g.V().both().both().count()"""
data = execute_query(query)
for x in range(0,int(data[0]/10000)+1):
    print(x*10000, " - ",(x+1)*10000)
    query = """g.V().both().both().range({0}*10000, {1}*10000)""".format(x,x+1)
    data = execute_query(query)

def execute_query(query):
    """query execution"""

The above query works, but for pagination I have to know the range at which to stop executing the query. To get that range I first have to fetch the count of the query and pass it to the for loop. Is there any other way to do pagination in Gremlin?

-- Pagination is required because fetching ~100k results in a single execution fails, e.g. g.V().both().both()

If we don't use pagination, I get the following error:

ERROR:tornado.application:Uncaught exception, closing connection.
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/tornado/iostream.py", line 554, in wrapper
    return callback(*args)
  File "/usr/local/lib/python3.5/dist-packages/tornado/stack_context.py", line 343, in wrapped
    raise_exc_info(exc)
  File "<string>", line 3, in raise_exc_info
  File "/usr/local/lib/python3.5/dist-packages/tornado/stack_context.py", line 314, in wrapped
    ret = fn(*args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/tornado/websocket.py", line 807, in _on_frame_data
    self._receive_frame()
  File "/usr/local/lib/python3.5/dist-packages/tornado/websocket.py", line 697, in _receive_frame
    self.stream.read_bytes(2, self._on_frame_start)
  File "/usr/local/lib/python3.5/dist-packages/tornado/iostream.py", line 312, in read_bytes
    assert isinstance(num_bytes, numbers.Integral)
  File "/usr/lib/python3.5/abc.py", line 182, in __instancecheck__
    if subclass in cls._abc_cache:
  File "/usr/lib/python3.5/_weakrefset.py", line 75, in __contains__
    return wr in self.data
RecursionError: maximum recursion depth exceeded in comparison
ERROR:tornado.application:Exception in callback functools.partial(<function wrap.<locals>.null_wrapper at 0x7f3e1c409ae8>)
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/tornado/ioloop.py", line 604, in _run_callback
    ret = callback()
  File "/usr/local/lib/python3.5/dist-packages/tornado/stack_context.py", line 275, in null_wrapper
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/tornado/iostream.py", line 554, in wrapper
    return callback(*args)
  File "/usr/local/lib/python3.5/dist-packages/tornado/stack_context.py", line 343, in wrapped
    raise_exc_info(exc)
  File "<string>", line 3, in raise_exc_info
  File "/usr/local/lib/python3.5/dist-packages/tornado/stack_context.py", line 314, in wrapped
    ret = fn(*args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/tornado/websocket.py", line 807, in _on_frame_data
    self._receive_frame()
  File "/usr/local/lib/python3.5/dist-packages/tornado/websocket.py", line 697, in _receive_frame
    self.stream.read_bytes(2, self._on_frame_start)
  File "/usr/local/lib/python3.5/dist-packages/tornado/iostream.py", line 312, in read_bytes
    assert isinstance(num_bytes, numbers.Integral)
  File "/usr/lib/python3.5/abc.py", line 182, in __instancecheck__
    if subclass in cls._abc_cache:
  File "/usr/lib/python3.5/_weakrefset.py", line 75, in __contains__
    return wr in self.data
RecursionError: maximum recursion depth exceeded in comparison
Traceback (most recent call last):
  File "/home/rgupta/Documents/BitBucket/ecodrone/ecodrone/test2.py", line 59, in <module>
    data = execute_query(query)
  File "/home/rgupta/Documents/BitBucket/ecodrone/ecodrone/test2.py", line 53, in execute_query
    results = future_results.result()
  File "/usr/lib/python3.5/concurrent/futures/_base.py", line 405, in result
    return self.__get_result()
  File "/usr/lib/python3.5/concurrent/futures/_base.py", line 357, in __get_result
    raise self._exception
  File "/usr/local/lib/python3.5/dist-packages/gremlin_python/driver/resultset.py", line 81, in cb
    f.result()
  File "/usr/lib/python3.5/concurrent/futures/_base.py", line 398, in result
    return self.__get_result()
  File "/usr/lib/python3.5/concurrent/futures/_base.py", line 357, in __get_result
    raise self._exception
  File "/usr/lib/python3.5/concurrent/futures/thread.py", line 55, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/usr/local/lib/python3.5/dist-packages/gremlin_python/driver/connection.py", line 77, in _receive
    self._protocol.data_received(data, self._results)
  File "/usr/local/lib/python3.5/dist-packages/gremlin_python/driver/protocol.py", line 100, in data_received
    self.data_received(data, results_dict)
  File "/usr/local/lib/python3.5/dist-packages/gremlin_python/driver/protocol.py", line 100, in data_received
    self.data_received(data, results_dict)
  File "/usr/local/lib/python3.5/dist-packages/gremlin_python/driver/protocol.py", line 100, in data_received
    self.data_received(data, results_dict)
  File "/usr/local/lib/python3.5/dist-packages/gremlin_python/driver/protocol.py", line 100, in data_received

This frame repeats about 100 times: File "/usr/local/lib/python3.5/dist-packages/gremlin_python/driver/protocol.py", line 100, in data_received


Solution

  • This question is largely answered here, but I'll add some more commentary.

    Your approach to pagination is really expensive, as I'm not aware of any graphs that will optimize that particular traversal, and you're basically iterating all of that data many times over. You do it once for the count(), then you iterate the first 10000; for the second 10000, you iterate the first 10000 followed by the second 10000; on the third 10000, you iterate the first 20000 followed by the third 10000, and so on (the sketch below tallies that cost).
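    As a rough back-of-the-envelope check (my numbers, not from the question), here is what that re-iteration adds up to for 100,000 results fetched in pages of 10,000, assuming the graph has to iterate from the start to reach each range() offset:

        # Rough cost model: no index-backed skipping, so reaching the offset of
        # each page means re-iterating everything that came before it.
        N, PAGE = 100000, 10000
        count_pass = N  # the initial count() traversal touches everything once
        paging_passes = sum((k + 1) * PAGE for k in range(N // PAGE))  # 10k + 20k + ... + 100k
        print(count_pass + paging_passes)  # 650000 elements iterated to retrieve 100000 results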

    I'm not sure if there is more to your logic, but what you have looks like a form of "batching" to get smaller bunches of results. There isn't much need to do it that way, as Gremlin Server already does that for you internally. Were you to just send g.V().both().both(), Gremlin Server would batch up the results according to the resultIterationBatchSize configuration option (see the sketch below).
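    For instance, with gremlin-python you can hand the whole traversal to the server and let it stream the batches back. A minimal sketch, assuming a Gremlin Server at ws://localhost:8182/gremlin and the default traversal source alias g:

        from gremlin_python.driver import client

        # The server URL and the 'g' alias are assumptions; adjust for your deployment.
        gremlin_client = client.Client('ws://localhost:8182/gremlin', 'g')
        result_set = gremlin_client.submit('g.V().both().both()')

        # all() returns a future; result() blocks until the server has streamed
        # every batch (sized by resultIterationBatchSize, default 64) to the client.
        results = result_set.all().result()
        print(len(results))

        gremlin_client.close()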

    Anyway, I'm not aware of a better way to make paging work beyond what was explained in the other question that I mentioned.