I have a graph that I'm trying to filter based on the timestamp of the access edge:
I can run a query that works well on small amounts of data to get all the access edges by timestamp, and then map out A, B, C, D, E, F. I'm running into two problems that I'm not sure how to solve:
1- Scale (query 1 on gremlify): there can be many thousands of access edges, so running the query returns, say, 10000 edges along with their in/out vertices. Even though the set of distinct in/out vertices is very small in that case, I can't manage to dedup them without losing some data on C, D, E, F.
g.E().hasLabel('access').
  has('timestamp', between(3, 5)).as('access').
  outV().hasLabel('B').as('b').
  inE('belongs').as('belongs').
  outV().as('c').
  inE('forms').as('forms').
  outV().hasLabel('D', 'E', 'F').as('element').
  select('access').
  inV().as('a').
  select('a', 'access', 'b', 'belongs', 'c', 'forms', 'element').
    by(valueMap(true))
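One way to attack the scale problem (a sketch only, not tested against this data: step labels and edge directions are taken from the query above) is to dedup each level after the timestamp filter, so each of the few distinct vertices comes back once with its children folded under it, instead of once per access edge:

```groovy
// Sketch: dedup each neighbourhood once instead of returning it per
// edge.  Each distinct B is emitted a single time, with its distinct
// C vertices (and their D/E/F elements) folded underneath it.
g.E().hasLabel('access').
  has('timestamp', between(3, 5)).
  outV().hasLabel('B').
  dedup().
  project('b', 'cs').
    by(valueMap(true)).
    by(inE('belongs').outV().dedup().
         project('c', 'elements').
           by(valueMap(true)).
           by(inE('forms').outV().
                hasLabel('D', 'E', 'F').dedup().
                valueMap(true).
                fold()).
         fold())
```

The trade-off is that the access edges themselves are no longer in the result; per-edge data (e.g. timestamps) would need to be collected separately rather than folded into this shape.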
2- Aggregate (query 2 on gremlify): the access edge has a property "outcome" which can take multiple values. I've tried aggregating the outcomes between A and C (there can be a lot of B's, but I don't need them most of the time), but when I do, it seems to aggregate over the whole graph rather than just between those two vertices:
g.E().hasLabel('access').
  has('timestamp', between(3, 5)).as('access').
  outV().hasLabel('B').as('b').
  inE('belongs').as('belongs').
  outV().as('c').
  inE('forms').as('forms').
  outV().hasLabel('D', 'E', 'F').as('element').
  select('access').
  inV().as('a').
  project('a', 'belongs', 'c', 'forms', 'element', 'outcome').
    by(coalesce(select('a').label(), constant('default'))).
    by(coalesce(select('belongs').label(), constant('default'))).
    by(coalesce(select('c').label(), constant('default'))).
    by(coalesce(select('forms').label(), constant('default'))).
    by(coalesce(select('element').label(), constant('default'))).
    by(select('access').
       groupCount().
         by(select('access').values('outcome')))
In this case it always returns the "nok" value for every edge, and the count is always 1 (essentially one per edge, not an aggregate over the edges between the two vertices):
{
"nok": 1
}
Sandbox: https://gremlify.com/1bos0bj1h03i/4
I'm pretty sure I'm missing something in my understanding of TinkerPop; any pointers would be great!
What was happening (I think; this is based on observation and limited knowledge of the Gremlin engine) is that each step would look at every edge in the graph (20k in my test env), so the goal was to reduce the starting set as much as possible.
Another thing I tried was running project() inside group(), but it seems that it still processes all the edges until you get out of the group. I'm not sure exactly how that works, but running the projections after the groups reduced the 20k traversals to 75.
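For what it's worth, this behaviour would also explain the always-1 counts in query 2: each by() modulator runs as a local child traversal on a single traverser, so select('access').groupCount() can only ever count the one edge carried by the current row. A parent-level group() sees all the matching edges at once. A minimal sketch of the difference (assuming, as in the queries above, that inV() of an access edge is the A vertex):

```groovy
// Parent-level group(): the groupCount() now runs over the whole
// stream of matching access edges, keyed by their A vertex, instead
// of over a single row inside a project().by().
g.E().hasLabel('access').
  has('timestamp', between(3, 5)).
  group().
    by(inV()).                 // key: the A vertex of each edge
    by(groupCount().
         by(values('outcome')))  // outcome tallies across all its edges
```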
What I ended up doing is:
filter(has('timestamp', between(1, 2))).
group().
  by(select('a')).
  by(group().
       by(outV().in('belongs')).
       by(groupCount().
            by(values('outcome'))))
Doing this significantly reduced the number of edges being traversed.
From there I made my projections etc... to format my response.
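The reshaping step can look something like the following (a sketch, not the exact projection from the playground: it pulls keys/values off each map entry after unfold(), and uses inV() directly where the fragment above relied on a previously labelled 'a'):

```groovy
// Sketch: aggregate first, then reshape the much smaller result
// (~75 traversers instead of 20k in the test env described above).
g.E().hasLabel('access').
  filter(has('timestamp', between(1, 2))).
  group().
    by(inV()).                      // the A vertex
    by(group().
         by(outV().in('belongs')).  // the C vertex
         by(groupCount().
              by(values('outcome')))).
  unfold().
  project('a', 'outcomesByC').
    by(select(keys).id()).
    by(select(values))
```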
That moved the query time from:
>TOTAL - - 91957.436 -
To:
>TOTAL - - 960.994 -
Playground with working code: https://gremlify.com/8pifxw2uws5/1