Database design for time and geolocation data

I have successfully implemented the Time Series Use case like shown in the documentation. The data(Event class) pointed by the smallest time unit is indexed with a lucene spatial index.

I have two types of events : private or public.

How should I design my database and clusters for my use case ?

Would it be better to use an Edge or a linklist to make the link from Min to Event ?

I am worried that my lucene spatial index will get too big in the future.

By reading at the documentation, it looks like having clusters for the geolocation data would be a great strategy.

It is possible to use the index only on the subquery:

select 
from 
  (select 
     expand(month["12"].day["25"].hour["17"].min["07"].events)
   from 
     Year 
   where 
     year = 2015)
where
   [lat,lng,$spatial] 
NEAR 
   [66,66,{"maxDistance":1}]

The documentation on indexes tells me it is possible to use indexes on an edge properties. The bad side is that it takes more storage then linked list I tested it and it works :

select 
  expand(inV())
from 
  (select 
      expand(month["12"].day["25"].hour["17"].min["07"].outE('MinPublicEvent'))
   from 
     Year 
   where 
     year = 2015)
where
   [lat,lng,$spatial] 
NEAR 
   [66,66,{"maxDistance":1}]

Solution

In regard of edge vs link, taken from OrientDB doc: lightweight edges, the first difference is that edges can store properties and links don't.

These are the PROS and CONS of Lightweight Edges vs Regular Edges:

PROS:

faster in creation and traversing, because don't need an additional document to keep the relationships between 2 vertices

CONS:

cannot store properties harder working with Lightweight edges from SQL, because there is no a regular document under the edge

Since you already mentioned using properties on edges, which makes sense to me, as you can use these properties in the edges to transverse the graph, this means that you can't use a link to store that relationship.

In the case you want to embed these properties on the Event vertex, that is also fine, and you'd be able to use links, loosing the hability of using the properties in the edge to transverse the graph in favour of improved performance.

The edge approach is more expressive, but when performance really matters, and there is risk of a bottleneck, you should monitor the metrics and the performance, and refactor to the embed + link approach in case there is an issue with performance.

Update:

Clusters are basically a mechanism to split data in OrientDB (Clusters tutorial), which works for both, edge and vertex.

You may also find it beneficial to locate different clusters on different servers, physically separating where you store records in your database. The advantages of this include:

Optimization: Faster query execution against clusters, given that you need only search a subset of the clusters in a class.

Indexes: With good partitioning, you can reduce or remove the use of > indexes.

Parallel Queries: Queries can be run in parallel when made to data on multiple disks.

Sharding: You can shard large data-sets across multiple instances.

Unless you can identify clearly a good way to partition your data, and can distribute your database between different servers, i suggest you start with the default, as OrientDB already creates 1 cluster for each class in the schema, and add more clusters as your database grow.

When to add more clusters? Metrics, metrics and metrics. Keep track of how your application access your database, what kind of queries, amount of queries, etc.