performance cluster-computing mesos arangodb dcos

ArangoDB 3.0 cluster - zero improvement on read/write speeds?

I have an ArangoDB 3.0 cluster set-up through DC/OS 1.7, as shown here:

I tried two queries on this 3x co-ord, 6x server set-up. Each node has the following specs:

15GB RAM (4GB assigned per DB Primary via DC/OS)
8 cores
CoreOS

I tried two queries to test performance against the coins collection. No indices were added. The config of the collection is:

Wait for sync:  false
Type:   document
Status: loaded
Shards: 16
Replication factor: 1
Index buckets:  8

Write:

FOR i IN 1..1000000 INSERT {flip:RAND() > 0.5 ? 'h' : 't'} IN coins

Result:

Execution plan:

 Id   NodeType            Site          Est.   Comment
  1   SingletonNode       COOR             1   * ROOT
  2   CalculationNode     COOR             1     - LET #2 = 1 .. 1000000   /* range */   /* simple expression */
  3   EnumerateListNode   COOR       1000000     - FOR i IN #2   /* list iteration */
  4   CalculationNode     COOR       1000000       - LET #4 = { "flip" : ((RAND() > 0.5) ? "h" : "t") }   /* v8 expression */
  6   DistributeNode      COOR       1000000       - DISTRIBUTE
  7   RemoteNode          DBS        1000000       - REMOTE
  5   InsertNode          DBS              0       - INSERT #4 IN coins
  8   RemoteNode          COOR             0       - REMOTE
  9   GatherNode          COOR             0       - GATHER

Indexes used:
 none

Optimization rules applied:
 Id   RuleName
  1   remove-data-modification-out-variables
  2   distribute-in-cluster

Write query options:
 Option                   Value
 ignoreErrors             false
 waitForSync              false
 nullMeansRemove          false
 mergeObjects             true
 ignoreDocumentNotFound   false
 readCompleteInput        false

Read:

FOR coin IN coins COLLECT flip = coin.flip WITH COUNT INTO total RETURN {flip, total}

Result:

Execution plan:

 Id   NodeType                  Site          Est.   Comment
  1   SingletonNode             DBS              1   * ROOT
  2   EnumerateCollectionNode   DBS        1000000     - FOR coin IN coins   /* full collection scan */
  3   CalculationNode           DBS        1000000       - LET #3 = coin.`flip`   /* attribute expression */   /* collections used: coin : coins */
 10   RemoteNode                COOR       1000000       - REMOTE
 11   GatherNode                COOR       1000000       - GATHER
  4   CollectNode               COOR        800000       - COLLECT flip = #3 WITH COUNT INTO total   /* hash*/
  7   SortNode                  COOR        800000       - SORT flip ASC
  5   CalculationNode           COOR        800000       - LET #5 = { "flip" : flip, "total" : total }   /* simple expression */
  6   ReturnNode                COOR        800000       - RETURN #5

Indexes used:
 none

Optimization rules applied:
 Id   RuleName
  1   move-calculations-up
  2   move-calculations-down
  3   scatter-in-cluster
  4   distribute-filtercalc-to-cluster
  5   remove-unnecessary-remote-scatter

Then I scaled down to just 1x co-ordinator, and 1x server - reducing available RAM from 90GB / 48 cores, down to 15GB / 8 cores.

I expected write and read to show some difference. Here are the results of the same queries (after truncating the collection, and re-running):

Write:

Read:

Result - Almost identical execution times.

Questions:

Am I missing some kind of step re: explicit replication? (I tried 'rebalancing shards' - which caused some of the additional DB servers to be labeled as followers, but didn't make a difference to execution speed)
Is my collection config optimal? I opted for 16 shards based on a 'DBPrimary squared' recommendation in docs (my original set-up used 4x servers, and saw equivalent performance)
Are the queries I tried able to cluster effectively? Ranged loops, etc.
Are there sample queries I can try that will test whether the cluster is configured correctly, and should definitively prove read/write performance differences between 1x nodes vs. n nodes?

Solution

I think I can shed some light on these questions (being one of the core developers of ArangoDB and responsible for the distributed mode). The following comments consider ArangoDB Version 3.0.

A single AQL query in 3.0 and before uses only a single coordinator. Therefore deploying more coordinators does not speed up a single query, be it a writing or a reading query.

This is very hard to change because AQL organises a data pipeline across the cluster and it is difficult to involve more than one coordinator.

If you do writing queries, we currently still have an exclusive write lock on the involved collections in 3.0. Therefore, more shards or DBservers do not help to scale up the performance of AQL write queries. We will work on this restriction for 3.1 but this is not particularly easy either.

More DBservers and coordinators will speed up the throughput of single document reads and writes (when not using AQL), as has been shown in this blog post. Therefore, your write query can probably be performed much faster by using the standard document API with the new batch extensions.

For reading AQL queries, you will in general see a speedup if you use more servers, if the query can be parallelised across the shards, or if you measure the throughput of many such queries.

For your particular reading query with the aggregation, we are missing an AQL query optimizer rule with its accompanying infrastructure that can move the aggregations to the individual shards and then combine the results. This is planned but not yet implemented in 3.0, and therefore you do not see a speedup in your reading query. The explain output shows that the COLLECT with its SORT is executed on the coordinator and therefore all data has to be moved through the single coordinator.

As to your first question, replication will not help here either. If you set up synchronous replication, then in 3.0, all reads and writes go through a single server for each shard. Therefore, a higher replication factor does at this stage not increase your read performance.

We will be able to get around this limitation once we have proper cluster wide transactions, which is also planned but hasn't landed in 3.0.