Search code examples
arraysmongodbindexingmultikey

Optimizing for exact array matching over MongoDB collection


I have a collection where I only ever need to look up documents by whole arrays; I can’t think of any scenario where I’d want to look up documents by just one value of that array. Unfortunately, the multikey feature that is always activated for array values apparently can’t be deactivated.

In the documentation it says “The index will be used to lookup a subset of the values (currently the first one) and then the document will be inspected for the exact match.” I think this greatly decreases the performance in my case. Despite the index some lookups take 70 ms and some several minutes, because, depending on the first element, MongoDB sometimes has to search through a few thousand or several hundred thousand documents. At least that is my theory.

Is there some way to avoid this problem, or should I rather serialize my arrays and store them as strings?

Thanks in advance!


Solution

  • Perhaps you could use a subdocument like:

    {
      array_sub_doc: { arr: [1,2,3,4,5] }
    }
    

    So that you can do matches like:

    db.coll.ensureIndex({array_sub_doc:1});
    db.coll.find({array_sub_doc: {arr:[1,2,3,4,5]}})
    

    update I discovered what caused the failure for large arrays. Index keys > 800 bytes will not be indexed. So, if you have a large subdocument and you put an index on it, if it's greater than 800 bytes, and you try to search for it, you won't find it. If you take the index off and search again for the same subdocument, you'll find it (although it will be a full collection scan).

    This is documented here as a limitation, and will be removed in future releases: https://jira.mongodb.org/browse/SERVER-3372

    So, this will work in general for small arrays.

    Here's some test code in case anyone wants to try it out:

    var randomArray = function() {
      var len = 80;
      var randomarr = new Array();
      for (var i=0; i<len; i++) {
        randomarr.push(Math.floor(Math.random() *10000));
      }
      return randomarr;
    }
    
    var insert = function() {
      db.Test2.ensureIndex({array_sub_doc:1});
      for(var i=0;i<10000;i++) {
        db.Test2.save({array_sub_doc: {arr: randomArray()}});
      }
    }
    
    db.Test2.remove();
    insert();
    
    var one = db.Test2.findOne();
    db.Test2.findOne({array_sub_doc:one.array_sub_doc});
    
    //...
    
    db.Test2.find({array_sub_doc:one.array_sub_doc}).explain(0);
    /* outputs:
    {
      "cursor" : "BtreeCursor array_sub_doc_1",
      "nscanned" : 1,
      "nscannedObjects" : 1,
      ...
    */