Search code examples
mongodbelasticsearchquery-optimizationdatabase-performance

Query nested array performance, Mongo vs ElasticSearch


I have a music app that has a job to find music recommendations based on a tag id.

There are two entities involved:

  • Song - a song record contains its name and a list of music tag ids (genres) this song belongs to
  • MusicTag - the music tag itself, includes id, name etc.

Data is currently stored in MongoDB.

The Songs collections in mongo have millions of songs, and each song has an average of 7 tag ids. The MusicTags has about 30K records.

The Songs collection looks like that:

[
  {
    name: "Metallica - one",
    tags: [
      "6018703624d8a5e8efa1b76e", // Rock
      "601861cc8cef62ba86765017", // Heavy metal
      "5fda07ac8db0615c1c503a46" // Hard Rock
    ]
  },
  {
    name: "Metallica - unforgiven",
    tags: [
      "6018703624d8a5e8efa1b76e", // Rock
      "5fda07ac8db0615c1c503a46", // Metal
    ]
  },
  {
    name: "Lady Gaga - Bad Romance",
    tags: [
      "5fc7b9f95e38e17282896b64", // Pop
      "5fc729be5e38e17282844eff", // Dance
    ]
  }
]

Given the tag "6018703624d8a5e8efa1b76e" (Rock), I want to query the Songs collection and find all songs that have Rock tag in their tags array.

In Mongo this is the query i'm doing:

db.songs.find({ tags: { $in: [ObjectId("6018703624d8a5e8efa1b76e")] }});

The performance of it is very bad (between 10 to 40 seconds and getting worst as long as the collection grows), I tried to index Mongo in various ways (the table contains more data that involve in the search, such as score and duration, but it's not relevant for now) but my queries are still take too long, I can't explain it (and I read a lot of official and unofficial stuff) but I have a feeling that holding the data in this nested form makes the index worthless and somehow still make a full scan on the table each time - but I can't prove it (the Mongo "explain" not really explained me something :) )

I'm thinking of using ElasticSearch for it, sync all songs data, and query it instead of the Mongo that will stay as the data SSOT and other lightweight ops.

But then the question remains open and I want to make sure: is in Elastic I can hold the data in that form (nested array inside song) or I need to represent it differently (e.g. flat it so every record will be song_tag index etc?

Thanks.


Solution

  • Elasticsearch doesn't offer a dedicated array type so what you'd typically do is define the mapping based on the type of the individual array items -- in your case a keyword:

    PUT songs
    {
      "mappings": {
        "properties": {
          "tags": {
            "type": "keyword"
          }
        }
      }
    }
    

    Then you'd index the docs:

    POST songs/_doc
    {
      "name": "Metallica - one",
      "tags": [
        "6018703624d8a5e8efa1b76e",
        "601861cc8cef62ba86765017",
        "5fda07ac8db0615c1c503a46"
      ]
    }
    

    and query the tags:

    POST songs/_search
    {
      "query": {
        "bool": {
          "must": [
            { ... other queries },
            {
              "terms": {
                "tags": [
                  "6018703624d8a5e8efa1b76e"     // one or more
                ]
              }
            }
          ]
        }
      }
    }
    

    The tags are unique keywords but are not human-readable so you'd need to keep the map of them vs. the actual genres somewhere. Since the genres are probably set once and rarely, if ever, updated, you could use nested fields too. But your tags would then become an array of key-value pairs:

    POST songs/_doc
    {
      "name": "Metallica - one",
      "tags": [
        {
          "tag": "6018703624d8a5e8efa1b76e",
          "genre": "Rock"
        }
        ...
      ]
    }
    

    The mapping would be slightly different and so would be the queries but now you wouldn't need the translation map, plus you could query or aggregate by human-readable values -- tags.genre.