mongodb uuid stemming

In MongoDB how is text stemmed if my text contains a UUID?


I'm investigating slow updates on a collection in MongoDB.
My former colleagues chose the string type for the _id field and based indexes on other string fields.

Now I understand text indexes are stemmed, and I can imagine this can be quite heavy when a document is updated.
Also, the content of the _id field is a UUID. Now I don't fully understand how stemming works, but I'm guessing each part of the UUID (part1-part2-part3-part4-etc) becomes a unique entry in the index, causing queries to be slow.

Can someone explain how stemming would work on text which contains a UUID?


Solution

  • Stemming only applies to string fields that are part of a text index. The options for the default _id index cannot be changed and an _id index cannot be a text index, so stemming is not applicable in this context. The _id value is a single entry in the index which must be unique.
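
    To check whether stemming is in play at all, one option is to list the collection's indexes and look for text indexes. The following is a minimal sketch, assuming the usual getIndexes() output in which a text index reports a key of { _fts: "text", _ftsx: 1 }. The helper name findTextIndexNames and the sample index names are hypothetical, made up for illustration.

```javascript
// Hypothetical helper: given the array returned by db.collection.getIndexes(),
// return the names of the indexes that are text indexes.
function findTextIndexNames(indexSpecs) {
  return indexSpecs
    // A text index reports its key as { _fts: "text", _ftsx: 1 }.
    .filter((spec) => spec.key && spec.key._fts === "text")
    .map((spec) => spec.name);
}

// Sample data shaped like getIndexes() output; all names here are invented.
const sampleIndexes = [
  { v: 2, key: { _id: 1 }, name: "_id_" },
  { v: 2, key: { customerRef: 1 }, name: "customerRef_1" },
  {
    v: 2,
    key: { _fts: "text", _ftsx: 1 },
    name: "name_text_description_text",
    weights: { description: 1, name: 1 },
  },
];

console.log(findTextIndexNames(sampleIndexes));
// → [ 'name_text_description_text' ]
```

    In the shell this could be called as findTextIndexNames(db.stores.getIndexes()). If it returns an empty array, none of the collection's indexes involve stemming, and the slow updates have another cause.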

    Now I don't fully understand how stemming works, but I'm guessing each part of the UUID (part1-part2-part3-part4-etc) becomes a unique entry in the index, causing queries to be slow.

    Stemming uses language-specific heuristics to reduce words to their expected root forms. Stemming libraries have a notion of typical inflection rules for languages but do not have any understanding of valid words or grammar. It generally doesn't make sense to include a UUID field (or other random non-language string) in a text index definition.

    MongoDB text indexes use the open source Snowball library for stemming.

    Can someone explain how stemming would work on text which contains a UUID?

    The best approach is to run explain() on a MongoDB $text query to see exactly how it is parsed. However, there's also an online Snowball demo which can be useful if you want to quickly try stemming algorithms for different languages.

    A MongoDB text index or $text query will treat whitespace and most punctuation characters (including hyphens) as word delimiters, so part1-part2-part3-part4-etc would be split into 5 terms. Each term will be stemmed and any duplicate terms will be ignored. Terms made up of random letters or values like part1 won't have a root form outside of accidental matching by the stemming heuristics.

    For example, in English:

    • Words ending in a single s are generally plural. If you make up a random word like part4s, it will stem to part4.
    • Words ending in ss are generally not plural, so part4ss will be left unchanged.

    You can see how a phrase would be stemmed by explaining a text search query and looking at the terms for the parsedTextQuery.

    Using the mongo shell:

    > db.stores.createIndex( { name: "text", description: "text" } )
    > db.stores.find( { $text: { $search: "part1-part2-part3-part4-etc-part4s-part4ss" } } ).
           explain().queryPlanner.winningPlan.parsedTextQuery
    {
        "terms" : [
            "etc",
            "part1",
            "part2",
            "part3",
            "part4",
            "part4ss"
        ],
        "negatedTerms" : [ ],
        "phrases" : [ ],
        "negatedPhrases" : [ ]
    }
    

    I added part4s and part4ss to your example UUID. Since part4s stems to part4 (which is already in the list of terms), you'll notice the parsed query contains 6 terms instead of 7.