
MongoDB full text index characters that are NOT stop characters (tokenization delimiters)


Suppose I want to have a "text index" on a text field as follows for partial and advanced searches:

"supertext": "a111=Salvador a111=Sal a111=Salv a111=Salva a111=Salvad a111=Salvado a113=Hernandez a113=Her a113=Hern a113=Herna a113=Hernan a113=Hernand"

It seems that the equal sign is one of the tokenization delimiters (stop characters) for the parser. The MongoDB documentation refers to the Unicode properties Dash, Hyphen, Pattern_Syntax, Quotation_Mark, Terminal_Punctuation, and White_Space in the Unicode 8.0 Character Database Prop List: https://www.unicode.org/Public/8.0.0/ucd/PropList.txt

What I'd like to know is the reverse. What special characters can I use that are NOT tokenization delimiters?

I want to find "a111=Salvador" in the text field. Right now, searching for "a111=Salvador" and just "Salvador" return the same or similar scores.

For example, what else can I use when I store the data, such as:

a111#Salvador
a111@Salvador
a111`Salvador

Seems like someone might have experience with this, rather than me spending hours searching that Unicode page for a character that is not there.

Or do I need a longer series of alpha characters, or no characters?

a111valueSalvador
a111Salvador

Solution

  • In current master (https://github.com/mongodb/mongo/blob/eb2b72cf9c0269f086223d499ac9be8a270d268c/src/mongo/db/fts/unicode/gen_delimiter_list.py#L27), the delimiter properties are:

    delim_properties = [
        "White_Space", "Dash", "Hyphen", "Quotation_Mark", "Terminal_Punctuation", "Pattern_Syntax",
        "STerm"
    ]
    

    which leaves you plenty of other symbols to choose from. Try middle dots, for example:

    00B7          ; Other_ID_Continue # Po       MIDDLE DOT
    0387          ; Other_ID_Continue # Po       GREEK ANO TELEIA
    

    Tested with U+00B7: a111·Salvador does the job and looks neat.

    In Python terms:

    separator = '\u00B7'
    sample = "a111" + separator + "Salvador"
    print(sample)
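
To check other candidate separators without reading PropList.txt by hand, the lookup can be scripted. The following is a minimal sketch, not MongoDB's actual tokenizer: it assumes the delimiter set is exactly the property list quoted above, and it runs against a small hand-copied excerpt in PropList.txt format (`SAMPLE_PROPLIST`, illustrative only). The helper names `parse_proplist` and `is_delimiter` are hypothetical. For a real answer, feed it the full Unicode 8.0 PropList.txt.

```python
import re

# Delimiter properties as quoted from gen_delimiter_list.py above.
DELIM_PROPERTIES = {
    "White_Space", "Dash", "Hyphen", "Quotation_Mark",
    "Terminal_Punctuation", "Pattern_Syntax", "STerm",
}

# Assumption: a hand-copied excerpt in PropList.txt format, for illustration.
# Replace it with the contents of the real file for an authoritative check.
SAMPLE_PROPLIST = """\
0020          ; White_Space # Zs       SPACE
0021..0023    ; Pattern_Syntax # Po   [3] EXCLAMATION MARK..NUMBER SIGN
003C..003E    ; Pattern_Syntax # Sm   [3] LESS-THAN SIGN..GREATER-THAN SIGN
003F..0040    ; Pattern_Syntax # Po   [2] QUESTION MARK..COMMERCIAL AT
0060          ; Pattern_Syntax # Sk       GRAVE ACCENT
00B7          ; Other_ID_Continue # Po       MIDDLE DOT
"""

# Matches "XXXX ; Property" and "XXXX..YYYY ; Property" data lines.
LINE_RE = re.compile(r"^([0-9A-Fa-f]+)(?:\.\.([0-9A-Fa-f]+))?\s*;\s*(\w+)")


def parse_proplist(text):
    """Map each property name to the set of code points listed for it."""
    props = {}
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if not match:
            continue  # comment or blank line
        start = int(match.group(1), 16)
        end = int(match.group(2) or match.group(1), 16)
        props.setdefault(match.group(3), set()).update(range(start, end + 1))
    return props


def is_delimiter(char, props):
    """True if the character has any of the delimiter properties."""
    return any(ord(char) in props.get(name, ()) for name in DELIM_PROPERTIES)


props = parse_proplist(SAMPLE_PROPLIST)
for candidate in "=#@`\u00B7":
    print(f"U+{ord(candidate):04X}", is_delimiter(candidate, props))
```

With this excerpt, `=`, `#`, `@` and the backtick are all reported as delimiters because they fall under Pattern_Syntax, while U+00B7 is not. The result is only as complete as the data passed in, so a character missing from the excerpt would be reported as a non-delimiter even if the full file lists it.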