Suppose I want to have a "text index" on a text field as follows for partial and advanced searches:
"supertext": "a111=Salvador a111=Sal a111=Salv a111=Salva a111=Salvad a111=Salvado a113=Hernandez a113=Her a113=Hern a113=Herna a113=Hernan a113=Hernand"
It seems that the equal sign is one of the tokenization delimiters (stop characters) for the parser. The MongoDB documentation refers to the Unicode character properties Dash, Hyphen, Pattern_Syntax, Quotation_Mark, Terminal_Punctuation, and White_Space, as defined in the Unicode 8.0 Character Database property list: https://www.unicode.org/Public/8.0.0/ucd/PropList.txt
What I'd like to know is the reverse. What special characters can I use that are NOT tokenization delimiters?
I want to find "a111=Salvador" in the text field. Right now, searching for "a111=Salvador" and searching for just "Salvador" return the same or similar scores.
For example, what else can I use when I store the data, such as:
a111#Salvador
a111@Salvador
a111`Salvador
Perhaps someone already has experience with this, which would save me hours of searching that Unicode page for a character that is not there.
Or do I need a longer series of alpha characters, or no characters?
a111valueSalvador
a111Salvador
In the current master (https://github.com/mongodb/mongo/blob/eb2b72cf9c0269f086223d499ac9be8a270d268c/src/mongo/db/fts/unicode/gen_delimiter_list.py#L27), the delimiter properties are:
delim_properties = [
"White_Space", "Dash", "Hyphen", "Quotation_Mark", "Terminal_Punctuation", "Pattern_Syntax",
"STerm"
]
This leaves you plenty of other symbols to choose from. Try middle dots, for example:
00B7 ; Other_ID_Continue # Po MIDDLE DOT
0387 ; Other_ID_Continue # Po GREEK ANO TELEIA
Tested with U+00B7: a111·Salvador does the job and looks neat.
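To avoid reading the property list by eye, the check can be scripted. The following is a minimal sketch, assuming the delimiter set is exactly the union of the properties listed above. The names `load_delimiters`, `is_delimiter`, and `SAMPLE_PROPLIST` are hypothetical, and the embedded excerpt is abbreviated for illustration, so real use should read the full PropList.txt instead. It shows that `=`, `#`, `@`, and the backtick all fall under Pattern_Syntax, while the middle dot does not appear under any delimiter property in this excerpt.

```python
import re

# Properties treated as delimiters, copied from the delim_properties list above.
DELIM_PROPERTIES = {
    "White_Space", "Dash", "Hyphen", "Quotation_Mark",
    "Terminal_Punctuation", "Pattern_Syntax", "STerm",
}

# Abbreviated, illustrative excerpt in PropList.txt format (an assumption for
# this sketch); replace it with the contents of the full file for real checks.
SAMPLE_PROPLIST = """\
# comment lines start with a hash and are skipped
0020          ; White_Space # Zs       SPACE
0021..0023    ; Pattern_Syntax # Po   [3] EXCLAMATION MARK..NUMBER SIGN
003C..003E    ; Pattern_Syntax # Sm   [3] LESS-THAN SIGN..GREATER-THAN SIGN
003F..0040    ; Pattern_Syntax # Po   [2] QUESTION MARK..COMMERCIAL AT
0060          ; Pattern_Syntax # Sk       GRAVE ACCENT
00B7          ; Other_ID_Continue # Po       MIDDLE DOT
"""

# Matches "XXXX ; Property" and "XXXX..YYYY ; Property" data lines.
LINE_RE = re.compile(r"^([0-9A-Fa-f]{4,6})(?:\.\.([0-9A-Fa-f]{4,6}))?\s*;\s*(\w+)")


def load_delimiters(proplist_text, properties=DELIM_PROPERTIES):
    """Collect every code point whose property is in `properties`."""
    delimiters = set()
    for line in proplist_text.splitlines():
        match = LINE_RE.match(line)
        if not match:
            continue  # blank line or comment
        start, end, prop = match.groups()
        if prop in properties:
            first = int(start, 16)
            last = int(end, 16) if end else first
            delimiters.update(range(first, last + 1))
    return delimiters


def is_delimiter(char, delimiters):
    """Return True if the single character `char` is in the delimiter set."""
    return ord(char) in delimiters


delimiters = load_delimiters(SAMPLE_PROPLIST)
for candidate in "=#@`\u00B7":
    print(f"U+{ord(candidate):04X} {candidate!r}: delimiter={is_delimiter(candidate, delimiters)}")
```

A result from the excerpt is only as complete as the excerpt itself, so any candidate separator should be re-checked against the full file and then confirmed with an actual query.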
In Python terms:
separator = '\u00B7'
sample = "a111" + separator + "Salvador"
print(sample)
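Extending that idea, here is a rough sketch of how the whole "supertext" value from the question might be generated with the middle dot as separator. `build_tokens` and `text_query` are hypothetical helper names, and the prefix range (3 characters up to one less than the full length) is an assumption inferred from the "Salvador" sample. The filter uses the standard `$text` / `$search` shape and is only built here as a plain dict; no database call is made.

```python
SEPARATOR = "\u00B7"  # MIDDLE DOT, the separator tested above


def build_tokens(field_code, value, separator=SEPARATOR, min_prefix=3):
    """Return the full value followed by its prefixes, each tagged with the field code.

    Assumption: prefixes run from `min_prefix` characters up to len(value) - 1.
    """
    prefixes = [value[:n] for n in range(min_prefix, len(value))]
    parts = [value] + prefixes
    return " ".join(f"{field_code}{separator}{part}" for part in parts)


def text_query(field_code, value, separator=SEPARATOR):
    """Build a $text filter document for one tagged term (hypothetical helper)."""
    return {"$text": {"$search": f"{field_code}{separator}{value}"}}


supertext = " ".join([
    build_tokens("a111", "Salvador"),
    build_tokens("a113", "Hernandez"),
])
print(supertext)
print(text_query("a111", "Salvador"))
```

The resulting dict could be passed as the filter to a `find` call on a collection that has a text index on the field; whether scores then separate as desired is worth verifying against real data.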