So I've come across an interesting issue. I'm trying to optimize my Solr indexes for Japanese text.
Essentially, the issue is that Solr isn't recognizing that words with and without the trailing long vowel mark are the same word. I don't know Japanese, but I'm working with someone who does, and they've informed me that when you search ビームエキスパンダー it returns results, as it should.
But if you search ビームエキスパンダ, which is the same word minus the long mark at the end, it doesn't return any results. All of our indexed content contains ビームエキスパンダー, but we essentially want Solr to return that content even when the search leaves the long mark off.
This is what our Japanese schema looks like for the fields in question.
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100" multiValued="true">
  <analyzer type="index">
    <tokenizer class="solr.JapaneseTokenizerFactory" mode="search" userDictionary="lang/userdict_ja.txt"/>
    <filter class="solr.SynonymGraphFilterFactory" expand="true" ignoreCase="true" synonyms="synonyms.txt" />
    <filter class="solr.JapaneseBaseFormFilterFactory"/>
    <filter class="solr.JapanesePartOfSpeechStopFilterFactory" tags="lang/stoptags_ja.txt"/>
    <filter class="solr.CJKWidthFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_ja.txt"/>
    <filter class="solr.JapaneseKatakanaStemFilterFactory" minimumLength="4"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.WordDelimiterGraphFilterFactory" preserveOriginal="1" catenateWords="1"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.JapaneseTokenizerFactory" mode="search" userDictionary="lang/userdict_ja.txt"/>
<filter class="solr.SynonymGraphFilterFactory" expand="true" ignoreCase="true" synonyms="synonyms.txt"/>
<filter class="solr.JapaneseBaseFormFilterFactory"/>
<filter class="solr.JapanesePartOfSpeechStopFilterFactory" tags="lang/stoptags_ja.txt"/>
<filter class="solr.CJKWidthFilterFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_ja.txt"/>
<filter class="solr.JapaneseKatakanaStemFilterFactory" minimumLength="4"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.WordDelimiterGraphFilterFactory" preserveOriginal="1" catenateWords="1"/>
</analyzer>
</fieldType>
When I search ビームエキスパンダ, without the long mark, this is how it's parsed.
"querystring":"ビームエキスパンダ",
"parsedquery":"+DisjunctionMaxQuery((((((+CategoryName_txt:ビーム +CategoryName_txt:エキス +CategoryName_txt:パンダ) CategoryName_txt:ビームエキスパンダ))~1)))",
"parsedquery_toString":"+(((((+CategoryName_txt:ビーム +CategoryName_txt:エキス +CategoryName_txt:パンダ) CategoryName_txt:ビームエキスパンダ))~1))",
When I search ビームエキスパンダー with the long mark at the end, this is how it's parsed.
"querystring":"ビームエキスパンダー",
"parsedquery":"+DisjunctionMaxQuery((((CategoryName_txt:ビーム CategoryName_txt:エキスパンダ)~2)))",
"parsedquery_toString":"+(((CategoryName_txt:ビーム CategoryName_txt:エキスパンダ)~2))",
Any help with this would be greatly appreciated.
-Paul
UPDATE: Upon request, I've attached screenshots from Solr's Analysis screen for these terms.
It appears that the term in question, Beam Expander, is analyzed correctly when it has the long mark: it comes out as "beam" + "expander", which is perfect. Without the long mark, though, it's being analyzed as three separate words:
ビーム, which is "beam", is correct. But the "expander" part is being broken into the terms エキス and パンダ, which according to Google Translate mean "extract" and "panda".
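For anyone who wants to double-check this without the Admin UI, the same token-by-token output can be pulled from Solr's field analysis handler. This is just a sketch; swap in your own host and collection name, and depending on your shell you may need to URL-encode the Japanese text:
curl "http://localhost:8983/solr/mycollection/analysis/field?analysis.fieldtype=text_general&analysis.fieldvalue=ビームエキスパンダー&analysis.query=ビームエキスパンダ&wt=json"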
I figured out this issue. I'm no Japanese expert, but from what I can tell, one interesting thing about Japanese is that it doesn't use spaces to mark where words end. Written out in Japanese, "BeamExpander" and "BeamExtractPanda" are essentially the same string, and Solr is just doing its best to determine where to break up the words.
That's where user dictionaries come in. This file for me is located in the default location, lang/userdict_ja.txt.
I added the line below:
ビームエキスパンダ,ビーム エキスパンダ,ビーム エキスパンダ, Beam Expander
I may be wrong about this, but from what I can tell the first column is the word as it appears in the text (the surface form that was being segmented wrong), the second is the same word with spaces marking where the tokenizer should split it, the third is the reading for each of those tokens (which for katakana looks the same as the second column), and the last column is a part-of-speech label.
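For reference, the sample entries that ship in the default lang/userdict_ja.txt follow the same pattern (I'm quoting the format roughly from memory, so check your own copy of the file):
# <text>,<token 1> ... <token n>,<reading 1> ... <reading n>,<part-of-speech tag>
日本経済新聞,日本 経済 新聞,ニホン ケイザイ シンブン,カスタム名詞
関西国際空港,関西 国際 空港,カンサイ コクサイ クウコウ,カスタム名詞
The tokens and readings are separated by single half-width spaces, and the number of readings has to match the number of tokens.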
I believe cases like this are unusual, so I'm fine with this as a fix. I'd rather keep the JapaneseTokenizerFactory and add edge cases to the user dictionary than switch to the StandardTokenizerFactory and be less optimized.
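One more note, as far as I can tell: the user dictionary is only read when the analyzer is loaded, so after editing it you'll need to reload the core (and reindex existing content if you want the index-side segmentation to change too). Something along these lines, with the core name swapped for your own:
curl "http://localhost:8983/solr/admin/cores?action=RELOAD&core=mycore"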
Thank you all for your help.
-Paul