Searching Solr index for concatenated words

I'm struggling with two similar use cases.

Here's an example document from my index:

{
        "id":"E850AC8D844010AFA76203B390DD3135",
        "brand_txt_en":"Tom Ford",
        "catch_all":["Tom Ford",
          "FT 5163",
          "Tom Ford",
          "FT 5163",
          "DARK HAVANA"],
        "model_txt_en":"FT 5163",
        "brand_txt_en_split":"Tom Ford",
        "model_txt_en_split":"FT 5163",
        "color_txt_en":"DARK HAVANA",
        "material_s":"acetato",
        "gender_s":"uomo",
        "shape_s":"Wayfarer",
        "lens_s":"cerchiata",
        "modelkey_s":"86_1_FT 5163",
        "sales_i":0,
        "brand_s":"Tom Ford",
        "model_s":"FT 5163",
        "color_s":"DARK HAVANA",
        "_version_":1569456572504997895
}

Query: brand_txt_en_split:tomford

No results!

Field type is Solr's default one:

<fieldType name="text_en_splitting" class="solr.TextField" autoGeneratePhraseQueries="true" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.StopFilterFactory" words="lang/stopwords_en.txt" ignoreCase="true"/>
      <filter class="solr.WordDelimiterFilterFactory" catenateNumbers="1" generateNumberParts="1" splitOnCaseChange="1" generateWordParts="1" catenateAll="0" catenateWords="1"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
      <filter class="solr.PorterStemFilterFactory"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.SynonymFilterFactory" expand="true" ignoreCase="true" synonyms="synonyms.txt"/>
      <filter class="solr.StopFilterFactory" words="lang/stopwords_en.txt" ignoreCase="true"/>
      <filter class="solr.WordDelimiterFilterFactory" catenateNumbers="0" generateNumberParts="1" splitOnCaseChange="1" generateWordParts="1" catenateAll="0" catenateWords="0"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
      <filter class="solr.PorterStemFilterFactory"/>
    </analyzer>
  </fieldType>

I expected WordDelimiterFilterFactory to generate a "tomford" token by concatenating the words, but that doesn't seem to happen.

The 'inverse' use case is:

{ 
   ...  "model_txt_en_split": "The Clubmaster", ...
}

I want that document to be found by this query: club master

I guess I should use the EdgeNGram filter for the latter case, but I really can't figure out how to do that.

Thanks for your help

Solution

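WordDelimiterFilterFactory has the catenateWords and catenateAll options, but they only join parts that come from a single token. Your WhitespaceTokenizerFactory has already split "Tom Ford" into the two tokens "Tom" and "Ford" before the filter runs, so there is nothing left to concatenate. The options work on input like this: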

catenateWords: (integer, default 0) If non-zero, maximal runs of word parts will be joined: "hot-spot-sensor's" -> "hotspotsensor"

catenateAll: (0/1, default 0) If non-zero, runs of word and number parts will be joined: "Zap-Master-9000" -> "ZapMaster9000"
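One way to let catenateWords see both words, sketched here as an assumption rather than a tested config, is to keep the whole field value as a single token with KeywordTokenizerFactory, so that the space becomes a delimiter inside the token. The type name text_en_concat is made up for illustration.

```xml
<!-- Hypothetical field type; the name "text_en_concat" is an assumption. -->
<fieldType name="text_en_concat" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Emit the whole value, e.g. "Tom Ford", as one token -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <!-- Split on non-alphanumeric characters and also emit the joined run of word parts -->
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" catenateWords="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

With this chain "Tom Ford" should be indexed as "tom", "ford" and "tomford". The Analysis screen in the Solr admin UI is the quickest way to confirm that on your version.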

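To remove the space between the words, try the filter below. It only has an effect if the tokenizer leaves the whitespace inside the token (for example KeywordTokenizerFactory); after WhitespaceTokenizerFactory there is no whitespace left to remove.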

<filter class="solr.PatternReplaceFilterFactory" pattern="(\s+)" replacement="" replace="all" />
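As a minimal sketch of where that filter would sit, assuming a separate field type that you would apply to a copy of the brand field: the whole value is kept as one token, the whitespace is stripped, and the result is lowercased. The type name text_nospace is hypothetical.

```xml
<!-- Hypothetical field type; the name "text_nospace" is an assumption. -->
<fieldType name="text_nospace" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Keep "Tom Ford" as a single token so the space is still present -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <!-- Remove every run of whitespace: "Tom Ford" becomes "TomFord" -->
    <filter class="solr.PatternReplaceFilterFactory" pattern="(\s+)" replacement="" replace="all"/>
    <!-- Lowercase so that the query "tomford" matches -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

A field of this type holds only the concatenated form, so a query for "tom" alone would not match it. Keeping the existing field for word-level search and adding this one next to it is probably the safer design.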

After adding or updating the field type in schema.xml, restart the server and re-index the data.

You can also try the analyzer below in the fieldType for your field.

<analyzer>
  <tokenizer class="solr.KeywordTokenizerFactory"/>
  <filter class="solr.EdgeNGramFilterFactory" minGramSize="4" maxGramSize="25"/>
</analyzer>

Input String: "John Oliver W Clane"

Tokenizer to Filter: "John Oliver W Clane"

Output Tokens :

"John", "John ", "John O", "John Ol", "John Oli", "John Oliv", "John Olive", "John Oliver", "John Oliver ", "John Oliver W", "John Oliver W ", "John Oliver W C", "John Oliver W Cl", "John Oliver W Cla", "John Oliver W Clan", "John Oliver W Clane".

Another filter you can try is NGramFilterFactory, which generates grams starting at every position in the token rather than only at the beginning:

<filter class="solr.NGramFilterFactory" minGramSize="4" maxGramSize="25"/>
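For the "club master" case, a rough sketch under stated assumptions is to apply the n-gram filter at index time only, so that the indexed grams of "clubmaster" include "club" and "master", while query terms are left whole. The type name text_ngram is hypothetical, and the gram sizes are simply the ones used above.

```xml
<!-- Hypothetical field type; the name "text_ngram" is an assumption. -->
<fieldType name="text_ngram" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- "clubmaster" yields all substrings of 4 to 10 characters, including "club" and "master" -->
    <filter class="solr.NGramFilterFactory" minGramSize="4" maxGramSize="25"/>
  </analyzer>
  <analyzer type="query">
    <!-- No n-grams here: each query word is matched as a whole against the indexed grams -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Two trade-offs to keep in mind: query words shorter than minGramSize will not match anything in such a field, and n-grams can increase the index size considerably.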

You can read more about analyzers and filters in the Solr documentation: Solr Analyzers and Filters