
Solr tokenizer does not do anything


I want to tokenize the contents of one Solr string field, "content", into another field, "tokenized". For example:

{
  "content":"Hello World this is a Test",
  "tokenized":["hello", "world", "this", ...]
}

For that I use

<field name="content" type="string" indexed="true" stored="true"/>
<field name="tokenized" type="customType" indexed="true" stored="true"/>

<copyField source="content" dest="tokenized"/>

and the custom field type

<fieldType name="customType" class="solr.TextField">
   <analyzer>      
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
   </analyzer>
</fieldType>

My understanding was that upon committing, all contents are tokenized with the specified tokenizer and then put, as a list of tokens, into the tokenized field. However, the tokenized field only contains the original content as a single-element list, e.g.:

{
  "content":"Hello World this is a Test",
  "tokenized":["Hello World this is a Test"]
}

Is there some global configuration I need to make to get tokenizers to work?


Solution

  • Tokens are only stored internally in Lucene and Solr. They do not change the stored text that gets returned to you in any way. The text is stored verbatim: the text you sent in is what gets returned to you.

    The tokens generated in the background and stored in the index affect how you can search against the content you've stored and how it is processed; they do not affect the display value of the field (see the query example below).

    You can use the Analysis screen in Solr's admin UI to see exactly how text for a field gets processed into tokens before being stored in the index; the same information is available over HTTP, as sketched below.

    The reason for this is that you're usually interested in returning the actual text to the user; exposing the tokenized and processed values doesn't really make sense for a document that gets returned to a human.
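
    As a quick check of the search-time effect, a query against the analyzed field matches regardless of case, while the plain string field only matches its exact stored value. A minimal sketch, assuming the core is reachable at http://localhost:8983/solr/mycore (the core name is illustrative):

    curl "http://localhost:8983/solr/mycore/select?q=tokenized:hello"
    # matches: "Hello" was indexed as the lowercased token "hello"

    curl "http://localhost:8983/solr/mycore/select?q=content:hello"
    # no match: "content" is a string field, so only the exact value
    # "Hello World this is a Test" would match

    In both cases the stored values come back unchanged in the response; only the matching behaviour differs.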
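
    The token-by-token view shown on the Analysis screen can also be requested directly from the field analysis handler. A sketch, again assuming a core named mycore:

    curl "http://localhost:8983/solr/mycore/analysis/field?analysis.fieldtype=customType&analysis.fieldvalue=Hello+World+this+is+a+Test"
    # returns the token stream after each stage of the analyzer chain
    # (StandardTokenizer, then LowerCaseFilter): hello, world, this, ...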