Tags: indexing, solr, apache-tika, cloudera-manager

Index rich documents using SolrCell and Tika


I am a newbie to Solr search and am currently working to get Solr Cell to work with Tika. Consider the following text file:

Name:                    Popeye
Nationality:             American

I would like Solr to return two fields named 'name' and 'nationality' with the values popeye and american. To do this, I define two fields in my schema.xml file as

   <field name="name" type="text_general" indexed="true" stored="true"/>
   <field name="nationality" type="text_general" indexed="true" stored="true"/>

The text_general field type is defined as

<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
    <!-- in this example, we will only use synonyms at query time
                 <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
    -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

In the solrconfig.xml file, I define the /update/extract request handler as

<requestHandler name="/update/extract" class="solr.extraction.ExtractingRequestHandler" >
    <lst name="defaults">
        <str name="lowernames">true</str>
        <str name="uprefix">attr_</str>
        <str name="captureAttr">true</str>
    </lst>
</requestHandler>

Finally, I run the following command to index the document:

curl 'http://localhost:8983/solr/popeye_bio_collection_shard1_replica1/update/extract?literal.id=doc1&commit=true' -F "myfile=@/tmp/popeye_bio.txt"

The document gets indexed without error. When I run the following query

curl 'http://localhost:8983/solr/popeye_bio_collection_shard1_replica1/select?q=*%3A*&wt=json&indent=true'

I get this output:

{
  "responseHeader":{
    "status":0,
    "QTime":3,
    "params":{
      "indent":"true",
      "q":"*:*",
      "wt":"json"}},
  "response":{"numFound":1,"start":0,"docs":[
      {
        "attr_meta":["stream_source_info",
          "myfile",
          "stream_content_type",
          "text/plain",
          "stream_size",
          "206",
          "Content-Encoding",
          "windows-1252",
          "stream_name",
          "popeye_bio.txt",
          "Content-Type",
          "text/plain; charset=windows-1252"],
        "id":"doc1",
        "attr_stream_source_info":["myfile"],
        "attr_stream_content_type":["text/plain"],
        "attr_stream_size":["206"],
        "attr_content_encoding":["windows-1252"],
        "attr_stream_name":["popeye_bio.txt"],
        "attr_content_type":["text/plain; charset=windows-1252"],
        "attr_content":[" \n \n  \n  \n  \n  \n  \n  \n  \n \n  Name:                    Popeye\r\nNationality:             American\r\n \n  "],
        "_version_":1567726521681969152}]
  }}

As you can see, popeye and american are not indexed into the fields I defined in schema.xml. What am I doing wrong here? I have tried changing the tokenizer in the text_general field type to <tokenizer class="solr.PatternTokenizerFactory" pattern=": "/>, but it makes no difference. I would appreciate any help in this regard!


Solution

  • When you define a tokenizer, you're only telling Solr that all the data sent to that field should be tokenized/processed with your configuration; in the end, you're still sending all of your information into a single field.

    Solr assumes that your data is structured (one document that has fields), so an analyzer/tokenizer can't create more fields. The job of an analyzer/tokenizer is simply to tokenize and transform the text that goes into the inverted index for searching.

    What you can do is use the ScriptUpdateProcessor and define an update request processor chain that does your modifications (splitting one field into several) before the text gets to the tokenizer. Something like:

    <updateRequestProcessorChain name="splitFields">
        <processor class="solr.StatelessScriptUpdateProcessorFactory">
            <str name="script">splitField.js</str>
        </processor>
        <processor class="solr.RunUpdateProcessorFactory" />
    </updateRequestProcessorChain>
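
    The chain only runs if the update request invokes it. One way to do that (assuming the chain keeps the splitFields name used above) is to pass update.chain when indexing:

    # assumes the chain is named splitFields, as in the snippet above
    curl 'http://localhost:8983/solr/popeye_bio_collection_shard1_replica1/update/extract?literal.id=doc1&commit=true&update.chain=splitFields' -F "myfile=@/tmp/popeye_bio.txt"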
    

    And the splitField.js file could contain something like:

    function processAdd(cmd) {
        var doc = cmd.solrDoc;  // org.apache.solr.common.SolrInputDocument
        var content = doc.getFieldValue("attr_content");

        if (content != null) {
            // pull the values out of the "Name: ..." and "Nationality: ..." lines
            var match = /Name:\s*(\S+)[\s\S]*?Nationality:\s*(\S+)/.exec(String(content));
            if (match) {
                doc.setField("name", match[1]);
                doc.setField("nationality", match[2]);
            }
        }
    }
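
    After editing solrconfig.xml, reload the collection so the new chain and script are picked up, then re-index. A sketch using the standard Collections API (assuming the collection is named popeye_bio_collection):

    # reload so the chain and splitField.js are picked up
    curl 'http://localhost:8983/solr/admin/collections?action=RELOAD&name=popeye_bio_collection'

    Re-running the extract command and the *:* query should then show the name and nationality fields on the document.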
    

    In an ideal world this would be handled outside of Solr, but with the ScriptUpdateProcessor you can accomplish what you want.
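
    For completeness, a minimal sketch of that "outside of Solr" route: parse the file yourself and post structured JSON straight to the regular /update handler instead of /update/extract (the field values below are just the ones from the example file):

    # bypass Tika entirely and send already-structured fields
    curl 'http://localhost:8983/solr/popeye_bio_collection_shard1_replica1/update?commit=true' \
         -H 'Content-Type: application/json' \
         -d '[{"id":"doc1","name":"Popeye","nationality":"American"}]'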