Tags: pdf, post, solr, apache-tika

How to print the actual content of a PDF that matches the search query in Solr 7.6.0


I am using Solr 7.6.0 (schemaless mode). I indexed a few PDF documents using the post utility jar that ships with Solr. When I run a query, the details of the files containing the query string are shown correctly, but I cannot see any field holding the actual content of the file. The request handler in my solrconfig.xml is as follows:

<requestHandler name="/update/extract" startup="lazy" class="solr.extraction.ExtractingRequestHandler" >
    <lst name="defaults">
      <str name="uprefix">ignored_</str>
      <str name="fmap.a">ignored_</str>
      <str name="fmap.div">ignored_</str>
      <str name="fmap.content">text</str>
      <str name="captureAttr">true</str>
      <str name="lowernames">true</str>
      <bool name="ignoreTikaException">true</bool>
    </lst>
</requestHandler>

When I posted the PDF files for indexing, the auto-generated managed-schema.xml file did not contain any "Content" field. Also, when I run a query, only the metadata of the file (id, date, title, content-types, stream-size, author, etc.) is shown, not the actual content with the matches highlighted. Please clarify. The query I used is "http://localhost:8983/solr/TestCore6/select?hl=on&q=mars&wt=json"


Solution

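  • Here is the solution that fixed my issue:

    The "text" field in the schema is defined with stored="false" by default, so the extracted content is indexed for searching but is never returned or highlighted in the results. Set stored="true" on this field and re-index the PDF files; the content then shows up in the query response.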

    Reference Link: Solr query in a pdf file, is not returning highlighting content
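
    As a rough sketch of that change, the field definition in the schema might look like the fragment below. The field type (text_general) and the indexed and multiValued attributes are assumptions for illustration and may differ in your core; the only change the solution calls for is stored="true".

```xml
<!-- Target of fmap.content in the /update/extract handler above. -->
<!-- type, indexed and multiValued are assumed values; keep whatever your schema already uses. -->
<!-- stored="true" is the actual fix: it lets Solr return and highlight the extracted text. -->
<field name="text" type="text_general" indexed="true" stored="true" multiValued="true"/>
```

    Documents indexed before the change keep no stored copy of the content, so they have to be posted again. Adding hl.fl=text to the query, for example "http://localhost:8983/solr/TestCore6/select?hl=on&hl.fl=text&q=mars&wt=json", is one way to ask for highlighted snippets from this field specifically.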