
How to store document content in Solr 6.4?


I'm trying to index documents using the Windows version of the post tool, with a command like the one below:

java -Dc=docs -Dauto=yes -Ddata=files -Drecursive=yes -jar post.jar C:\docs

I can see that the documents are indexed correctly, but I want to store the extracted text so that I can use highlighting. I added fields like these to my managed-schema:

<field name="text" type="text_general" multiValued="true" indexed="true" stored="true"/>
<field name="source" type="text_general" multiValued="true" indexed="true" stored="true"/>
<field name="content" type="text_general" multiValued="true" indexed="true" stored="true"/>
<field name="content" type="strings"/>

but it doesn't work, and my search results do not return the content of the documents. How can I store the text extracted from doc, docx, and pdf files and return it in my query results?


Solution

  • bin/post (I am not sure about post.jar, but I believe it behaves the same way) tells you what type it determined each file to be and which handler it submits the file to.

    For example, MS Word, PDF, and similar files all go to the /extract handler, which uses Tika to extract the content.

    If you then look at the definition of the /extract handler in solrconfig.xml, you will see the parameters that control how the extracted content is mapped, including the names of the target fields. Make those fields stored, then reindex.
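
    As a rough sketch of what this can look like: in the sample configsets shipped with Solr 6.x the handler is usually registered as /update/extract, and its fmap.content default typically points at a catch-all field that is indexed but not stored, which would explain why no text comes back. The fragment below assumes you want the extracted body in a single stored field named content (the field name from the question); your own solrconfig.xml may differ, so check it rather than copying this verbatim.

    ```xml
    <!-- solrconfig.xml: extract handler (assumed to resemble the stock definition) -->
    <requestHandler name="/update/extract"
                    startup="lazy"
                    class="solr.extraction.ExtractingRequestHandler">
      <lst name="defaults">
        <str name="lowernames">true</str>
        <!-- Assumption: route Tika's extracted body text into "content" -->
        <str name="fmap.content">content</str>
      </lst>
    </requestHandler>

    <!-- managed-schema: define "content" exactly once, as a stored field -->
    <field name="content" type="text_general" multiValued="true" indexed="true" stored="true"/>
    ```

    Note that the schema in the question declares content twice (once as text_general and once as strings); it is probably worth keeping only one definition. After changing the mapping or the schema, reload the core and reindex the files, then request highlighting on that field with hl=true and hl.fl=content.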