Indexing plain text files in Solr

I am having trouble finding a well-structured manual, or any clear information, on how to index plain text files (.txt) in Solr.

I understand how to work with the standard Solr data formats, such as .xml or .json, but so far I have not found a single structured, complete manual for plain text indexing (especially for files that contain no ids, only words and spaces).

I would appreciate any sources that can help with this problem, or code examples that show how to do it.

Solution

You should still be able to use the extract endpoint (which uses Apache Tika in the background). You can provide field values through the query string, as seen in this example for the techproducts data set:

/solr/techproducts/update/extract?literal.id=doc1&commit=true

The literal.id=doc1 parameter supplies an explicit value for a field (here, id) that can't be extracted from the submitted file.

Make sure to set the Content-Type header to text/plain when you're submitting the file (unless you're submitting it as a regular HTML form upload).
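
As a rough sketch of the request described above, here is one way it could be sent from Python using only the standard library. The helper names (build_extract_request, index_text_file), the base URL http://localhost:8983 and the idea of deriving an id from a hash of the file contents are assumptions made for illustration; only the /update/extract path, the literal.id and commit parameters, and the text/plain Content-Type come from the answer itself.

```python
import hashlib
import urllib.parse
import urllib.request


def build_extract_request(base_url, core, text_bytes, doc_id=None, commit=True):
    """Build a POST request for the /update/extract handler of a Solr core.

    Hypothetical helper: the name and signature are not part of Solr.
    """
    # The file has no id of its own, so one must be supplied. Hashing the
    # content is just one possible choice (an assumption); any unique
    # string works as the literal.id value.
    if doc_id is None:
        doc_id = hashlib.sha1(text_bytes).hexdigest()

    params = {"literal.id": doc_id, "commit": "true" if commit else "false"}
    query = urllib.parse.urlencode(params)
    url = f"{base_url}/solr/{core}/update/extract?{query}"

    # The raw file bytes are the request body; text/plain tells Solr (and
    # Tika) how to interpret them.
    return urllib.request.Request(
        url,
        data=text_bytes,
        headers={"Content-Type": "text/plain"},
        method="POST",
    )


def index_text_file(path, base_url="http://localhost:8983", core="techproducts"):
    """Read a .txt file and submit it; returns the HTTP status code."""
    with open(path, "rb") as f:
        request = build_extract_request(base_url, core, f.read())
    with urllib.request.urlopen(request) as response:
        return response.status
```

Calling index_text_file("notes.txt") against a running Solr instance with the techproducts example loaded should submit the file as a single document; notes.txt is a placeholder file name. Which field the extracted text ends up in depends on the core's configuration, so check the schema if the content does not appear where you expect it.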