I am new to Solr and have been working through the tutorial of 8.4.0. Having followed successfully the techproducts example using SolrCloud, I'm now trying to use a schemaless approach to index some PDF files. For that, I used the following, again from the tutorial, to index several files which are stored int the ~/Documents/pdf folder:
bin/solr create -c localpdf -s 2 - rf 2
bin/post -c localpdf ~/Documents/pdf
When executing the above, I get the following error:
SimplePostTool: WARNING: Response: <html>
<head>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8"/>
<title>Error 404 Not Found</title>
</head>
<body><h2>HTTP ERROR 404</h2>
<p>Problem accessing /solr/localpdf/update/extract. Reason:
<pre> Not Found</pre></p>
</body>
</html>
SimplePostTool: WARNING: IOException while reading response: java.io.FileNotFoundException: http://localhost:8983/solr/localpdf/update/extract?resource.name=%2Fhome%2Fuser%2FDocuments%2Fpdf%2Ftest234.pdf&literal.id=%2Fhome%2Fuser%2FDocuments%2Fpdf%2Ftest234.pdf
Running the same command with techproducts
, i.e. running:
bin/post -c techproducts ~/Documents/pdf
at least finds the files (it gives me some other errors related to PDFBox and some fonts, but that's another matter)
I can add other files, for instance XML to localpdf
from the example/exampledocs folder, but not the pdfs.
What am I missing here?
You must configure your core / collection to load the extracting request handler - otherwise it's not available. The techproducts core does this by default. Add the jars to the list of jars to load:
<lib dir="${solr.install.dir:../../..}/contrib/extraction/lib" regex=".*\.jar" />
<lib dir="${solr.install.dir:../../..}/dist/" regex="solr-cell-\d.*\.jar" />
And add the request handler definition (from the guide linked above):
<requestHandler name="/update/extract" class="org.apache.solr.handler.extraction.ExtractingRequestHandler">
<lst name="defaults">
<str name="fmap.Last-Modified">last_modified</str>
<str name="uprefix">ignored_</str>
</lst>
<!--Optional. Specify a path to a tika configuration file. See the Tika docs for details.-->
<str name="tika.config">/my/path/to/tika.config</str>
<!-- Optional. Specify one or more date formats to parse. See DateUtil.DEFAULT_DATE_FORMATS
for default date formats -->
<lst name="date.formats">
<str>yyyy-MM-dd</str>
</lst>
<!-- Optional. Specify an external file containing parser-specific properties.
This file is located in the same directory as solrconfig.xml by default.-->
<str name="parseContext.config">parseContext.xml</str>
</requestHandler>