As a best practice, I am trying to index a batch of documents to Solr in one request instead of indexing them one at a time. The problem is that the files are of different types (PDF, Word document, text file, ...) and therefore have different metadata extracted by Tika and indexed. I'd like certain fields to be present for all files regardless of type, such as creator, creation date, and path, but I don't know how to add fields manually when I index all the files at once. If I indexed one file at a time, I could add fields with request.setParam(), but that applies to the whole request, not to a single file. And even if something like this is possible, how would I get information such as the creator of a file in Java?
Is there a possibility to add fields for each file?
    if (listOfFiles != null) {
        for (File file : listOfFiles) {
            if (file.isFile()) {
                request.addFile(file, getContentType(file));
                // add field only for this file?
            } else {
                // Folder: call the same method again (recursion)
                request = addFilesToRequest(file, request);
            }
        }
    }
As far as I know, there is no way of submitting multiple files in the same request. These requests are usually so heavy on processing anyway that lowering the number of HTTP requests may not change the total processing time much.
If you want to speed it up, you can process all your files locally with Tika first (Tika is what Solr uses internally as well), then submit only the extracted data. That way you can multithread the extraction, add the results to a queue, and have a separate step submit to Solr as the queue fills, sending the content in a few larger batches (for example, 1000 documents at a time).
This also allows you to scale your indexing process without having to add more Solr servers to make that part of the process go faster (if your Solr node can keep up with search traffic, it shouldn't be necessary to scale it just to process documents).
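The extract-then-batch pipeline described above can be sketched with only the Java standard library. This is a minimal sketch, not a definitive implementation: the class name BatchingIndexer and the two hooks are hypothetical. In a real setup the extractor hook would call Tika for one file and the submitter hook would send one batch to Solr (for example through SolrJ); neither call is shown here.

```java
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;
import java.util.function.Function;

/** Extracts files in parallel and hands the results to a submitter in batches. */
class BatchingIndexer {

    /**
     * @param files     files to process
     * @param extractor hypothetical hook: turns one file into a field map (e.g. by calling Tika)
     * @param submitter hypothetical hook: sends one batch of field maps to Solr
     * @param batchSize maximum number of documents per batch
     * @param threads   number of extraction worker threads
     * @return number of documents handed to the submitter
     */
    static int index(List<Path> files,
                     Function<Path, Map<String, Object>> extractor,
                     Consumer<List<Map<String, Object>>> submitter,
                     int batchSize, int threads) {
        BlockingQueue<Map<String, Object>> queue = new LinkedBlockingQueue<>();
        ExecutorService pool = Executors.newFixedThreadPool(threads);

        // Producers: each task extracts one file and queues the resulting document.
        for (Path file : files) {
            pool.execute(() -> queue.add(extractor.apply(file)));
        }
        pool.shutdown(); // accept no new tasks; queued tasks still run

        // Consumer: the calling thread drains the queue and submits full batches.
        List<Map<String, Object>> batch = new ArrayList<>(batchSize);
        int submitted = 0;
        try {
            while (true) {
                Map<String, Object> doc = queue.poll(100, TimeUnit.MILLISECONDS);
                if (doc != null) {
                    batch.add(doc);
                    if (batch.size() == batchSize) {
                        submitter.accept(new ArrayList<>(batch));
                        submitted += batch.size();
                        batch.clear();
                    }
                } else if (pool.isTerminated() && queue.isEmpty()) {
                    break; // all extraction finished and nothing left to drain
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            pool.shutdownNow();
        }

        // Submit the final, possibly smaller, batch.
        if (!batch.isEmpty()) {
            submitter.accept(new ArrayList<>(batch));
            submitted += batch.size();
        }
        return submitted;
    }
}
```

Because every document is a separate field map, per-file fields can be added inside the extractor before the document is queued, which is what a single shared request cannot do. A file whose extraction throws is simply skipped in this sketch; real code would probably want to log or retry it.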
Using Tika manually also makes it easier to correct or change details while processing, for example when a file format returns dates in a different timezone than you expect.
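For the fields mentioned in the question (path, creator, creation date), one option is to read them from the file system with java.nio and add them to each document yourself. The sketch below is only an illustration: the class name FileFields and the field names "path", "owner", and "created" are made up, and the file-system owner is not necessarily the author stored inside the document, which would have to come from the metadata Tika extracts. The creation time is converted to a UTC ISO-8601 string so that every file reports dates in the same timezone.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.LinkedHashMap;
import java.util.Map;

/** Collects file-system level fields that exist for every file, regardless of type. */
class FileFields {

    /** Field names here are hypothetical; adapt them to your Solr schema. */
    static Map<String, Object> describe(Path file) {
        try {
            BasicFileAttributes attrs = Files.readAttributes(file, BasicFileAttributes.class);
            Map<String, Object> fields = new LinkedHashMap<>();
            fields.put("path", file.toAbsolutePath().toString());
            // File-system owner; may differ from the author embedded in the document.
            fields.put("owner", Files.getOwner(file).getName());
            // Instant.toString() yields ISO-8601 in UTC, e.g. 2024-01-01T12:00:00Z.
            // Some file systems do not record a creation time and report another timestamp instead.
            fields.put("created", attrs.creationTime().toInstant().toString());
            return fields;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A map like this can be merged with the Tika output for the same file before the document is submitted, so each file carries the same base fields.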