I'm using the Data Import Handler (DIH) to create documents in Solr. Each document will have zero or more attachments. The attachments (e.g. PDFs, Word docs, etc.) are parsed via Tika, and their content is stored along with a path to each attachment. The attachments' content and paths are not stored in the database, and I'd prefer not to put them there.
I currently have a schema with all the fields needed by DIH. I then added attachmentContent and attachmentPath fields, both multiValued. However, when I use SolrJ to add the documents, only one attachment (the last one added) is stored and indexed by Solr. Here's the code:
    ContentStreamUpdateRequest up = new ContentStreamUpdateRequest("/update/extract");
    up.setParam("literal.id", id);
    for (MultipartFile file : files) {
        // skip over files where the client didn't provide a filename
        if (file.getOriginalFilename().equals("")) {
            continue;
        }
        File destFile = new File(destPath, file.getOriginalFilename());
        try {
            file.transferTo(destFile);
            up.setParam("literal.attachmentPath", documentWebPath + acquisition.getId() + "/" + file.getOriginalFilename());
            up.addFile(destFile);
        } catch (IOException ioe) {
            ioe.printStackTrace();
        }
    }
    try {
        up.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
        solrServer.request(up);
    } catch (SolrServerException sse) {
        sse.printStackTrace();
    } catch (IOException ioe) {
        ioe.printStackTrace();
    }
How can I get multiple attachments (content and paths) to be stored by Solr? Or is there a better way to accomplish this?
Solr has a limitation here: this API indexes only one document per request.
If you want several documents indexed, you can bundle them into a zip file (this requires applying a patch) and have that indexed.
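As a rough sketch of the bundling step only, the attachments could be combined into one archive with the JDK's java.util.zip before being handed to Solr. The class name AttachmentZipper and the method zipFiles are hypothetical names chosen for illustration, not part of SolrJ, and whether Solr then unpacks the archive still depends on the patch mentioned above.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

public class AttachmentZipper {

    // Hypothetical helper: writes every file in `files` into a single
    // zip archive at `zipPath` and returns that path.
    public static Path zipFiles(List<Path> files, Path zipPath) throws IOException {
        try (OutputStream out = Files.newOutputStream(zipPath);
             ZipOutputStream zip = new ZipOutputStream(out)) {
            for (Path file : files) {
                // The entry name is the bare file name, so two attachments
                // with the same name would cause a ZipException here.
                zip.putNextEntry(new ZipEntry(file.getFileName().toString()));
                Files.copy(file, zip);
                zip.closeEntry();
            }
        }
        return zipPath;
    }
}
```

The resulting archive could then be added to the request once, in place of the per-file up.addFile(destFile) calls in the question's loop.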