How to use Solr DataImportHandler with XML Files?


I'm researching using DataImportHandler to import my data files utilizing FileDataSource with FileListEntityProcessor and have a couple questions before I get started that I'm hoping you guys can assist with.

1) I would like to put a file on the local filesystem in the configured location and have Solr see and process the file without additional effort on my part. Is this doable in any way? From what I've seen, this is not supported and I must manually call a URL (e.g. http://foo/solr/dataimport?command=full-import). The manual, URL-based invocation method seems perfectly logical in a database-oriented world, where one might schedule an update to run regularly but in my case I have a couple identical indexes I load balance between and don't want to run the same hefty query multiple times in parallel. As such, I'm doing one query, writing the results to an XML file, pushing that file to each box, and then wanting that file processed. I'd like the process to be as automated as possible.

2) I would like any files processed by Solr to be deleted after they've been imported. I haven't seen any way to do this currently. I thought I might be able to subclass something, but FileListEntityProcessor, for example, doesn't seem to give any handles at the right time in the workflow to delete a file. Is there somewhere else I can look?

3) When reading the DIH documentation, I ran across this statement: "When delta-import command is executed, it reads the start time stored in conf/dataimport.properties. It uses that timestamp to run delta queries and after completion, updates the timestamp in conf/dataimport.properties." If it really does update the date to the completion date, what happens to any files added between the start and end dates? Are they lost?

4) For delta imports, I don't see mention of how processed files are ordered other than that it tries not to re-import files older than that mentioned in the conf/dataimport.properties file. In cases where order matters, does it order the files by name or creation date or ...?


Solution

  • Solr/Lucene is not meant to work as a database. It is an index for data that resides somewhere else, even though you can also store the data in Solr/Lucene for special features (highlighting, etc.). Therefore there is no out-of-the-box way to add single documents and delete the source files after importing. By the way, it is best practice to keep the original documents in a database, on a file system, etc. Presumably you do keep the original documents, just not on the Solr/Lucene server?

    URL-based invocation method seems perfectly logical in a database-oriented world, where one might schedule an update to run regularly but in my case I have a couple identical indexes I load balance between and don't want to run the same hefty query multiple times in parallel.

    You could define an operating-system scheduled job (cron job) that starts a delta import, for example by requesting the dataimport URL with command=delta-import.

    I would like any files processed by Solr to be deleted after they've been imported

    I have never heard of Solr being able to do that. As I wrote above, the idea is that Solr is an index of data stored somewhere else, so the DIH expects all the documents to be available at that "somewhere". If you remove the original documents from there and update the index, the intended result is that the index content is synchronized with the documents that are (now) available...

    Are they lost?

    No.

    it reads the start time stored in conf/dataimport.properties. It uses that timestamp to run delta queries and after completion, updates the timestamp in conf/dataimport.properties.

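    Solr reads the stored start time, runs the delta queries and, once the import has finished, writes the start time of that run (not the completion time) to conf/dataimport.properties as the new timestamp. Files added while the import was running are therefore newer than the stored timestamp and get picked up by the next delta import.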

    does it order the files by name or creation date or ...?

    Not sure, but I think it reads the files in whatever order the file system returns them.
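
The scheduled-job idea and the "delete after import" wish can be combined in a small wrapper script that runs outside Solr. The following is only a minimal sketch under several assumptions: the base URL is the placeholder from the question, the handler is assumed to accept `wt=json` and to report `"status": "idle"` once it is no longer busy, and all function names (`dih_url`, `is_idle`, `files_in_order`, `import_and_clean`) are hypothetical helpers, not part of Solr. A real script should also inspect the status messages for failures before deleting anything, and note that a full-import cleans the index by default unless told otherwise.

```python
import json
import time
import urllib.parse
import urllib.request
from pathlib import Path

# Assumption: placeholder base URL taken from the example in the question.
SOLR_DIH_URL = "http://foo/solr/dataimport"


def dih_url(command, base=SOLR_DIH_URL, **params):
    """Build a DataImportHandler request URL such as ...?command=delta-import."""
    query = {"command": command, "wt": "json", **params}
    return base + "?" + urllib.parse.urlencode(query)


def is_idle(status_body):
    """True if a status response (JSON text) reports that no import is running.

    Assumption: the response carries a top-level "status" field.
    """
    return json.loads(status_body).get("status") == "idle"


def http_get(url):
    """Fetch a URL and return the response body as text."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        return resp.read().decode("utf-8")


def files_in_order(drop_dir, pattern="*.xml"):
    """List dropped files oldest first (modification time, then name)."""
    return sorted(Path(drop_dir).glob(pattern),
                  key=lambda p: (p.stat().st_mtime, p.name))


def import_and_clean(drop_dir, command="full-import", poll_seconds=5,
                     fetch=http_get):
    """Trigger an import, wait until the handler is idle, delete the files.

    Only files that were present before the import started are deleted,
    so files pushed during the run are left for the next run.
    """
    files = files_in_order(drop_dir)
    if not files:
        return []
    fetch(dih_url(command))
    while not is_idle(fetch(dih_url("status"))):
        time.sleep(poll_seconds)
    for path in files:
        path.unlink()
    return files
```

Called from a cron job on each box, for example as `import_and_clean("/path/to/dropdir")` with a directory of your choosing, this keeps the import a single automated step. Taking the file snapshot before triggering the import is a deliberate choice: it avoids deleting a file that arrived too late to be read.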