I have a huge number of PDF/Word/Excel/etc. files to index (40GB now, but possibly up to 1000GB within a few months), and I am considering using Solr with a DataImportHandler and Tika. I have read a lot of topics on this subject, but there is one problem for which I still have not found a solution: if I index all the files (full or delta import), remove a file from the filesystem, and index again (with a delta import), the document corresponding to that file is not removed from the index.
Here are some possibilities:
Do you have any other ideas, or a way to implement the second solution? Thanks in advance.
Some details:
Have you thought about using a file system monitor to catch deletions and update the index?
I think Apache Commons IO supports that.
Check out the org.apache.commons.io.monitor package, in particular the FileAlterationObserver and FileAlterationMonitor classes.
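To show the idea without pulling in a dependency, here is a minimal sketch in plain Java of what a polling monitor does: take a snapshot of the directory tree, take another one later, and report the paths that disappeared. This is a standard-library stand-in, not the Commons IO API. The class name `DeletionDetector` and the `onDelete` callback are made up for illustration, and it assumes the Solr document id is the file's absolute path, which may not match your schema.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Set;
import java.util.TreeSet;
import java.util.function.Consumer;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical helper: detects deleted files by comparing directory snapshots. */
class DeletionDetector {
    private final Path root;
    private Set<String> previous;

    /** Takes the initial snapshot; only deletions after this point are seen. */
    DeletionDetector(Path root) throws IOException {
        this.root = root;
        this.previous = snapshot(root);
    }

    /** Lists every regular file under root as an absolute path string. */
    static Set<String> snapshot(Path root) throws IOException {
        try (Stream<Path> paths = Files.walk(root)) {
            return paths.filter(Files::isRegularFile)
                        .map(p -> p.toAbsolutePath().toString())
                        .collect(Collectors.toCollection(TreeSet::new));
        }
    }

    /**
     * Compares the current tree with the previous snapshot and calls onDelete
     * once per vanished file. Assumption: the callback is where a Solr
     * delete-by-id request for that path would be issued.
     */
    Set<String> poll(Consumer<String> onDelete) throws IOException {
        Set<String> current = snapshot(root);
        Set<String> removed = new TreeSet<>(previous);
        removed.removeAll(current);      // present before, missing now
        removed.forEach(onDelete);
        previous = current;              // the next poll compares against this
        return removed;
    }
}
```

You would call `poll` on a timer (or right before each delta import) and send the deletes for whatever it returns. Note that a detector like this only notices files removed after its first snapshot, so documents that were already orphaned in the index would still need a one-off cleanup.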