java solr solrj

Recursively scan documents for indexing in a folder in SolrJ


I understand that in SimplePostTool (post.jar), there is this command to automatically detect content types in a folder, and recursively scan it for documents for indexing into a collection:
bin/post -c gettingstarted afolder/

This has been useful for mass indexing of all the files in the folder. Now that I'm moving to production, I plan to use SolrJ for the indexing, since it can do more, such as robustness checks and retries for index requests that fail.

However, I can't seem to find a way to do the same in SolrJ. Is this possible in SolrJ? I'm using Solr 5.3.0.

Thank you.

Regards,
Edwin


Solution

  • If you're looking to submit content to an extracting request handler (for indexing PDFs and similar rich documents), you can use the ContentStreamUpdateRequest class as shown at Uploading data with SolrJ:

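    import java.io.File;
    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;
    SolrClient server = new HttpSolrClient("http://localhost:8983/solr/my_collection");
    ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/extract");
    req.addFile(new File("my-file.pdf"), "application/pdf"); // the file plus its MIME type
    server.request(req);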
    

    To iterate through a directory structure recursively in Java, see Best way to iterate through a directory in Java.

    If you're planning to index plain content (and not use the extracting request handler), you can do that by creating the documents in SolrJ itself and then submitting the documents to the server - there's no need to write them to a temporary file in between.
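To tie the pieces together, here is a minimal sketch of the recursive scan with a simple retry loop, assuming Java 8 or later. The names FolderScanner, FileIndexer, listFiles and indexAll are hypothetical and not part of SolrJ. The Solr call itself is left to the callback so that the sketch needs only the JDK; in practice the callback would build and send a ContentStreamUpdateRequest as in the snippet above.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class FolderScanner {

    /** Hypothetical callback; a real one would wrap the SolrJ request. */
    interface FileIndexer {
        void index(Path file, String contentType) throws Exception;
    }

    /** Returns every regular file under root, descending into subfolders. */
    static List<Path> listFiles(Path root) throws IOException {
        try (Stream<Path> paths = Files.walk(root)) {
            return paths.filter(Files::isRegularFile)
                        .sorted()
                        .collect(Collectors.toList());
        }
    }

    /**
     * Hands each file to the indexer, trying up to maxAttempts times per file.
     * Returns the files that still failed after the last attempt.
     */
    static List<Path> indexAll(Path root, FileIndexer indexer, int maxAttempts)
            throws IOException {
        List<Path> failed = new ArrayList<>();
        for (Path file : listFiles(root)) {
            // probeContentType may return null when the type cannot be determined
            String type = Files.probeContentType(file);
            if (type == null) {
                type = "application/octet-stream";
            }
            boolean ok = false;
            for (int attempt = 1; attempt <= maxAttempts && !ok; attempt++) {
                try {
                    indexer.index(file, type);
                    ok = true;
                } catch (Exception e) {
                    // swallow and retry; a real indexer would log and back off here
                }
            }
            if (!ok) {
                failed.add(file);
            }
        }
        return failed;
    }
}
```

A call such as FolderScanner.indexAll(Paths.get("afolder"), indexer, 3) would then play roughly the role of bin/post -c gettingstarted afolder/, with the returned list telling you which files to inspect or resubmit. Content type detection here relies on the JDK and may be less thorough than what the post tool does.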