Tags: java, solrj

How to efficiently (i.e. without memory leaks) retrieve already existing documents from a Solr index?


I believe my method is leaking memory, since in the profiler the number of "Surviving generations" keeps increasing:

[Profiler screenshot: the "Surviving generations" count keeps increasing]

In production I get "OutOfMemoryError: Java heap space" errors after a while, and I now think this method is the culprit.

As background, the goal of my method is to retrieve the documents already present in an index. The resulting list is then used to decide whether each document can remain in the index or should be removed (e.g. because the corresponding document has been deleted from disk):

public final List<MyDocument> getListOfMyDocumentsAlreadyIndexed()
        throws SolrServerException, HttpSolrClient.RemoteSolrException, IOException {

    final SolrQuery query = new SolrQuery("*:*");
    query.addField("id");
    query.setRows(Integer.MAX_VALUE); // we want ALL documents in the index, not only the first ones

    final SolrDocumentList results = this.getSolrClient().query(query).getResults();

    // parallelStream() was also replaced with stream(), with the same behaviour
    final List<MyDocument> listOfMyDocumentsAlreadyIndexed = results.parallelStream()
            .map(doc -> {
                MyDocument tmpDoc = new MyDocument();
                tmpDoc.setId((String) doc.getFirstValue("id"));

                // Usually some boolean fields are also set here;
                // removed for the test and this question

                return tmpDoc;
            })
            .collect(Collectors.toList());

    return listOfMyDocumentsAlreadyIndexed;
}

The test for this method performs the following call in a for loop, 300 times (this simulates my indexing loops, since my program indexes one index after the other):

List<MyDocument> listOfExistingDocsInIndex = index.getListOfMyDocumentsAlreadyIndexed();

I tried to nullify the list after use (in the test there is no actual use; it was just to see if it had any effect), without any noticeable change: listOfExistingDocsInIndex = null;
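
For reference, the test loop is essentially the following (a minimal sketch reconstructed from the description above; index is my index object):

for (int i = 0; i < 300; i++) {
    List<MyDocument> listOfExistingDocsInIndex = index.getListOfMyDocumentsAlreadyIndexed();
    // the list is not actually used in the test
    listOfExistingDocsInIndex = null; // attempt to help the GC; no noticeable effect
}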

This is the call tree I get from the NetBeans profiler (I have just started using the profiler):

[Profiler screenshot: call tree]

What can I change or improve to avoid this memory leak (it is actually a memory leak, isn't it?)?

Any help appreciated :-)


Solution

  • So far I've found that, to avoid memory leaks while retrieving all documents from an index, one has to avoid using:

    query.setRows(Integer.MAX_VALUE);
    

    Instead, documents have to be retrieved chunk by chunk, with a chunk size between 100 and 200 documents, using cursor marks as described in Solr's wiki (note that cursor-based paging requires a sort on the unique key field, here id):

    // From the Solr wiki; some_query, r (the chunk size) and solrServer are placeholders,
    // and doCustomProcessingOfResults(...) stands for your own handling of each chunk.
    SolrQuery q = (new SolrQuery(some_query)).setRows(r).setSort(SortClause.asc("id"));
    String cursorMark = CursorMarkParams.CURSOR_MARK_START;
    boolean done = false;
    while (!done) {
      q.set(CursorMarkParams.CURSOR_MARK_PARAM, cursorMark);
      QueryResponse rsp = solrServer.query(q);
      String nextCursorMark = rsp.getNextCursorMark();
      doCustomProcessingOfResults(rsp);
      if (cursorMark.equals(nextCursorMark)) {
        done = true; // the cursor did not advance: all documents have been retrieved
      }
      cursorMark = nextCursorMark;
    }
    

    Now the number of surviving generations remains stable over time:

    [Profiler screenshot: surviving generations remain stable over time]

    The drawback is that the garbage collector is much more active and retrieval is much slower (I did not benchmark it, so I have no metrics to show).
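
    For illustration, here is a sketch of how the method from the question could be rewritten with cursor marks. This is untested: MyDocument and getSolrClient() are assumed to behave as in the question, and the chunk size of 200 is just a value within the 100-200 range mentioned above.

    // Requires: org.apache.solr.client.solrj.SolrQuery (and its nested SortClause),
    // org.apache.solr.client.solrj.response.QueryResponse,
    // org.apache.solr.common.SolrDocument, org.apache.solr.common.params.CursorMarkParams
    public final List<MyDocument> getListOfMyDocumentsAlreadyIndexed()
            throws SolrServerException, IOException {

        final SolrQuery query = new SolrQuery("*:*");
        query.addField("id");
        query.setRows(200); // chunk size: an assumption within the 100-200 range above
        query.setSort(SolrQuery.SortClause.asc("id")); // cursor marks require a sort on the unique key

        final List<MyDocument> docs = new ArrayList<>();
        String cursorMark = CursorMarkParams.CURSOR_MARK_START;
        boolean done = false;
        while (!done) {
            query.set(CursorMarkParams.CURSOR_MARK_PARAM, cursorMark);
            final QueryResponse rsp = this.getSolrClient().query(query);
            for (SolrDocument doc : rsp.getResults()) {
                MyDocument tmpDoc = new MyDocument();
                tmpDoc.setId((String) doc.getFirstValue("id"));
                docs.add(tmpDoc);
            }
            final String nextCursorMark = rsp.getNextCursorMark();
            if (cursorMark.equals(nextCursorMark)) {
                done = true; // the cursor did not advance: all documents have been retrieved
            }
            cursorMark = nextCursorMark;
        }
        return docs;
    }

    This way only one chunk of 200 SolrDocument instances is referenced at a time, so each chunk becomes eligible for garbage collection as soon as the next one is fetched, instead of the entire result set being materialized in memory at once.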