What is the best way to quickly retrieve a large dataset in Solr?
I have an index of 10 million records (6 string fields). The query and filter I'm using get the result set down to 2.7 million records, which I would like to page through programmatically and hand to another process.
Currently I'm using SolrJ and cursorMark to get 300000 records at a time. Each query takes 15-20 seconds. Is there a way to improve the speed? Decreasing the size of the "chunks" didn't help: reducing 300000 to 50000 made each query faster, but there were more of them and the overall time was about the same.
I think the issue is that Solr has to compute the entire 2.7 million result set and then carve out the interval needed on each call. Combine that with the size of the result set and I can understand why it is slow. I'm looking for some ideas on speeding it up.
Solr version: 4.10.2
My SolrJ code is below:
SolrQuery query = new SolrQuery();
query.setQuery("*:*");
query.setFilterQueries("text:\"*SEARCH STUFF*\"");
query.setParam("fl","id,srfCode");
query.setStart(0);
query.setRows(300000);
query.setSort("sortId", SolrQuery.ORDER.asc);
query.set("cursorMark", "*");
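The snippet above only sets up the first request. With cursorMark, each following request resends the same query with the cursor value returned by the previous response (in SolrJ this should be available via `QueryResponse.getNextCursorMark()`), and the end is reached when the returned cursor equals the one that was sent. A minimal sketch of that loop follows. It deliberately uses an in-memory stand-in instead of SolrJ so that it runs without a server; `PageSource`, `Page` and `fakeSource` are hypothetical names made up for illustration, not part of any Solr API.

```java
import java.util.ArrayList;
import java.util.List;

class CursorPagingSketch {

    /** One page of results plus the cursor to send with the next request. */
    static final class Page {
        final List<String> ids;
        final String nextCursorMark;

        Page(List<String> ids, String nextCursorMark) {
            this.ids = ids;
            this.nextCursorMark = nextCursorMark;
        }
    }

    /** Hypothetical stand-in for "run the Solr query with this cursorMark". */
    interface PageSource {
        Page fetch(String cursorMark, int rows);
    }

    /** Follows the cursor until it stops changing, which signals the end. */
    static List<String> fetchAll(PageSource source, int rows) {
        List<String> all = new ArrayList<String>();
        String cursorMark = "*"; // start value for the first request
        while (true) {
            Page page = source.fetch(cursorMark, rows);
            all.addAll(page.ids);
            if (cursorMark.equals(page.nextCursorMark)) {
                break; // same cursor returned: nothing left to read
            }
            cursorMark = page.nextCursorMark;
        }
        return all;
    }

    /** In-memory fake: ids 0..total-1 in ascending order; the cursor is the last id returned. */
    static PageSource fakeSource(final int total) {
        return new PageSource() {
            @Override
            public Page fetch(String cursorMark, int rows) {
                int start = "*".equals(cursorMark) ? 0 : Integer.parseInt(cursorMark) + 1;
                List<String> ids = new ArrayList<String>();
                for (int i = start; i < Math.min(start + rows, total); i++) {
                    ids.add(String.valueOf(i));
                }
                // When nothing is left, echo the cursor back unchanged.
                String next = ids.isEmpty() ? cursorMark : ids.get(ids.size() - 1);
                return new Page(ids, next);
            }
        };
    }

    public static void main(String[] args) {
        List<String> ids = fetchAll(fakeSource(10), 4);
        System.out.println(ids.size() + " ids fetched");
    }
}
```

Note that this loop always issues one extra request that comes back empty; that is how the end of the result set is detected.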
UPDATE: I tried the following in an attempt to "stream" the data out of Solr. Unfortunately, the query itself is still the bottleneck. Once I have the data I can process it quickly, but I still need a faster way to get it.
package org.search.builder;

import java.io.IOException;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.StreamingResponseCallback;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrDocument;
import org.junit.Test;

public class SolrStream {

    long startTime = 0;
    long endTime = 0;

    @Test
    public void streaming() throws SolrServerException, IOException, InterruptedException {
        long overallstartTime = System.currentTimeMillis();
        startTime = System.currentTimeMillis();

        HttpSolrServer server = new HttpSolrServer("https://solrserver/solr/indexname");
        SolrQuery tmpQuery = new SolrQuery();
        tmpQuery.setQuery("*:*");
        tmpQuery.setFilterQueries("text:\"*SEARCH STUFF*\"");
        tmpQuery.setParam("fl", "id,srfCode");
        tmpQuery.setStart(0);
        tmpQuery.setRows(300000);
        tmpQuery.set("cursorMark", "*");
        // The sort needs to be unique or have a tie breaker. In this case rowId is never duplicated.
        // If duplicates are possible, add a second sort column as a tie breaker.
        tmpQuery.setSort("rowId", SolrQuery.ORDER.asc);

        final BlockingQueue<SolrDocument> tmpQueue = new LinkedBlockingQueue<SolrDocument>();
        server.queryAndStreamResponse(tmpQuery, new MyCallbackHander(tmpQueue));
        // queryAndStreamResponse returns only after this page has been streamed.
        // When rows < numFound the callback never reaches numFound, so queue the
        // stop marker here as well; otherwise the loop below blocks forever.
        tmpQueue.add(new StopDoc());

        SolrDocument tmpDoc;
        do {
            tmpDoc = tmpQueue.take();
        } while (!(tmpDoc instanceof StopDoc));

        System.out.println("Overall Time: " + (System.currentTimeMillis() - overallstartTime) + " ms");
    }

    private class StopDoc extends SolrDocument {
        // marker to finish queuing
    }

    private class MyCallbackHander extends StreamingResponseCallback {
        private BlockingQueue<SolrDocument> queue;
        private long currentPosition;
        private long numFound;

        public MyCallbackHander(BlockingQueue<SolrDocument> aQueue) {
            queue = aQueue;
        }

        @Override
        public void streamDocListInfo(long aNumFound, long aStart, Float aMaxScore) {
            // called before streaming starts; useful for statistics
            currentPosition = aStart;
            numFound = aNumFound;
            if (numFound == 0) {
                queue.add(new StopDoc());
            }
        }

        @Override
        public void streamSolrDocument(SolrDocument aDoc) {
            currentPosition++;
            // log progress and the time taken for roughly every 50000 documents
            if (queue.size() % 50000 == 0) {
                System.out.println("adding doc " + currentPosition + " of " + numFound);
                System.out.println("Overall Time: " + (System.currentTimeMillis() - startTime) + " ms");
                startTime = System.currentTimeMillis();
            }
            queue.add(aDoc);
            if (currentPosition == numFound) {
                queue.add(new StopDoc());
            }
        }
    }
}
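One thing to note about the test above: `queryAndStreamResponse` blocks until the response has been read, so the queue is completely filled before the consuming loop starts. If the goal is to process documents while they are still arriving, one option is to run the producing side on its own thread and end the stream with a marker object. The sketch below shows only that pattern, with plain strings standing in for `SolrDocument`; `produce` is a hypothetical placeholder for the Solr call, and the queue capacity of 1000 is an arbitrary assumption.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

class StreamingQueueSketch {

    // Unique marker instance; it is compared by reference, never by content.
    static final String STOP = new String("STOP");

    /** Hypothetical placeholder for the streaming Solr call. */
    static void produce(BlockingQueue<String> queue, int count) throws InterruptedException {
        try {
            for (int i = 0; i < count; i++) {
                queue.put("doc-" + i); // blocks while the bounded queue is full
            }
        } finally {
            queue.put(STOP); // always signal the end, even if producing failed
        }
    }

    /** Starts the producer in the background and counts documents as they arrive. */
    static int consumeAll(final int count) throws InterruptedException {
        final BlockingQueue<String> queue = new LinkedBlockingQueue<String>(1000);
        Thread producer = new Thread(new Runnable() {
            @Override
            public void run() {
                try {
                    produce(queue, count);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        });
        producer.start();

        int processed = 0;
        while (queue.take() != STOP) {
            processed++; // real code would hand the document to the next process here
        }
        producer.join();
        return processed;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("processed " + consumeAll(5000) + " documents");
    }
}
```

The bounded queue keeps memory use flat when the consumer is slower than the producer. This does not make the query itself any faster; it only overlaps fetching with processing.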
MatsLindh's suggestion to use the export request handler worked perfectly.
Add this requestHandler to your solrconfig if it is not already there:
<requestHandler name="/export" class="solr.SearchHandler">
  <lst name="invariants">
    <str name="rq">{!xport}</str>
    <str name="wt">xsort</str>
    <str name="distrib">false</str>
  </lst>
  <arr name="components">
    <str>query</str>
  </arr>
</requestHandler>
Then call it this way: /export?q=rowId:[1 TO 4000]&fq=text:"STUFF"&fl=field1,field2&sort=sortColumn asc
Note: both sort and fl are required.
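When the /export request is built from Java, the parameter values need to be URL-encoded, since the query contains spaces, brackets and quotes. Here is a small sketch using only the JDK; the class name `ExportUrlSketch` and the `build` helper are made up for illustration, and the base URL is the placeholder from the test code above.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

class ExportUrlSketch {

    /** Appends /export and the URL-encoded parameters to the core's base URL. */
    static String build(String baseUrl, Map<String, String> params)
            throws UnsupportedEncodingException {
        StringBuilder url = new StringBuilder(baseUrl).append("/export");
        char separator = '?';
        for (Map.Entry<String, String> param : params.entrySet()) {
            url.append(separator)
               .append(param.getKey())
               .append('=')
               .append(URLEncoder.encode(param.getValue(), "UTF-8"));
            separator = '&';
        }
        return url.toString();
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        // LinkedHashMap keeps the parameters in insertion order.
        Map<String, String> params = new LinkedHashMap<String, String>();
        params.put("q", "rowId:[1 TO 4000]");
        params.put("fq", "text:\"STUFF\"");
        params.put("fl", "field1,field2");   // required by /export
        params.put("sort", "sortColumn asc"); // required by /export
        System.out.println(build("https://solrserver/solr/indexname", params));
    }
}
```

The resulting URL can then be fetched with any HTTP client and the response read as a stream.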
Now I just need to figure out how to get /export to work in a SolrCloud setup.
Thanks!