
Solr Millions of search queries at once


I have to do pairing between reference addresses (about 30 million in a PostgreSQL database) and files containing other addresses (many files a month, each containing millions of addresses).

I already managed to import the reference addresses into Solr, and the search engine is great. Is there a way I can create a handler or plugin that would do the pairing for each row of the file very, very fast?

I can't use the REST API to make millions of HTTP requests against Solr; that would be too slow. I just want to get the result as ID pairings. For example, if address ID 17 of the file "somefile.csv" matches IDs 36, 452, and 13456 of the reference addresses, that's all I need to retrieve; I can query the database for more information later on.

I'd like to do it in Java but any other language is possible too.


Solution

  • Create two collections - one you already have, containing the addresses you want to match against. The second collection should contain the contents of the CSV file you want to use for looking up addresses - you can upload the CSV file directly as documents in Solr.
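    Solr's update handler accepts CSV directly, so the file can be indexed with a single request. A minimal sketch, assuming a default local Solr install and a collection named uploaded_file with id and address columns in the CSV (file name and collection name are placeholders for your own):

    ```shell
    # Index the CSV file into the uploaded_file collection in one request.
    # commit=true makes the rows searchable immediately after the upload.
    curl 'http://localhost:8983/solr/uploaded_file/update?commit=true' \
      --header 'Content-Type: application/csv' \
      --data-binary @somefile.csv
    ```

    This avoids a per-row request entirely: one HTTP call indexes the whole file.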

    After having set up the two collections and their corresponding files, use the Streaming Expressions support in Solr to write a leftOuterJoin (or a different join if you want another behavior) between the two collections. That way you'll get all the entries from your uploaded file back, enriched with the id of the matching reference address.

    Given the two collections addresses (containing the original addresses) and uploaded_file (containing the uploaded CSV rows), a join expression can be written as:

    leftOuterJoin(
      search(uploaded_file, q=*:*, fl="id,address", sort="address asc"),
      select(
        search(addresses, q=*:*, fl="id,address", sort="address asc"),
        address AS original_address,
        id AS original_id
      ),
      on="address=original_address"
    )
    

    Using the "Stream" section on the Admin page for your collection will allow you to experiment with expressions.

    Using the following test documents and collections, the result is as follows:

    Documents in addresses:

      {
        "id":"add1",
        "address":"foo st. 33",
        "_version_":1606950875589246976},
      {
        "id":"add2",
        "address":"foo st. 49",
        "_version_":1606950875591344128},
      {
        "id":"add3",
        "address":"bar lane 1",
        "_version_":1606950875591344129},
      {
        "id":"add1-duplicate",
        "address":"foo st. 33",
        "_version_":1606951820879462400}
    

    Documents in uploaded_file:

      {
        "id":"up1",
        "address":"foo st. 33",
        "_version_":1606950921604956160},
      {
        "id":"up2",
        "address":"foo st. 72",
        "_version_":1606950921607053312},
      {
        "id":"up3",
        "address":"bar lane 1",
        "_version_":1606950921607053313}
    

    Running the expression gives us:

    "docs": [
      {
        "original_address": "bar lane 1",
        "address": "bar lane 1",
        "id": "up3",
        "original_id": "add3"
      },
      {
        "original_address": "foo st. 33",
        "address": "foo st. 33",
        "id": "up1",
        "original_id": "add1"
      },
      {
        "original_address": "foo st. 33",
        "address": "foo st. 33",
        "id": "up1",
        "original_id": "add1-duplicate"
      },
      {
        "address": "foo st. 72",
        "id": "up2"
      },
      {
        "EOF": true,
        "RESPONSE_TIME": 28
      }
    ]
    

    This gives you both the uploaded rows that match one or more reference addresses, as well as those that don't match anything (where original_id is missing).
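    Since you mentioned Java: the join semantics above can be consumed from SolrJ's streaming API, but to make the pairing behavior concrete, here is a minimal plain-Java sketch that reproduces the leftOuterJoin result on the sample documents. The class and method names are illustrative, not Solr API:

    ```java
    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class LeftOuterJoinDemo {
        record Row(String id, String address) {}

        // Left outer join on address: every uploaded row is emitted once per
        // matching reference id, or once with a null id when nothing matches.
        static List<String[]> leftOuterJoin(List<Row> uploaded, List<Row> refs) {
            Map<String, List<String>> refsByAddress = new HashMap<>();
            for (Row r : refs) {
                refsByAddress.computeIfAbsent(r.address(), k -> new ArrayList<>())
                             .add(r.id());
            }
            List<String[]> pairs = new ArrayList<>();
            for (Row u : uploaded) {
                List<String> matches = refsByAddress.get(u.address());
                if (matches == null) {
                    pairs.add(new String[] { u.id(), null });
                } else {
                    for (String refId : matches) {
                        pairs.add(new String[] { u.id(), refId });
                    }
                }
            }
            return pairs;
        }

        public static void main(String[] args) {
            List<Row> refs = List.of(
                new Row("add1", "foo st. 33"),
                new Row("add2", "foo st. 49"),
                new Row("add3", "bar lane 1"),
                new Row("add1-duplicate", "foo st. 33"));
            List<Row> uploaded = List.of(
                new Row("up1", "foo st. 33"),
                new Row("up2", "foo st. 72"),
                new Row("up3", "bar lane 1"));
            for (String[] p : leftOuterJoin(uploaded, refs)) {
                System.out.println(p[0] + " -> " + p[1]);
            }
        }
    }
    ```

    Note that up1 pairs with both add1 and add1-duplicate, and up2 pairs with null, mirroring the streamed result above. The in-memory map is only for illustration; at 30 million reference rows, the sorted streaming join in Solr is the approach that actually scales.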