solr, spark-streaming, solrj

Indexing multiple documents


I am running into problems while trying to index multiple documents in Solr from Spark Streaming using SolrJ. For each micro-batch, I parse the records and index each of them.

In the code below, the first method (tagged //method 1) works as expected. The second method (tagged //method 2) does nothing; it does not even fail.

In the first method, I index one hard-coded record per partition; useless, but functional. In the second method, I convert each element of the partition into a document and then try to index each of them, but it fails: no records show up in the collection.

I am using SolrJ 4.10 and Spark 2.2.1.

//method 1
myDStream.foreachRDD { rdd => rdd.foreachPartition { records =>

  val solrServer = new HttpSolrServer(collectionUrl)

  val document = new SolrInputDocument()
  document.addField("key", "someValue")
  ...

  solrServer.add(document)
  solrServer.commit()
}}
//method 2
myDStream.foreachRDD { rdd => rdd.foreachPartition { records =>

  val solrServer = new HttpSolrServer(collectionUrl)

  records.map { record =>

    val document = new SolrInputDocument()
    document.addField("key", record.key)
    ...

    solrServer.add(document)
    solrServer.commit()
  }
}}

I would like to understand why the second method does not work, and to find a solution for indexing multiple documents.


Solution

  • The solution was to process the records directly through the RDD, using rdd.foreach instead of rdd.foreachPartition:

    myDStream.foreachRDD { rdd => rdd.foreach { record =>
    
      val solrServer = new HttpSolrServer(collectionUrl)
    
      val document = new SolrInputDocument()
      document.addField("key", record.key)
      ...
    
      solrServer.add(document)
      solrServer.commit()
    }}
    

    See EricLavault's comment above for more detail on the suspected cause of the issue.
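
  • A likely explanation for why method 2 silently does nothing: inside foreachPartition, records is a scala.collection.Iterator, and Iterator.map is lazy. The call only wraps the iterator; its body runs when the resulting iterator is consumed, which never happens here, so no document is ever added or committed. A minimal sketch of that behavior:

        val it = Iterator(1, 2, 3)

        // map on an Iterator is lazy: this line prints nothing
        val mapped = it.map { x => println(s"processing $x"); x }

        // the body only runs once the resulting iterator is consumed
        mapped.size  // prints "processing 1", "processing 2", "processing 3"

  • If that is the cause, replacing records.map with the eager records.foreach should also fix method 2 while keeping the per-partition connection, and committing once per partition avoids the cost of one commit per document. A sketch along those lines (assuming myDStream and collectionUrl from the question; this is not the accepted answer):

        import org.apache.solr.client.solrj.impl.HttpSolrServer
        import org.apache.solr.common.SolrInputDocument

        myDStream.foreachRDD { rdd => rdd.foreachPartition { records =>

          // one connection per partition, as in method 2
          val solrServer = new HttpSolrServer(collectionUrl)

          // foreach is eager, unlike map, so the body actually executes
          records.foreach { record =>
            val document = new SolrInputDocument()
            document.addField("key", record.key)
            solrServer.add(document)
          }

          // a single commit per partition instead of one per document
          solrServer.commit()
        }}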