Tags: apache-spark, solr, lucene, solrj, solrcloud

Indexing a Spark 3 DataFrame into Apache Solr 8


I have set up a small Hadoop YARN cluster where Apache Spark is running. I have some data (JSON, CSV) that I load into Spark DataFrames for analysis. Later, I have to index all of the DataFrame data into Apache Solr. I am using Spark 3 and Solr 8.8.

In my search, I found a solution here, but it is for a different version of Spark. Hence, I have decided to ask.

Is there any built-in option for this task? I am open to using SolrJ and PySpark (not the Scala shell).


Solution

  • I found a solution myself. As of now, the Lucidworks spark-solr module does not support these versions of Spark (3.0.2) and Solr (8.8). I first installed the PySolr module and then used the following example code to finish the job:

    import pysolr
    import json
    
    def solrIndexer(row):
        # connect to the Solr core; always_commit makes each add visible immediately
        solr = pysolr.Solr('http://localhost:8983/solr/spark-test', always_commit=True)
        obj = json.loads(row)
        # pysolr expects a list of documents, not a single dict
        solr.add([obj])
    
    # load data into a DataFrame from HDFS
    csvDF = spark.read.load("hdfs://hms/data/*.csv", format="csv", sep=",", inferSchema="true", header="true")
    
    # toJSON() yields an RDD of JSON strings; count() forces the map to run
    csvDF.toJSON().map(solrIndexer).count()
    

    If there is a better option or an improvement to the above code, you are welcome to answer.
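
    One possible improvement (a sketch under the same assumptions as above, i.e. a spark-test core at localhost:8983): opening a new Solr connection for every row is expensive. With foreachPartition, each executor opens one pysolr connection per partition and sends that partition's documents in a single batched request:

    import pysolr
    import json
    
    def solrIndexPartition(rows):
        # one connection per partition instead of one per row
        solr = pysolr.Solr('http://localhost:8983/solr/spark-test')
        # collect the partition's JSON rows into one batch of documents
        batch = [json.loads(row) for row in rows]
        if batch:
            solr.add(batch)
            # commit once per partition rather than once per document
            solr.commit()
    
    # foreachPartition is an action, so no count() is needed to trigger it
    csvDF.toJSON().foreachPartition(solrIndexPartition)

    For very large partitions you may want to split the batch into smaller chunks before calling solr.add, to bound executor memory and request size.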