I have set up a small Hadoop YARN cluster where Apache Spark is running. I have some data (JSON, CSV) that I load into a Spark DataFrame for some analysis. Later, I have to index all of the DataFrame data into Apache Solr. I am using Spark 3 and Solr 8.8.
In my search I found a solution here, but it is for a different version of Spark. Hence, I have decided to ask.
Is there any built-in option for this task? I am open to using SolrJ and PySpark (not the Scala shell).
I found a solution myself. As of now, the Lucidworks spark-solr module does not support these versions of Spark (3.0.2) and Solr (8.8). So I first installed the pysolr module and then used the following example code to finish the job:
import pysolr
import json
def solrIndexer(row):
    # each row is one JSON string; index it into the spark-test collection
    solr = pysolr.Solr('http://localhost:8983/solr/spark-test')
    obj = json.loads(row)
    # pysolr expects a list of documents; docs become visible after a commit/autoCommit
    solr.add([obj])
# load data into a DataFrame from HDFS
csvDF = spark.read.load("hdfs://hms/data/*.csv", format="csv", sep=",", inferSchema="true", header="true")

# count() is used only as an action to force the lazy map to run over every row
csvDF.toJSON().map(solrIndexer).count()
If there is a better option or an improvement to the above code, you are welcome to answer.
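One possible refinement, offered only as a sketch (it assumes the same pysolr library and collection URL as above, and that an explicit commit is acceptable): instead of opening a connection and sending one request per row, use foreachPartition so each partition reuses a single connection and sends its rows as one batched add.

import json
import pysolr

def solrIndexPartition(rows):
    # one connection per partition instead of one per row
    solr = pysolr.Solr('http://localhost:8983/solr/spark-test', timeout=30)
    docs = [json.loads(row) for row in rows]
    if docs:
        # send the whole partition in a single request and commit it
        solr.add(docs, commit=True)

csvDF.toJSON().foreachPartition(solrIndexPartition)

This avoids the per-row connection overhead and greatly reduces the number of update requests Solr has to handle.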