apache-spark, apache-spark-mllib, apache-spark-ml

How to load a dataset from a String in Spark


From Spark's documentation, I know I can load a libsvm-formatted dataset from a file.
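For example, the file-based loading described in the documentation looks roughly like this (a minimal sketch; the path is only a placeholder):

     // Minimal sketch of the documented file-based route; the path is a placeholder.
     val libsvmDF = spark.read
       .format("libsvm")
       .load("data/mllib/sample_libsvm_data.txt")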

However, I want to run the code on a remote Spark cluster, so I hard-coded the iris dataset into my code, and I want to load it directly from this String object.

Looking into the DataFrameReader object, however, I find there is no API that supports loading a dataset directly from a String.


Solution

  • I tried this way:

     val irisData =
       """
         |"sepal_length","sepal_width","petal_length","petal_width","label"
         |5.1,3.5,1.4,0.2,Iris-setosa
         |4.9,3.0,1.4,0.2,Iris-setosa
         |4.7,3.2,1.3,0.2,Iris-setosa
         |4.6,3.1,1.5,0.2,Iris-setosa
       """.stripMargin
    
     println(irisData)

       "sepal_length","sepal_width","petal_length","petal_width","label"
       5.1,3.5,1.4,0.2,Iris-setosa
       4.9,3.0,1.4,0.2,Iris-setosa
       4.7,3.2,1.3,0.2,Iris-setosa
       4.6,3.1,1.5,0.2,Iris-setosa
    
     // Encoders.STRING needs this import (or spark.implicits._)
     import org.apache.spark.sql.Encoders

     val stringDS = spark.createDataset(irisData.split("\n"))(Encoders.STRING)
     val irisDatasetDF = spark.read
       .option("inferSchema", "true")
       .option("header", "true")
       .csv(stringDS)
     irisDatasetDF.show(false)
    
       +------------+-----------+------------+-----------+-----------+
       |sepal_length|sepal_width|petal_length|petal_width|label      |
       +------------+-----------+------------+-----------+-----------+
       |5.1         |3.5        |1.4         |0.2        |Iris-setosa|
       |4.9         |3.0        |1.4         |0.2        |Iris-setosa|
       |4.7         |3.2        |1.3         |0.2        |Iris-setosa|
       |4.6         |3.1        |1.5         |0.2        |Iris-setosa|
       +------------+-----------+------------+-----------+-----------+
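This works because DataFrameReader has a csv(Dataset[String]) overload (available since Spark 2.2) that parses an in-memory Dataset of lines instead of reading from a path. The same idea can also be written with the implicit toDS() helper; a minimal sketch, assuming a SparkSession value named spark:

     // Minimal sketch: the same technique via spark.implicits' toDS() instead of Encoders.STRING.
     import spark.implicits._

     val lines = Seq(
       "sepal_length,sepal_width,petal_length,petal_width,label",
       "5.1,3.5,1.4,0.2,Iris-setosa",
       "4.9,3.0,1.4,0.2,Iris-setosa"
     ).toDS()

     val df = spark.read
       .option("header", "true")
       .option("inferSchema", "true")
       .csv(lines)   // DataFrameReader.csv(Dataset[String]), Spark 2.2+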