From Spark's documentation, I know I can load a libsvm-formatted dataset from a file.
However, I want to run my code on a remote Spark cluster, so I hard-coded the iris dataset into my code as a String, and I want to load it directly from that String object.
Looking at the DataFrameReader API, though, I could not find a method that loads a dataset directly from a String.
I tried it this way:
val irisData =
"""
|"sepal_length","sepal_width","petal_length","petal_width","label"
|5.1,3.5,1.4,0.2,Iris-setosa
|4.9,3.0,1.4,0.2,Iris-setosa
|4.7,3.2,1.3,0.2,Iris-setosa
|4.6,3.1,1.5,0.2,Iris-setosa
""".stripMargin
println(irisData)
"sepal_length","sepal_width","petal_length","petal_width","label"
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
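One detail worth noting: the triple-quoted literal starts with a line break, so splitting it on "\n" yields an empty first element ahead of the header row. The CSV read further down still picks up the header correctly, as its output shows, but if you would rather not rely on that, you can drop blank lines yourself. A minimal plain-Scala sketch, where the object name `IrisLines` and the shortened two-row sample are only for illustration:

```scala
object IrisLines {
  // Same shape as the literal above, shortened to two data rows.
  val irisData: String =
    """
      |"sepal_length","sepal_width","petal_length","petal_width","label"
      |5.1,3.5,1.4,0.2,Iris-setosa
      |4.9,3.0,1.4,0.2,Iris-setosa
      |""".stripMargin

  // The literal begins with a newline, so the first element here is "".
  val rawLines: Array[String] = irisData.split("\n")

  // Trim each line and drop the blank ones before handing them to Spark.
  val cleanLines: Seq[String] = rawLines.map(_.trim).filter(_.nonEmpty).toSeq
}
```

`IrisLines.cleanLines` can then be passed to `spark.createDataset(...)(Encoders.STRING)` in place of the raw split result.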
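import org.apache.spark.sql.Encoders

// Turn the string into a Dataset[String], one element per line
val stringDS = spark.createDataset(irisData.split("\n"))(Encoders.STRING)
// DataFrameReader.csv also accepts a Dataset[String] (available since Spark 2.2)
val irisDatasetDF = spark.read
  .option("inferSchema", "true")
  .option("header", "true")
  .csv(stringDS)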
irisDatasetDF.show(false)
+------------+-----------+------------+-----------+-----------+
|sepal_length|sepal_width|petal_length|petal_width|label      |
+------------+-----------+------------+-----------+-----------+
|5.1         |3.5        |1.4         |0.2        |Iris-setosa|
|4.9         |3.0        |1.4         |0.2        |Iris-setosa|
|4.7         |3.2        |1.3         |0.2        |Iris-setosa|
|4.6         |3.1        |1.5         |0.2        |Iris-setosa|
+------------+-----------+------------+-----------+-----------+
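Since the columns are known up front, a possible variation is to pass an explicit schema instead of using inferSchema, which saves Spark the extra pass over the data that inference needs. The following is only a sketch under some assumptions: the object name `IrisFromString` and the helper `load` are hypothetical, and the column types (four doubles and a string label) are my reading of the sample rows.

```scala
import org.apache.spark.sql.{DataFrame, Encoders, SparkSession}
import org.apache.spark.sql.types.{DoubleType, StringType, StructField, StructType}

object IrisFromString {
  // Assumed schema, mirroring the header row of the sample data.
  val irisSchema: StructType = StructType(Seq(
    StructField("sepal_length", DoubleType),
    StructField("sepal_width", DoubleType),
    StructField("petal_length", DoubleType),
    StructField("petal_width", DoubleType),
    StructField("label", StringType)
  ))

  // Hypothetical helper: parse in-memory CSV text into a DataFrame.
  def load(spark: SparkSession, csvText: String): DataFrame = {
    // One element per non-blank line of the input text.
    val lines = csvText.split("\n").map(_.trim).filter(_.nonEmpty).toSeq
    val ds = spark.createDataset(lines)(Encoders.STRING)
    spark.read
      .option("header", "true") // the first line holds column names, not data
      .schema(irisSchema)       // explicit schema instead of inferSchema
      .csv(ds)
  }
}
```

With the `spark` session and `irisData` string from above, usage would be `val df = IrisFromString.load(spark, irisData)` followed by `df.printSchema()` to check the column types.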