Tags: scala, apache-spark, dataframe, rdd

Convert RDD[String] to Data Frame


I have an RDD[String] of this form:

VAR1,VAR2,VAR3,VAR4, ...
  a ,  b ,  c ,  d , ...
  e ,  f ,  g ,  h , ...

This means that the first line is my comma-separated header, and all the following lines are my data, also comma-separated.

My goal is to convert that unstructured RDD to a DataFrame like this:

_____________________
|VAR1|VAR2|VAR3|VAR4| 
|----|----|----|----|
|  a |  b |  c |  d | 
|  e |  f |  g |  h | 

I have tried to use the toDF() method, which converts an RDD of tuples to a DataFrame. But converting my RDD[String] to an RDD of tuples seems unrealistic given my number of variables (more than 200).

Another solution would be to use the method

sqlContext.createDataFrame(rdd, schema)

which requires converting my RDD[String] to an RDD[Row] and converting my header (the first line of the RDD) to a schema of type StructType, but I don't know how to create that schema.

Any solution for converting an RDD[String] to a DataFrame with a header would be much appreciated.

Thanks in advance.


Solution

  • You could also achieve this result with something like this:

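    // Assumes a spark-shell session where `sc` and `spark` are already defined.
    import spark.implicits._  // needed for .toDS on an RDD[String]

    val data = Seq(
      ("VAR1, VAR2, VAR3, VAR4"),
      ("a, b, c, d"),
      ("ae, f, g, h")
    )
    
    // Turn the RDD[String] into a Dataset[String] so the CSV reader can parse it.
    val dataDS = sc.parallelize(data).toDS
    // "header" takes column names from the first line; "inferSchema" guesses column types.
    val result = spark.read.option("inferSchema","true").option("header","true").csv(dataDS)
    
    result.printSchema()
    
    result.show()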
    

    The output from the above is:

    root
     |-- VAR1: string (nullable = true)
     |--  VAR2: string (nullable = true)
     |--  VAR3: string (nullable = true)
     |--  VAR4: string (nullable = true)
    

    and

    +----+-----+-----+-----+
    |VAR1| VAR2| VAR3| VAR4|
    +----+-----+-----+-----+
    |   a|    b|    c|    d|
    |  ae|    f|    g|    h|
    +----+-----+-----+-----+
    

    If one of your columns contains only numeric values (excluding the header), the "inferSchema" option should infer a numeric type for that column. For example, using this as the input data:

    val data = Seq(
      ("VAR1, VAR2, VAR3, VAR4"),
      ("a,   1, c, d"),
      ("ae, 10, g, h")
    )
    

    The output will be:

    root
     |-- VAR1: string (nullable = true)
     |--  VAR2: double (nullable = true)
     |--  VAR3: string (nullable = true)
     |--  VAR4: string (nullable = true)
    

    and

    +----+-----+-----+-----+
    |VAR1| VAR2| VAR3| VAR4|
    +----+-----+-----+-----+
    |   a|  1.0|    c|    d|
    |  ae| 10.0|    g|    h|
    +----+-----+-----+-----+
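
    Note that the column names in the schemas above keep the leading space from the header line (for example " VAR2"). A minimal sketch of one way to avoid that, assuming Spark 2.2 or later, is to set the CSV reader's "ignoreLeadingWhiteSpace" and "ignoreTrailingWhiteSpace" options. The object name TrimmedCsv and the method read are hypothetical names used only for illustration:

    ```scala
    import org.apache.spark.sql.{DataFrame, SparkSession}

    // Hypothetical helper, for illustration only.
    object TrimmedCsv {
      // Parses in-memory CSV lines (first line is the header) and strips
      // whitespace around each value, including the header names.
      def read(spark: SparkSession, lines: Seq[String]): DataFrame = {
        import spark.implicits._ // provides the Encoder[String] for createDataset
        val ds = spark.createDataset(lines)
        spark.read
          .option("header", "true")
          .option("inferSchema", "true")
          .option("ignoreLeadingWhiteSpace", "true")
          .option("ignoreTrailingWhiteSpace", "true")
          .csv(ds)
      }
    }
    ```

    In spark-shell this could be called as TrimmedCsv.read(spark, data).printSchema(). With the whitespace removed, the inferred type of a numeric column may differ from the one shown above.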
    

    I hope this helps.
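
For the createDataFrame(rdd, schema) route mentioned in the question, here is a rough sketch of how the header line could be turned into a StructType. It assumes every column is treated as a string, that no value contains a quoted comma, and that every data line has as many fields as the header. The names RddToDataFrame, schemaFromHeader and toDataFrame are hypothetical and not part of Spark:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Hypothetical helper, for illustration only.
object RddToDataFrame {

  // One nullable string column per comma-separated header name.
  def schemaFromHeader(header: String): StructType =
    StructType(
      header.split(",").map(name => StructField(name.trim, StringType, nullable = true))
    )

  def toDataFrame(spark: SparkSession, rdd: RDD[String]): DataFrame = {
    val header = rdd.first() // the first line holds the column names
    val schema = schemaFromHeader(header)

    val rows = rdd
      // Drops the header; any data line identical to the header is dropped too.
      .filter(_ != header)
      // Naive split: the -1 limit keeps trailing empty fields, quoting is not handled.
      .map(line => Row.fromSeq(line.split(",", -1).map(_.trim).toSeq))

    spark.createDataFrame(rows, schema)
  }
}
```

In spark-shell, RddToDataFrame.toDataFrame(spark, rdd).show() would then display the result. Because the schema is built from the header at runtime, this works the same way for 4 columns or for more than 200.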