Tags: scala, apache-spark, dataframe, mapping, rdd

Convert every value of a dataframe


I need to modify the values of every column of a dataframe so that they are all enclosed within double quotes after mapping, while the dataframe still retains its original structure with the headers.

I tried mapping the values by converting the rows to sequences, but the output dataframe loses its headers.

With this read in as the input dataframe:

+------+-------+----+
|prodid|name   |city|
+------+-------+----+
|1     |Harshit|VNS |
|2     |Mohit  |BLR |
|2     |Mohit  |RAO |
|2     |Mohit  |BTR |
|3     |Rohit  |BOM |
|4     |Shobhit|KLK |
+------+-------+----+

I tried the following code.

val columns = df.columns
df.map { row =>
  row.toSeq.map { col => "\"" + col + "\"" }
}.toDF(columns: _*)

But it throws an error stating there is only one header, i.e. value, in the mapped dataframe. This is the actual result (if I remove the ".toDF(columns: _*)" part):

+--------------------+
|               value|
+--------------------+
|["1", "Harshit", ...|
|["2", "Mohit", "B...|
|["2", "Mohit", "R...|
|["2", "Mohit", "B...|
|["3", "Rohit", "B...|
|["4", "Shobhit", ...|
+--------------------+

And my expected result is something like:

+------+---------+------+
|prodid|name     |city  |
+------+---------+------+
|"1"   |"Harshit"|"VNS" |
|"2"   |"Mohit"  |"BLR" |
|"2"   |"Mohit"  |"RAO" |
|"2"   |"Mohit"  |"BTR" |
|"3"   |"Rohit"  |"BOM" |
|"4"   |"Shobhit"|"KLK" |
+------+---------+------+

Note: There are only 3 headers in this example, but my original data has many headers, so manually typing each one is not an option in case the file header changes. How do I get this dataframe of modified values while keeping the original headers?

Edit: I need the quotes on all values except the integers, so the output should be something like:

+------+---------+------+
|prodid|name     |city  |
+------+---------+------+
|1     |"Harshit"|"VNS" |
|2     |"Mohit"  |"BLR" |
|2     |"Mohit"  |"RAO" |
|2     |"Mohit"  |"BTR" |
|3     |"Rohit"  |"BOM" |
|4     |"Shobhit"|"KLK" |
+------+---------+------+

Solution

  • It might be easier to use select instead:

    import org.apache.spark.sql.functions.{col, format_string}
    import org.apache.spark.sql.types.{IntegerType, StructField}
    import spark.implicits._   // for .toDF on a local Seq (assumes a SparkSession named spark)

    val df = Seq((1, "Harshit", "VNS"), (2, "Mohit", "BLR"))
      .toDF("prodid", "name", "city")

    // Integer columns pass through unchanged; other columns are wrapped
    // in literal double quotes, keeping their original names via the alias.
    df.select(df.schema.fields.map {
      case StructField(name, IntegerType, _, _) => col(name)
      case StructField(name, _, _, _) => format_string("\"%s\"", col(name)) as name
    }: _*).show()
    

    Output:

    +------+---------+-----+
    |prodid|     name| city|
    +------+---------+-----+
    |     1|"Harshit"|"VNS"|
    |     2|  "Mohit"|"BLR"|
    +------+---------+-----+
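
    If every value needs quoting regardless of type (the first expected output in the question), the type match can be dropped. A minimal sketch of the same select-based idea, reusing the df and imports from above:

    df.select(df.columns.map { name =>
      // format_string renders any value with %s, so this quotes the integer column too;
      // the alias keeps the original header.
      format_string("\"%s\"", col(name)) as name
    }: _*).show()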
    

    Note that there are other numeric types as well, such as LongType and DoubleType, so you might need to handle those too, or alternatively quote only the StringType columns (see the sketch below).
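
    For example, a minimal sketch that quotes only the StringType columns and passes every other type (IntegerType, LongType, DoubleType, ...) through unchanged, assuming the same df and imports as above:

    import org.apache.spark.sql.types.StringType

    df.select(df.schema.fields.map {
      case StructField(name, StringType, _, _) => format_string("\"%s\"", col(name)) as name
      case StructField(name, _, _, _)          => col(name)   // numeric and other types pass through
    }: _*).show()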