dataframe, scala, apache-spark

How can I dynamically map the schema of a pipe delimited text file without header in Spark Scala?


I am using Spark 2.x.

I am trying to map a schema dynamically onto the contents of a pipe-delimited text file, which has no header, after reading it into a DataFrame in Spark Scala.

Text File Content - File.txt:

12345678910|abc|234567
54182124852|def|784964

Schema to be mapped:

FS1|FS2|FS3

Below is the code I tried. I also tried the code from the example at https://sparkbyexamples.com/spark/spark-read-text-file-rdd-dataframe/#dataframe-read-text, but it is not working.

import org.apache.spark.sql.{DataFrame, Dataset}

val df = spark.read.text("dbfs:/FileStore/tables/Sample1-1.txt")

import spark.implicits._
val dataRDD = df.map(x => {
  val elements = x.getString(0).split("|")
  (elements(0), elements(1), elements(2))
}).toDF("FS1", "FS2", "FS3")
dataRDD.printSchema()
dataRDD.show(false)

After executing the above code, I get the output below, which is not what I expect:

+---+---+---+
|fs1|fs2|fs3|
+---+---+---+
|1  |2  |3  |
|5  |4  |1  |
+---+---+---+

I want the new file to be saved as File1.txt, containing the file content along with the header:

FS1|FS2|FS3
12345678910|abc|234567
54182124852|def|784964

Solution

  • You just need to add a header to your csv file.

    You have a text file, and you already know the delimiter, which is |.

    Your original attempt produces single characters because String.split takes a regular expression, and | is the regex alternation operator, so split("|") matches the empty string between every character. A fixed version of that approach is sketched after the code below; the simpler route is to let the CSV reader handle the delimiter.

    You should write something like this:

    import org.apache.spark.sql.DataFrame

    // Read the pipe-delimited file; with no header option, Spark names the columns _c0, _c1, ...
    val df = spark.read.option("delimiter", "|").csv("dbfs:/FileStore/tables/Sample1-1.txt")
    val columns = Seq("FS1", "FS2", "FS3")
    // Rename the default columns to the desired schema
    val resultDF = df.toDF(columns: _*)
    
    // If you want the result as a single file, use coalesce(1). Note that this
    // funnels all the data through one task, so only do it for small outputs.
    resultDF.coalesce(1)
      .write
      .option("header", "true")
      .option("delimiter", "|")
      .mode("overwrite")
      .csv("output/path")