scala csv apache-spark apache-zeppelin

Can I auto-load CSV headers from a separate file for a Scala Spark window on Zeppelin?


I have a data source which is stored as a large number of gzipped CSV files. The header info for this source is in a separate file.

I'd like to load this data into Spark for manipulation. Is there an easy way to get Spark to figure out the schema or load the headers? There are literally hundreds of columns, and they might change between runs, so I would strongly prefer not to do this by hand.


Solution

  • This can easily be done in Spark. If your header file is header.csv and it contains only the header row, first load that file with the header option set to true:

    val headerCSV  = spark.read.format("CSV").option("header","true").load("/home/shivansh/Desktop/header.csv")
    

    Then get the column names out as an Array:

    val columns = headerCSV.columns
    

    Then read the data file, which has no header row, and pass the column names to toDF:

    spark.read.format("CSV").load("/home/shivansh/Desktop/fileWithoutHeader.csv").toDF(columns:_*)
    

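    This results in a DataFrame that holds the data under the column names taken from the header file. The header must list exactly as many names as the data has columns, otherwise toDF fails.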
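The steps above can be folded into one helper, which may be handy when the header changes between runs. This is only a sketch: the object name `ExternalHeaderLoader`, the function name `loadWithExternalHeader`, and the paths in the usage comment are made up for illustration. It assumes the header file holds a single header line and that the data files use the same delimiter. Spark picks the gzip codec from the `.gz` file extension, so gzipped data files need no extra option, though each gzip file is read by a single task because the format is not splittable.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical helper, not part of Spark: applies the column names from a
// header-only CSV file to one or more headerless CSV data files.
object ExternalHeaderLoader {

  def loadWithExternalHeader(
      spark: SparkSession,
      headerPath: String,
      dataPath: String): DataFrame = {

    // With header=true the first line of the header file becomes the column names.
    val columns: Array[String] = spark.read
      .option("header", "true")
      .csv(headerPath)
      .columns

    // The data files have no header row, so Spark names the columns _c0, _c1, ...
    // dataPath may be a single file, a directory, or a glob such as /data/*.csv.gz.
    val raw: DataFrame = spark.read
      .option("header", "false")
      .csv(dataPath)

    // toDF rejects a mismatched number of names; check first for a clearer message.
    require(
      raw.columns.length == columns.length,
      s"Header lists ${columns.length} columns but the data has ${raw.columns.length}"
    )

    raw.toDF(columns: _*)
  }
}

// Usage with placeholder paths (assumptions, adjust to your layout):
// val df = ExternalHeaderLoader.loadWithExternalHeader(
//   spark, "/path/to/header.csv", "/path/to/data/*.csv.gz")
```

In a Zeppelin Spark paragraph the `spark` session is normally already defined, so the helper can be called directly. All columns come back as strings unless you also set `inferSchema` or cast them afterwards.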