I have a data source that is stored as a large number of gzipped CSV files. The header information for this source is in a separate file.
I'd like to load this data into Spark for manipulation. Is there an easy way to get Spark to figure out the schema and load the headers? There are literally hundreds of columns, and they might change between runs, so I would strongly prefer not to do this by hand.
This can easily be done in Spark. If your header file is header.csv and it contains only the header row, first load that file with the header option set to true:
val headerCSV = spark.read.format("CSV").option("header","true").load("/home/shivansh/Desktop/header.csv")
Then get the column names out as an Array:
val columns = headerCSV.columns
Then read the data file, which has no header row, and pass the column names to toDF:
spark.read.format("CSV").load("/home/shivansh/Desktop/fileWithoutHeader.csv").toDF(columns:_*)
This results in a DataFrame that holds the rows from the data file under the column names taken from the header file.
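Putting the steps together for the case in the question (many gzipped files plus one header file), a minimal sketch might look like the following. The paths, the object name and the local master setting are placeholders I made up, not anything from the question. It relies on two things Spark does out of the box: files ending in .gz are decompressed transparently, and load accepts a glob pattern. The inferSchema option is optional; it asks Spark to guess column types at the cost of an extra pass over the data.

```scala
import org.apache.spark.sql.SparkSession

object LoadWithSeparateHeader {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("load-with-separate-header")
      .master("local[*]") // assumption: local run; drop this when submitting to a cluster
      .getOrCreate()

    // Hypothetical locations; replace with your own.
    val headerPath = "/path/to/header.csv"
    val dataPath   = "/path/to/data/*.csv.gz"

    // The header file contains only the header row, so this DataFrame
    // has no rows; we only want its column names.
    val columns = spark.read
      .format("csv")
      .option("header", "true")
      .load(headerPath)
      .columns

    // Read all gzipped data files. Without a header, Spark names the
    // columns _c0, _c1, ... and inferSchema guesses their types.
    val raw = spark.read
      .format("csv")
      .option("inferSchema", "true")
      .load(dataPath)

    // toDF needs exactly one name per column, so fail early with a
    // readable message if the header and the data disagree.
    require(
      raw.columns.length == columns.length,
      s"Header has ${columns.length} names but data has ${raw.columns.length} columns"
    )

    val df = raw.toDF(columns: _*)
    df.printSchema()

    spark.stop()
  }
}
```

Because the column names are read from the header file on every run, nothing has to be edited by hand when the columns change. One thing to keep in mind: gzip files are not splittable, so each file is read by a single task, and parallelism comes from the number of files rather than their size.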