
How to combine multiple CSV files into one file when the column sequence differs or some of the files have no header


For example, I have over 300 files in a nested folder and I have to combine all of them using PySpark or Python pandas.

File1: Date,channel,spend,clicks
File2: date ,channel,clicks,spend
File3: no header
File4: some extra columns in addition to the mandatory ones
Etc.

I am expecting a single file combining all the files in the folder, despite their different structures.


Solution

  • You can enforce a schema object to take care of files with no headers and unify the structure using spark.read.schema(schemaObject).csv(filesPath).
    To fit all records into one output file, call coalesce(1) on the DataFrame before writing (coalesce is a DataFrame method, not a writer method): df.coalesce(1).write.csv(destinationPath)
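Since the question also mentions pandas, here is a minimal sketch of the same idea in pure pandas. It assumes the mandatory columns are date, channel, spend, and clicks, and uses a simple heuristic (whether the first line names a known column) to detect headerless files; both the column list and the heuristic are assumptions you would adapt to your data.

```python
import glob
import os
import pandas as pd

# Assumed mandatory columns; adjust to your actual files.
EXPECTED = ["date", "channel", "spend", "clicks"]

def read_any(path):
    """Read one CSV, tolerating a missing header, a different
    column order, and extra columns."""
    # Peek at the first line to guess whether a header row is present.
    with open(path) as f:
        first_line = f.readline().lower()
    has_header = "channel" in first_line  # heuristic: header names a known column

    if has_header:
        df = pd.read_csv(path)
        # Normalize names such as "Date" or "date " to match EXPECTED.
        df.columns = [c.strip().lower() for c in df.columns]
    else:
        # No header: assume the columns appear in the mandatory order.
        df = pd.read_csv(path, header=None, names=EXPECTED)

    # Align column order, drop extras, and fill missing columns with NaN.
    return df.reindex(columns=EXPECTED)

def combine(folder, out_path):
    """Combine every CSV under folder (recursively) into one file."""
    files = glob.glob(os.path.join(folder, "**", "*.csv"), recursive=True)
    combined = pd.concat([read_any(f) for f in files], ignore_index=True)
    combined.to_csv(out_path, index=False)
    return combined
```

The reindex call is what unifies the differing structures: it reorders columns by name, silently discards any extra columns, and inserts NaN for columns a file lacks.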