Tags: python, csv, apache-spark, hadoop, hdfs

Merge multiple CSV files in Hadoop into one CSV file on the local filesystem


I have multiple CSV files in a Hadoop (HDFS) folder. Each CSV file includes a header row, and the header is identical in every file.

I am writing these CSV files from a Spark Dataset in Java, like this:

df.write().csv(somePath)

I also considered coalesce(1), but it is not memory efficient in my case.

I know this write also creates some extra files in the output folder (such as _SUCCESS and .crc files), so that needs to be handled as well.

I want to merge all these CSV files into one big CSV file, but I don't want the header repeated in the combined file. I just want a single header line at the top of the data.

I am using Python to merge these files. I know I can use the hadoop fs -getmerge command, but it concatenates the files verbatim, so the header from every file ends up in the merged output.

So I am not able to figure out how to merge all the CSV files without also merging their headers.
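One way to do the merge on the Python side (a minimal sketch, assuming the part files have already been copied out of HDFS, e.g. with `hdfs dfs -get`, and that every file really does share the same header) is to copy the first file whole and skip the first line of every subsequent file:

```python
import glob

def merge_csv_files(input_paths, output_path):
    """Concatenate CSV files, keeping the header from the first file only."""
    with open(output_path, "w", newline="") as out:
        for i, path in enumerate(sorted(input_paths)):
            with open(path) as f:
                header = f.readline()
                # Write the header only once, taken from the first file.
                if i == 0:
                    out.write(header)
                # Copy the remaining data rows verbatim.
                for line in f:
                    out.write(line)

# Example usage (hypothetical local directory): the "part-*.csv" glob
# matches Spark's output naming and ignores _SUCCESS and .crc files.
# merge_csv_files(glob.glob("local_copy/part-*.csv"), "merged.csv")
```

Sorting the paths keeps the row order consistent with the part-file numbering; the glob pattern also sidesteps the redundant marker files Spark leaves in the output folder.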


Solution

  • coalesce(1) is exactly what you want.

    Slower writes and higher memory pressure on a single executor are the tradeoff you accept for producing exactly one output file.