I have multiple CSV files in a Hadoop folder. Each CSV file has a header, and the header is the same in every file.
I am writing these CSV files from a Spark Dataset in Java, like this:
df.write().csv(somePath)
I was also thinking of using coalesce(1), but it is not memory efficient in my case.
I know that this write also creates some redundant files in the folder, so I need to handle that as well.
I want to merge all these CSV files into one big CSV file, but I don't want to repeat the header in the combined file. I just want a single header line on top of the data.
I am using Python to merge these files. I know I can use the Hadoop getmerge command, but it would also concatenate the headers that are present in each CSV file.
So I cannot figure out how to merge all the CSV files without duplicating the headers.
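This is roughly the kind of Python merge I have in mind, assuming the part files are first copied out of HDFS to a local folder (for example with hadoop fs -get); all paths below are just placeholders:

import glob

# Placeholder paths; these assume the Spark part files were first copied
# out of HDFS to a local directory.
part_files = sorted(glob.glob("/tmp/csv_parts/part-*.csv"))
merged_path = "/tmp/merged.csv"

with open(merged_path, "w") as out:
    for i, path in enumerate(part_files):
        with open(path) as part:
            header = part.readline()
            if i == 0:
                out.write(header)  # keep the header from the first file only
            for line in part:
                out.write(line)    # copy the data rows from every file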
coalesce(1) is exactly what you want. Speed/memory usage is the tradeoff you get for wanting exactly one file.
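For example, something along these lines (a minimal sketch reusing df and somePath from your question; with a single part file the header is written only once):

df.coalesce(1)
        .write()
        .option("header", "true")   // one output part file, so one header line
        .csv(somePath);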