Tags: java, hadoop, apache-spark, sqoop, flume

Best way to import 20GB CSV file to Hadoop


I have a huge 20GB CSV file to copy into Hadoop/HDFS. Of course I need to handle any error cases (for example, if the server or the transfer/load application crashes).

In such a case, I need to restart the processing (on another node or not) and continue the transfer without starting over from the beginning of the CSV file.

What is the best and easiest way to do that?

Using Flume? Sqoop? A native Java application? Spark?

Thanks a lot.


Solution

  • If the file is not hosted in HDFS, Flume won't be able to parallelize reading that file (the same issue applies to Spark and other Hadoop-based frameworks). Can you mount HDFS over NFS and then use a plain file copy? (A resumable-copy sketch follows below this list.)

    One advantage of using Flume would be to read the file line by line and publish each line as a separate record, letting Flume write one record to HDFS at a time. If something goes wrong, you could resume from the last committed record instead of starting over from the beginning (a rough sketch of that checkpoint idea also follows below).
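
If you do go the plain file-copy route, one way to make the transfer restartable is to check how many bytes of the HDFS target already exist and append only the rest. Below is a minimal sketch using the Hadoop FileSystem API, assuming append is enabled on the cluster (dfs.support.append); the paths are placeholders, not anything from the question:

```java
// Minimal sketch of a resumable "put" into HDFS.
// Assumptions: append is enabled on the cluster; paths below are placeholders.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.io.IOException;
import java.io.RandomAccessFile;

public class ResumableHdfsCopy {

    private static final String LOCAL_CSV = "/data/huge.csv";     // placeholder path
    private static final String HDFS_TARGET = "/ingest/huge.csv"; // placeholder path

    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path target = new Path(HDFS_TARGET);

        // How many bytes made it into HDFS before a previous crash?
        long alreadyCopied = fs.exists(target) ? fs.getFileStatus(target).getLen() : 0L;

        try (RandomAccessFile local = new RandomAccessFile(LOCAL_CSV, "r");
             FSDataOutputStream out = alreadyCopied > 0 ? fs.append(target) : fs.create(target)) {

            local.seek(alreadyCopied);                 // skip what is already in HDFS
            byte[] buffer = new byte[8 * 1024 * 1024];
            int read;
            while ((read = local.read(buffer)) != -1) {
                out.write(buffer, 0, read);
                out.hflush();                          // push to DataNodes so a crash loses at most one buffer
            }
        }
    }
}
```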
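
And if you want the record-level resume described above without Flume, the same idea can be hand-rolled: stream the CSV line by line and periodically persist the byte offset of the last flushed record, so a restart seeks straight back to it. This is only a rough sketch with placeholder paths, not a production implementation (a crash between a flush and a checkpoint write can still duplicate a few records):

```java
// Record-level resume: a hand-rolled sketch of the checkpointing idea.
// Paths are placeholders; error handling and deduplication are omitted.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

public class LineLevelIngest {

    private static final String LOCAL_CSV = "/data/huge.csv";         // placeholder
    private static final String HDFS_TARGET = "/ingest/huge.csv";     // placeholder
    private static final String CHECKPOINT = "/var/tmp/huge.csv.pos"; // placeholder

    public static void main(String[] args) throws IOException {
        long startOffset = readCheckpoint();

        FileSystem fs = FileSystem.get(new Configuration());
        Path target = new Path(HDFS_TARGET);

        try (RandomAccessFile local = new RandomAccessFile(LOCAL_CSV, "r");
             FSDataOutputStream out = startOffset > 0 ? fs.append(target) : fs.create(target)) {

            local.seek(startOffset);                    // jump past records already committed
            String line;
            long lines = 0;
            while ((line = local.readLine()) != null) { // readLine() assumes a single-byte encoding
                out.write((line + "\n").getBytes(StandardCharsets.UTF_8));
                if (++lines % 10_000 == 0) {            // checkpoint every 10k records
                    out.hflush();
                    writeCheckpoint(local.getFilePointer());
                }
            }
            out.hflush();
            writeCheckpoint(local.getFilePointer());
        }
    }

    private static long readCheckpoint() throws IOException {
        java.nio.file.Path p = Paths.get(CHECKPOINT);
        if (!Files.exists(p)) {
            return 0L;
        }
        List<String> lines = Files.readAllLines(p);
        return lines.isEmpty() ? 0L : Long.parseLong(lines.get(0).trim());
    }

    private static void writeCheckpoint(long offset) throws IOException {
        Files.write(Paths.get(CHECKPOINT), String.valueOf(offset).getBytes(StandardCharsets.UTF_8));
    }
}
```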