Tags: java, hadoop, apache-spark, sqoop, flume

Best way to import 20GB CSV file to Hadoop


I have a huge 20GB CSV file to copy into Hadoop/HDFS. Of course I need to handle any error cases (for example, if the server or the transfer/load application crashes).

In such a case, I need to restart the processing (on another node or not) and continue the transfer without starting over from the beginning of the CSV file.

What is the best and easiest way to do that?

Using Flume? Sqoop? A native Java application? Spark?

Thanks a lot.


Solution

  • If the file is not hosted in HDFS, Flume won't be able to parallelize reading that file (the same issue applies to Spark and other Hadoop-based frameworks). Can you mount HDFS over NFS and then use a plain file copy? (A resumable-copy sketch follows below this list.)

    One advantage of using Flume would be to read the file line by line and publish each line as a separate record, letting Flume write one record to HDFS at a time. If something goes wrong, you could resume from the last committed record instead of starting over from the beginning (a rough sketch of that checkpoint idea also follows below).
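
If you do go the plain file-copy route, one way to make the transfer restartable is to check how many bytes of the HDFS target already exist and append only the rest. Below is a minimal sketch using the Hadoop FileSystem API, assuming append is enabled on the cluster (dfs.support.append); the paths are placeholders, not anything from the question:

```java
// Minimal sketch of a resumable "put" into HDFS.
// Assumptions: append is enabled on the cluster; paths below are placeholders.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.io.IOException;
import java.io.RandomAccessFile;

public class ResumableHdfsCopy {

    private static final String LOCAL_CSV = "/data/huge.csv";     // placeholder path
    private static final String HDFS_TARGET = "/ingest/huge.csv"; // placeholder path

    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path target = new Path(HDFS_TARGET);

        // How many bytes made it into HDFS before a previous crash?
        long alreadyCopied = fs.exists(target) ? fs.getFileStatus(target).getLen() : 0L;

        try (RandomAccessFile local = new RandomAccessFile(LOCAL_CSV, "r");
             FSDataOutputStream out = alreadyCopied > 0 ? fs.append(target) : fs.create(target)) {

            local.seek(alreadyCopied);                 // skip what is already in HDFS
            byte[] buffer = new byte[8 * 1024 * 1024];
            int read;
            while ((read = local.read(buffer)) != -1) {
                out.write(buffer, 0, read);
                out.hflush();                          // push to DataNodes so a crash loses at most one buffer
            }
        }
    }
}
```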
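
And if you want the record-level resume described above without Flume, the same idea can be hand-rolled: stream the CSV line by line and periodically persist the byte offset of the last flushed record, so a restart seeks straight back to it. This is only a rough sketch with placeholder paths, not a production implementation (a crash between a flush and a checkpoint write can still duplicate a few records):

```java
// Record-level resume: a hand-rolled sketch of the checkpointing idea.
// Paths are placeholders; error handling and deduplication are omitted.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

public class LineLevelIngest {

    private static final String LOCAL_CSV = "/data/huge.csv";         // placeholder
    private static final String HDFS_TARGET = "/ingest/huge.csv";     // placeholder
    private static final String CHECKPOINT = "/var/tmp/huge.csv.pos"; // placeholder

    public static void main(String[] args) throws IOException {
        long startOffset = readCheckpoint();

        FileSystem fs = FileSystem.get(new Configuration());
        Path target = new Path(HDFS_TARGET);

        try (RandomAccessFile local = new RandomAccessFile(LOCAL_CSV, "r");
             FSDataOutputStream out = startOffset > 0 ? fs.append(target) : fs.create(target)) {

            local.seek(startOffset);                    // jump past records already committed
            String line;
            long lines = 0;
            while ((line = local.readLine()) != null) { // readLine() assumes a single-byte encoding
                out.write((line + "\n").getBytes(StandardCharsets.UTF_8));
                if (++lines % 10_000 == 0) {            // checkpoint every 10k records
                    out.hflush();
                    writeCheckpoint(local.getFilePointer());
                }
            }
            out.hflush();
            writeCheckpoint(local.getFilePointer());
        }
    }

    private static long readCheckpoint() throws IOException {
        java.nio.file.Path p = Paths.get(CHECKPOINT);
        if (!Files.exists(p)) {
            return 0L;
        }
        List<String> lines = Files.readAllLines(p);
        return lines.isEmpty() ? 0L : Long.parseLong(lines.get(0).trim());
    }

    private static void writeCheckpoint(long offset) throws IOException {
        Files.write(Paths.get(CHECKPOINT), String.valueOf(offset).getBytes(StandardCharsets.UTF_8));
    }
}
```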