Input set: thousands(>10000) of csv files, each containing >50000 entries. output: Store those data in mysql db.
Approach taken: Read each file and store the data into database. Below is the code snippet for the same. Please suggest if this approach is ok or not.
PreparedStatement pstmt2 = null;
try
{
pstmt1 = con.prepareStatement(sqlQuery);
result = pstmt1.executeUpdate();
con.setAutoCommit(false);
sqlQuery = "insert into "
+ tableName
+ " (x,y,z,a,b,c) values(?,?,?,?,?,?)";
pstmt2 = con.prepareStatement(sqlQuery);
Path file = Paths.get(filename);
lines = Files.lines(file, StandardCharsets.UTF_8);
final int batchsz = 5000;
for (String line : (Iterable<String>) lines::iterator) {
pstmt2.setString(1, "somevalue");
pstmt2.setString(2, "somevalue");
pstmt2.setString(3, "somevalue");
pstmt2.setString(4, "somevalue");
pstmt2.setString(5, "somevalue");
pstmt2.setString(6, "somevalue");
pstmt2.addBatch();
if (++linecnt % batchsz == 0) {
pstmt2.executeBatch();
}
}
int batchResult[] = pstmt2.executeBatch();
pstmt2.close();
con.commit();
} catch (BatchUpdateException e) {
log.error(Utility.dumpExceptionMessage(e));
} catch (IOException ioe) {
log.error(Utility.dumpExceptionMessage(ioe));
} catch (SQLException e) {
log.error(Utility.dumpExceptionMessage(e));
} finally {
lines.close();
try {
pstmt1.close();
pstmt2.close();
} catch (SQLException e) {
Utility.dumpExceptionMessage(e);
}
}
As @Ridrigo has already pointed out, LOAD DATA INFILE is the way to go. Java is not really needed at all.
If the format of your CSV is not something that can directly be inserted into the database, your Java code can renter the picture. Use it to reorganize/transform the CSV and save it as another CSV file instead of writing it into the database.
You can also use the Java code to iterate through the folder that contains the CSV, and then execute the system command for the
Runtime r = Runtime.getRuntime();
Process p = r.exec("mysql -p password -u user database -e 'LOAD DATA INFILE ....");
you will find that this is much much faster than running individual sql queries for each row of the CSV file.