Tags: python, csv, etl, sql-loader, smooks

Choice of technology for loading large CSV files to Oracle tables


I have come across a problem and am not sure which technology would be best suited to implement it. I would be obliged if you could suggest something based on your experience. I want to load data from 10-15 CSV files, each of them fairly large at 5-10 GB. By "load data" I mean convert each CSV file to XML and then populate around 6-7 staging tables in Oracle using this XML. The data needs to be populated such that the elements of the XML, and eventually the rows of the tables, come from multiple CSV files. So, for example, an element A would have sub-elements whose data comes from CSV file 1, file 2, file 3, etc.

I have a framework built on top of Apache Camel and JBoss on Linux. Oracle 10g is the database server. The options I am considering are:

  1. Smooks - However, the problem is that Smooks serializes one CSV at a time, and I can't afford to hold on to the half-baked Java beans until the other CSV files are read, since I run the risk of running out of memory given the sheer number of beans I would need to create and hold on to before they are fully populated and written to disk as XML.
  2. SQL*Loader - I could skip the XML creation altogether and load the CSVs directly into the staging tables using SQL*Loader. But I am not sure whether I can (a) load multiple CSV files with SQL*Loader into the same tables, updating the records after the first file, and (b) apply some translation rules while loading the staging tables. (A rough control-file sketch of what I have in mind follows this list.)
  3. Python script to convert the CSV to XML.
  4. SQL*Loader to load a different set of staging tables corresponding to the CSV data, and then stored procedures to load the actual staging tables from this new set (a path I want to avoid, given the amount of change it would require in my existing framework).
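
For reference, and if I understand the SQL*Loader documentation correctly, a single control file can name several input files and append into an existing table, and simple translations can be written as SQL expressions on the columns. A rough sketch of what such a control file might look like is below (file, table, and column names are made up; also note that conventional SQL*Loader only inserts rows via APPEND/INSERT/REPLACE/TRUNCATE, it does not update or merge records that are already loaded):

    -- orders.ctl (hypothetical names for illustration only)
    LOAD DATA
    INFILE 'file1.csv'
    INFILE 'file2.csv'
    INFILE 'file3.csv'
    APPEND
    INTO TABLE stg_orders
    FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
    TRAILING NULLCOLS
    (
      order_id,
      customer_name  "UPPER(:customer_name)",      -- simple translation rule
      order_date     DATE "YYYY-MM-DD",
      amount         "TO_NUMBER(:amount)"
    )

This would presumably be run with something like sqlldr userid=... control=orders.ctl direct=true, with direct-path loading helping at the 5-10 GB sizes involved.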

Thanks in advance. If someone can point me in the right direction or share some insights from their personal experience, it will help me make an informed decision.

regards, -v-

PS: The CSV files are fairly simple, with around 40 columns each. The depth of objects, or the relationships between the files, is around 2 to 3 levels.


Solution

  • Unless you can use a full-blown ETL tool (e.g. Informatica PowerCenter, Pentaho Data Integration), I suggest the 4th solution - it is straightforward and the performance should be good, since Oracle will handle the most complicated part of the task, as sketched below.
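
To make the 4th option concrete, here is a minimal sketch of the second stage, assuming hypothetical per-file staging tables stg_csv1, stg_csv2 and stg_csv3 (each loaded by SQL*Loader) that share a common id, merged into the real staging table stg_element_a by a stored procedure with one set-based insert:

    CREATE OR REPLACE PROCEDURE load_stg_element_a AS
    BEGIN
      -- One set-based, direct-path insert: Oracle performs the join and any
      -- translation logic, so nothing is held in application memory.
      INSERT /*+ APPEND */ INTO stg_element_a (id, col_from_csv1, col_from_csv2, col_from_csv3)
      SELECT c1.id,
             c1.some_col,
             c2.other_col,
             c3.third_col
      FROM   stg_csv1 c1
      JOIN   stg_csv2 c2 ON c2.id = c1.id
      JOIN   stg_csv3 c3 ON c3.id = c1.id;
      COMMIT;
    END load_stg_element_a;
    /

Keeping the merge inside the database as an INSERT ... SELECT is what makes this option scale to multi-GB inputs; the existing Camel route would only need to invoke sqlldr and then call the procedure.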