I am creating a new application which uses ZODB and I need to import legacy data, mainly from a postgres database but also from some csv files. Only a limited amount of manipulation of the data is needed (SQL joins to merge linked tables and create properties, renaming some properties, dealing with empty columns, etc.).
With a subset of the postgres data I dumped all the relevant tables to csv files, read these into pandas dataframes and did the manipulation. This works, but there are errors, partly due to transferring the data into csv first.
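For reference, my current approach looks roughly like this (file names, separator and column names are just placeholders):

import pandas as pd

# read the csv dumps, being explicit about separator and encoding
customers = pd.read_csv('customers.csv', sep='|', encoding='utf-8')
orders = pd.read_csv('orders.csv', sep='|', encoding='utf-8')

# merge linked tables (the equivalent of an SQL join), rename properties,
# and drop columns that are entirely empty
merged = customers.merge(orders, on='customer_id', how='left')
merged = merged.rename(columns={'cust_name': 'name'})
merged = merged.dropna(axis=1, how='all')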
I now want to load all of the data in (and get rid of the errors). I am wondering if it makes sense to connect directly to the database and use read_sql or to carry on using the csv files.
The largest table (csv file) is only 8MB, so I shouldn't have memory issues, I hope. Most of the errors are to do with encoding and/or choice of separator (the data contains |, ;, : and ').
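If I were to go the direct route, I assume it would be something along these lines (connection string, table names and queries are made up):

import pandas as pd
from sqlalchemy import create_engine

# reading straight from postgres avoids the separator/encoding guesswork;
# needs a postgres driver such as psycopg2 installed
engine = create_engine('postgresql://localhost/mydb')
customers = pd.read_sql('SELECT * FROM customers', engine)
orders = pd.read_sql('SELECT * FROM orders', engine)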
Any advice? I have also read about something called Blaze and wonder if I should actually be using that.
If your CSV files aren't very large (as you say) then I'd try loading everything into postgres with odo, then using blaze to perform the operations, and finally dumping to a format that ZODB can understand. I wouldn't worry about the performance of operations like join inside the database versus in memory at the scale you're talking about.
Here's some example code:
from blaze import odo, Data, join

# `csvs` is a list of CSV file paths, `tablenames` the matching table names
for csv, tablename in zip(csvs, tablenames):
    odo(csv, 'postgresql://localhost/db::%s' % tablename)

db = Data('postgresql://localhost/db')

# see the blaze documentation for more operations
expr = join(db.table1, db.table2, 'column_to_join_on')

# execute `expr` and dump the result to a CSV file for loading into ZODB
odo(expr, 'joined.csv')
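For the last step, a minimal sketch of loading joined.csv into ZODB might look like this (the storage path, the 'id' key column and the idea of storing plain row dicts are assumptions about your application):

import csv
import transaction
from ZODB import FileStorage, DB
from BTrees.OOBTree import OOBTree

storage = FileStorage.FileStorage('Data.fs')
db = DB(storage)
connection = db.open()
root = connection.root()

# keep each row keyed by its primary key in a BTree at the root;
# in a real application you'd probably map rows onto Persistent classes
root['records'] = records = OOBTree()
with open('joined.csv') as f:
    for row in csv.DictReader(f):
        records[row['id']] = dict(row)

transaction.commit()
connection.close()
db.close()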