I have a database of person records that I need to load into memory, since they will be accessed many times in a variety of orders. Up until now I've just instantiated one Python object per record. But now that I have 8,000,000 records to work with, I don't have enough memory for this straightforward approach.
In a flat file, each record only takes up at most 500 bytes, without compression (much less with compression). So the whole dataset is less than 4 GB on disk. Once each record is loaded by Python as an object, however, I estimate 40 GB of RAM will be used. My machine only has 12 GB of RAM.
I'm considering integrating C with my Python program, and storing each record as a struct in C. Does this sound like a good solution? Or is there a better way to store records compactly in Python, that doesn't require interfacing with C?
Update: The database I'm using is HBase (http://hbase.apache.org/), running on Hadoop. The connection to Python happens through Thrift (http://thrift.apache.org/).
Update 2: I need to access all the records in the database in many different orders, and these orders are determined at run time. I guess, at every iteration, I could make 8,000,000 queries to the database, but I think this is likely to be quite slow.
Update 3: I don't think there's a good way to store the rows so that they can be accessed sequentially. The order in which I need the records in the next iteration (my program is an iterative machine learning algorithm) is determined by a linear algebra projection onto a particular eigenvector of the data matrix during the previous iteration.
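It sounds like numpy structured arrays could work well here. They use a lot less memory than Python objects, because each record is stored as a fixed-size block of bytes rather than as a separate object, and numpy provides many fast and convenient operations on them. Additionally, the arrays can be backed by memory-mapped files (numpy.memmap), which is useful when the data does not fit comfortably in RAM.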
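Here is a minimal sketch of what that might look like. The field names and widths in person_dtype are made up for illustration (your real record layout will differ), the random scores stand in for the projection you compute at each iteration, and the file name people.dat is arbitrary.

```python
import os
import tempfile

import numpy as np

# Hypothetical record layout; the real fields and widths depend on your data.
person_dtype = np.dtype([
    ("id", np.int64),
    ("age", np.uint8),
    ("height_cm", np.float32),
    ("name", "S32"),  # fixed-width byte string, at most 32 bytes
])

n_records = 1000  # small stand-in for the 8,000,000 records
records = np.zeros(n_records, dtype=person_dtype)
records["id"] = np.arange(n_records)
records["age"] = np.arange(n_records) % 90
records["height_cm"] = 150.0 + (np.arange(n_records) % 50)
records["name"] = b"example"

# Every record occupies the same fixed number of bytes (packed layout),
# with no per-object overhead.
bytes_per_record = person_dtype.itemsize  # 8 + 1 + 4 + 32 = 45
total_bytes = records.nbytes

# Visit the records in an order decided at run time. The random scores
# are a placeholder for a projection onto an eigenvector.
rng = np.random.default_rng(0)
scores = rng.standard_normal(n_records)
order = np.argsort(scores)
reordered = records[order]  # copy of the records in the new order

# Optionally keep the data in a memory-mapped file instead of in RAM.
path = os.path.join(tempfile.mkdtemp(), "people.dat")
mm = np.memmap(path, dtype=person_dtype, mode="w+", shape=(n_records,))
mm[:] = records
mm.flush()

# Reopen read-only and fetch a single record by index.
mm_ro = np.memmap(path, dtype=person_dtype, mode="r", shape=(n_records,))
first_in_order = mm_ro[order[0]]
```

With a layout like this one, 8,000,000 records would take roughly 360 MB. Even at the full 500 bytes per record the array would be about 4 GB, which should fit in 12 GB of RAM. Note that records[order] makes a full copy; if that is too expensive, you can loop over order and index one record (or one chunk) at a time.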
Whether or not a database is a good option (as others suggest) depends on your algorithm as well as data sizes. There are many cases where numpy is a better solution (less work, more efficient, etc.).