Tags: python, python-2.7, simulation, database-performance

For a simulation system, which data structure is most suitable?


I am in the planning phase of building a simulation and need ideas on how to represent data, based on memory and speed considerations.

At each time-step, the simulation process creates 10^3 to 10^4 new data records, then looks at each new or existing record (there are 10^6 to 10^8 of them) and either deletes or modifies it.

Each record has 3-10 simple fields, each either an integer or a string of several ASCII characters. In addition, each record has 1-5 other fields, each a variable-length list containing integers. A typical record weighs 100-500 bytes.

The modify-or-delete process works like this: for the current record, compute a function whose arguments are the values of some of its fields and the values of the same fields of another record. Depending on the result, the process prepares to delete the record or to modify its fields in some way.

Then repeat for each other record. Then move to the next record and repeat. When all records have been processed, the simulation is ready to move to the next time-step.

Just before moving to the next time-step, apply all the deletions and modifications as prepared.
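In case it helps pin down the description, here is a rough sketch of that two-pass loop; the `interact` function, the record layout, and all names are placeholders, not my real code:

```python
def step(records, interact):
    """One time-step over a dict mapping id -> record.

    interact(rec, other) is a stand-in for the pairwise function described
    above; it returns ('delete', None), ('modify', changes), or (None, None).
    """
    to_delete = set()
    to_modify = {}

    # First pass: only *prepare* deletions and modifications.
    for rid, rec in records.items():
        for other_id, other in records.items():
            if other_id == rid:
                continue
            action, changes = interact(rec, other)
            if action == 'delete':
                to_delete.add(rid)
            elif action == 'modify':
                to_modify.setdefault(rid, {}).update(changes)

    # Second pass, just before the next time-step: apply everything at once.
    for rid in to_delete:
        records.pop(rid, None)
    for rid, changes in to_modify.items():
        if rid in records:
            records[rid].update(changes)
    return records
```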

The more records allowed, the better the simulation. Keeping everything in RAM presumably helps speed, but caps the simulation's size. The simulation doesn't need to run in real time, but obviously I don't want it too slow.

To represent each record in memory, I know of these options: a list or dict (with some lists nested inside), or a class instance. To store away all the records and continue the simulation another day, the options, in order of decreasing familiarity to me, are: a CSV file where each line is a record; keeping all records in RAM and then writing them to a file (perhaps using pickle); or some sort of database.

I've learned Python basics plus some concepts like generators, but I haven't learned databases, haven't tried pickling, and obviously need to learn more. If possible, I'd avoid multiple computers because I have only one, and concurrency because it looks too scary.

What would you advise about how to represent records in memory, and about how to store away the simulated system?


Solution

  • If we take your worst case, 10**8 records at 500 bytes per record, that is about 50 GB, more RAM than most single machines have, so it's worth designing in some flexibility and assuming not all records will always be resident in RAM. You could write an abstraction class that hides the details of where the records live.

    class Record(object):
        def __init__(self, x, y, z):
            pass # store the record's fields here
        def get_record(self, id):
            pass # fetch the record from wherever it lives
    

    Instead of the name get_record() you could implement __getitem__(), and then your class will act like a list while actually going out to a database, or a RAM cache, or whatever, behind the scenes. Just use integers as the ID values. Then if you change your mind about the persistence store (switch from database to pickle or whatever) the calling code won't change.
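As an illustration (the class and attribute names here are made up), such a store might look like this, with a RAM cache in front of whatever backend you pick later:

```python
class RecordStore(object):
    """Illustrative store: acts like a list via __getitem__, but could be
    backed by a RAM cache, a pickle file, or a database."""

    def __init__(self, backend=None):
        self._cache = {}         # integer ID -> record
        self._backend = backend  # placeholder for a pickle/database layer

    def __getitem__(self, rec_id):
        try:
            return self._cache[rec_id]
        except KeyError:
            if self._backend is None:
                raise
            rec = self._backend[rec_id]  # fetch from the backend on a miss
            self._cache[rec_id] = rec
            return rec

    def __setitem__(self, rec_id, record):
        self._cache[rec_id] = record
```

The simulation loop then just says store[some_id], and swapping the backend changes nothing in the calling code.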

    You could also try just making a really huge swap file and letting the virtual-memory system shuffle records in and out of actual RAM. This is easy to try, but it offers no easy way to interrupt a calculation and save its state.

    You could represent each record as a tuple, even a named tuple. I believe a tuple would have the lowest overhead of any "container" object in Python. (A named tuple just stores the names once in one place, so it's low overhead also.)
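A quick sketch of the named-tuple option (the field names are invented for illustration):

```python
import sys
from collections import namedtuple

# Hypothetical record layout with three simple fields.
Record = namedtuple('Record', ['kind', 'weight', 'links'])

r = Record(kind='cell', weight=7, links=[4, 8, 15])

# Fields are readable by name or by position, just like a plain tuple.
assert r.kind == 'cell'
assert r[1] == 7

# The same data in a dict costs more per instance on CPython: each dict
# carries its own hash table, while a namedtuple stores the field names
# once, on the class.
d = {'kind': 'cell', 'weight': 7, 'links': [4, 8, 15]}
assert sys.getsizeof(r) < sys.getsizeof(d)
```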