
Which kind of data handling, internal or external, is commonly used?


I'm writing a program in Python3.5 that reads a data set and does some stuff (it's DICOM data if you are familiar with it). It uses:

  • Large arrays of size (512,512,141) or bigger.
  • A lot of small metadata (many single data entries).

Now my program has many different components that will all use the same set of data. My question is what is the best practice of handling this data? Do I:

  • Pass the data around through internal variables? I feel this is an inefficient use of memory/space, since each time I pass data down the line (without destroying where it came from) it seems to add unnecessary copies.
  • Use a file storage system? When I say this I refer to something such as SQL, HDF5 or even a dictionary of values inside python.
  • Read the data in every time I need it? So for instance I would have to import the library, load the file, search the file for the var, store the var and close the file again. Seems quite tedious.

Which is the best method? I'm not accessing this data once or twice but in the order of 20+ times. Is there a method I don't know about that I should be using?

Thanks in advance, I value your help (and criticism where necessary) to always improve myself as a programmer and a human being.


Solution

  • Seems like you're actually asking multiple questions here. Let me try to tease them apart:

    Should I store all my data in-memory?

    Can you? Do you have enough memory to do so comfortably? Then do it. Load it once and pass it around, or pass around some interface to the data, as necessary. How you interface with it is your choice (see below). Otherwise you have no choice but to make multiple calls to disk I/O. But I think it's generally a bad decision to read the same data from disk redundantly, since disk I/O tends to be a bottleneck resource.

    Should I pass loads of data to lots of functions? Won't this create unnecessary copies of the data?

    Python function arguments are passed "by assignment", so to speak. To use C terminology, although it's not technically precise, this is closer to pass-by-reference than pass-by-value. You usually don't notice this behavior because 1) lots of things in Python are immutable, and 2) assignment statements in Python just rebind the name to a different value. You can see the behavior with mutable objects like a list, a dict, or any object with mutable member attributes. Try passing a list to a function and modifying it inside: the modification is also visible in the calling context after the function returns.
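    A quick demonstration of the point above (the names here are just illustrative): passing a list into a function does not copy it, so mutating it inside the function affects the caller's list too.

    ```python
    def add_scan(records, scan_id):
        # Mutates the list it received; no copy is made when the
        # argument is passed in, only the name "records" is bound to it.
        records.append(scan_id)

    scans = ["scan_001"]
    add_scan(scans, "scan_002")
    print(scans)  # ['scan_001', 'scan_002'] -- the caller's list was modified
    ```

    The same applies to your large arrays: handing them to 20+ functions does not duplicate the data, it just passes references around.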

    How should I interface with my data?

    This depends on several things that I can think of. First, did you decide to store the data in memory or on disk (see first question)? Second, from where do you need to access the data? Third, if you're storing it in memory, do you need the data to persist between runs of the program?

    If you can store the data in memory, just need to access it locally, and don't need it to persist, I would go with just some sort of nested python dict, maybe making one or more custom classes to simplify the interface to the data.
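    As a minimal sketch of that idea (the `ScanStore` class and its method names are invented for illustration, not from any library): a thin class over a nested dict keeps the pixel data and the many small metadata entries behind one simple interface.

    ```python
    class ScanStore:
        """Minimal in-memory store: pixel arrays plus nested metadata per scan."""

        def __init__(self):
            # {scan_id: {"pixels": <array>, "meta": {key: value, ...}}}
            self._scans = {}

        def add(self, scan_id, pixels, **meta):
            self._scans[scan_id] = {"pixels": pixels, "meta": dict(meta)}

        def pixels(self, scan_id):
            return self._scans[scan_id]["pixels"]

        def meta(self, scan_id, key):
            return self._scans[scan_id]["meta"][key]

    store = ScanStore()
    store.add("scan_001", pixels=[[0, 1], [1, 0]], modality="CT", slices=141)
    print(store.meta("scan_001", "modality"))  # CT
    ```

    Load everything into one `ScanStore` at startup and pass that single object to the components that need it.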

    If you can store the data in memory, but either need to access it over a network or need it to persist between runs of the program, I would use redis or a similar key-value store to manage the data. redis is really easy to learn and there's good python library support.

    If you can't store the data in memory, but don't want to parse it over and over again, you should at least index it before you write it back to disk. You might come up with your own indexing scheme just using the file system if you only need local access. If this becomes too complicated or you need network access, you should probably use a database system.
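    One simple indexing scheme along those lines (the file layout and names here are an assumption, just to show the idea): write records one per line, remember each record's byte offset, and later seek straight to the record you need instead of re-parsing the whole file.

    ```python
    import json
    import os
    import tempfile

    # Sample metadata records keyed by scan id (illustrative data).
    records = {"scan_001": {"slices": 141}, "scan_002": {"slices": 160}}

    path = os.path.join(tempfile.mkdtemp(), "metadata.jsonl")

    # Write each record on its own line, noting its byte offset in an index.
    index = {}
    with open(path, "wb") as f:
        for key, value in records.items():
            index[key] = f.tell()
            f.write((json.dumps({key: value}) + "\n").encode("utf-8"))

    # Later: jump directly to one record instead of reading the whole file.
    with open(path, "rb") as f:
        f.seek(index["scan_002"])
        record = json.loads(f.readline().decode("utf-8"))
    print(record)  # {'scan_002': {'slices': 160}}
    ```

    The index itself is small enough to keep in memory (or persist alongside the data file); once it grows beyond that, a database is the better fit.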