I've been tinkering with Python 3 on and off for the past few years. As a learning exercise, I decided a couple of weeks ago to refactor a collection of bash scripts I had written. These bash scripts typically run for 5 or 6 days processing huge data files, and I expect some features of the language to speed the processing up dramatically. The Python version also significantly improves the readability and maintainability of the code.
First I got the algorithms working as a program in one file. The algorithm uses several large lookup tables, implemented variously as lists and dictionaries. Now I want to break it up: the core logic goes into one file, and a second file contains a class (or classes) holding the lookup tables and their associated functions. The data tables take about 350 lines of code, and the functions are about the same size.
Q: What is the preferred way of structuring the class module file?
For example, I started doing it this way, let's call it case 1:
class Zebra:
    _stripe_keys = [ ....... ]
    _stripe_info = [ [.....], [.....], ... [.....] ]
    _stripes = [ dict(zip( _stripe_keys, info )) for info in _stripe_info ]
    <<< many such tables >>>

    def __init__(self, name):
        self.name = name

    def function_one(self):
        do something

    def function_two(self):
        do something

    <<< etc... >>>
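One wrinkle worth knowing about case 1: in Python 3, a comprehension in a class body runs in its own scope and can only see the outermost iterable, not other class attributes, so the `dict(zip(...))` line raises `NameError` as written. A minimal sketch with made-up stripe data (the real tables are elided in the question):

```python
# In Python 3, a comprehension inside a class body cannot see other
# class-level names (only the outermost iterable is evaluated in class
# scope), so case 1 as written raises NameError.
try:
    class Zebra:
        _stripe_keys = ["width", "color"]
        _stripe_info = [[2, "black"], [3, "white"]]
        # zip(_stripe_keys, ...) fails here: _stripe_keys is not visible
        # inside the comprehension's own scope.
        _stripes = [dict(zip(_stripe_keys, info)) for info in _stripe_info]
except NameError as exc:
    print("case 1 as written fails:", exc)

# One workaround: build the table with a plain loop, which does run
# directly in the class scope.
class Zebra:
    _stripe_keys = ["width", "color"]
    _stripe_info = [[2, "black"], [3, "white"]]
    _stripes = []
    for _info in _stripe_info:
        _stripes.append(dict(zip(_stripe_keys, _info)))
    del _info  # keep the loop variable out of the class namespace

print(Zebra._stripes)
```

Cases 2 and 3 below don't hit this, because their comprehensions run at module scope.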
Then I realized this might be better, case 2:
_stripe_keys = [ ....... ]
_stripe_info = [ [.....], [.....], ... [.....] ]
_stripes = [ dict(zip( _stripe_keys, info )) for info in _stripe_info ]
<<< many such tables >>>

class Zebra:
    def __init__(self, name):
        self.name = name

    def function_one(self):
        do something

    def function_two(self):
        do something

    <<< etc... >>>
And then I saw yet another possibility, case 3, but somehow I'd have to pass the data class into the function class:
class ZebraTables:
    _stripe_keys = [ ....... ]
    _stripe_info = [ [.....], [.....], ... [.....] ]
    _stripes = [ dict(zip( _stripe_keys, info )) for info in _stripe_info ]
    <<< many such tables >>>

    def __init__(self, name):
        self.name = name

class Zebra:
    def __init__(self, name):
        self.name = name

    def function_one(self):
        do something

    def function_two(self):
        do something

    <<< etc... >>>
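Passing the data class into the function class (case 3) is ordinary dependency injection: build one `ZebraTables` and hand the same object to every `Zebra`. A minimal sketch, with hypothetical table contents standing in for the real 350 lines:

```python
# Sketch of case 3: one shared ZebraTables object is passed into each
# Zebra, so the tables exist exactly once however many Zebras there are.
class ZebraTables:
    def __init__(self):
        # hypothetical stand-ins for the real tables
        self._stripe_keys = ["width", "color"]
        self._stripe_info = [[2, "black"], [3, "white"]]
        self.stripes = [dict(zip(self._stripe_keys, info))
                        for info in self._stripe_info]

class Zebra:
    def __init__(self, name, tables):
        self.name = name
        self.tables = tables  # a reference, not a copy

    def widest_stripe(self):
        return max(s["width"] for s in self.tables.stripes)

tables = ZebraTables()             # built once at startup
marty = Zebra("Marty", tables)
zelda = Zebra("Zelda", tables)
print(marty.tables is zelda.tables)  # True: shared, not duplicated
```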
The data tables are essentially constant. If there were ever a reason to make two instances of this class, the data should be shared, not duplicated. The static data in the source code takes tens of MB of memory; combined with additional data read from disk at startup, the total comes to about 600 MB. I think this means case 2 is what I want, but I'm not certain. I come from an embedded background using primarily C, so object-oriented techniques aren't my specialty - yet!
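For the sharing concern specifically: in case 2 the tables are module-level objects, so they exist exactly once no matter how many instances you create; only attributes assigned through `self` are per-instance. A toy sketch of case 2, with made-up data:

```python
# Toy version of case 2: tables live at module level and are shared by
# every instance; only self.name is per-instance storage.
_stripe_keys = ["width", "color"]
_stripe_info = [[2, "black"], [3, "white"]]
_stripes = [dict(zip(_stripe_keys, info)) for info in _stripe_info]

class Zebra:
    def __init__(self, name):
        self.name = name          # per-instance data

    def stripe_count(self):
        return len(_stripes)      # module-level lookup, nothing copied

a = Zebra("Abe")
b = Zebra("Bea")
print(a.stripe_count(), b.stripe_count())  # 2 2
```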
Personally, I would not store large lists inside the same module as the class. What about saving them in some external format, with a separate Python module that manages them and loads them when you need them?
Depending on the size and your needs you can use pickle, pandas, csv, or directly an SQL/NoSQL DB.
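As a concrete sketch of the pickle route (file name and helper functions here are made up): build the tables once, dump them to disk, and have the table module unpickle them lazily on first use so they are loaded once and then reused.

```python
# Sketch: persist the tables with pickle and load them lazily on first
# use. The file name and helper names are hypothetical.
import os
import pickle
import tempfile

def build_tables():
    # stands in for the ~350 lines of literal table data
    stripe_keys = ["width", "color"]
    stripe_info = [[2, "black"], [3, "white"]]
    return [dict(zip(stripe_keys, info)) for info in stripe_info]

PATH = os.path.join(tempfile.gettempdir(), "zebra_tables.pkl")

# One-off step (could be a separate script): dump the built tables.
with open(PATH, "wb") as f:
    pickle.dump(build_tables(), f)

_tables = None  # module-level cache

def get_tables():
    """Unpickle the tables the first time they are needed, then reuse."""
    global _tables
    if _tables is None:
        with open(PATH, "rb") as f:
            _tables = pickle.load(f)
    return _tables

print(get_tables()[0])
```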