Search code examples
pythonlanguage-agnosticrefactoringmaintainability

Program structure in long running data processing python script


For my current job I am writing some long-running (think hours to days) scripts that do CPU intensive data-processing. The program flow is very simple - it proceeds into the main loop, completes the main loop, saves output and terminates: The basic structure of my programs tends to be like so:

<import statements>
<constant declarations>

<misc function declarations>

def main():
   for blah in blahs():
      <lots of local variables>
      <lots of tightly coupled computation>

      for something in somethings():
          <lots more local variables>
          <lots more computation>

   <etc., etc.>

   <save results>

if __name__ == "__main__":
    main()

This gets unmanageable quickly, so I want to refactor it into something more manageable. I want to make this more maintainable, without sacrificing execution speed.

Each chuck of code relies on a large number of variables however, so refactoring parts of the computation out to functions would make parameters list grow out of hand very quickly. Should I put this sort of code into a python class, and change the local variables into class variables? It doesn't make a great deal of sense tp me conceptually to turn the program into a class, as the class would never be reused, and only one instance would ever be created per instance.

What is the best practice structure for this kind of program? I am using python but the question is relatively language-agnostic, assuming a modern object-oriented language features.


Solution

  • First off, if your program is going to be running for hours/days then the overhead of switching to using classes/methods instead of putting everything in a giant main is pretty much non-existent.

    Additionally, refactoring (even if it does involve passing a lot of variables) should help you improve speed in the long run. Profiling an application which is designed well is much easier because you can pin-point the slow parts and optimize there. Maybe a new library comes along that's highly optimized for your calculations... a well designed program will let you plug it in and test right away. Or perhaps you decide to write a C Module extension to improve the speed of a subset of your calculations, a well designed application will make that easy too.

    It's hard to give concrete advice without seeing <lots of tightly coupled computation> and <lots more computation>. But, I would start with making every for block it's own method and go from there.