java class refactoring data-objects

Refactoring large data object


What are some common strategies for refactoring large "state-only" objects?

I am working on a specific soft-real-time decision support system which does online modeling/simulation of the national airspace. This piece of software consumes a number of live data feeds, and produces a once-per-minute estimate of the "state" of a large number of entities in the airspace. The problem breaks down neatly until we hit what is currently the lowest-level entity.

Our mathematical model estimates/predicts upwards of 50 parameters for a timeline of several hours into the past and future for each of these entities, roughly once per minute. Currently, these records are encoded as a single Java class with a lot of fields (some get collapsed into an ArrayList). Our model is evolving, and the dependencies among the fields are not yet set in stone, so each instance wanders through a convoluted model, accumulating settings as it goes along.

Currently we have something like the following, which uses a builder-pattern approach to build up the contents of the record and to enforce the known dependencies (as a check against programmer error as we evolve the model). Once the estimate is done, we convert the object below into an immutable form using a .build()-type method.

final class OneMinuteEstimate {

  enum EstimateState { INFANT, HEADER, INDEPENDENT, ... };
  EstimateState state = EstimateState.INFANT; 

  // "header" stuff
  DateTime estimatedAtTime = null;
  DateTime stamp = null;
  EntityId id = null;

  // independent fields
  int status1 = -1;
  ...

  // dependent/complex fields...
  ... goes on for 40+ more fields... 

  void setHeaderFields(...)
  {
     if (!EstimateState.INFANT.equals(state)) {
        throw new IllegalStateException("Must be in INFANT state to set header");
     }

     ... 
  }

}

Once a very large number of these estimates are complete, they are assembled into timelines where aggregate patterns/trends are analyzed. We have looked at using an embedded database but have struggled with performance issues; we'd rather get this sorted out in terms of data modeling and then incrementally move portions of the soft-real-time code into an embedded data store.

Once the "time sensitive" pieces of this are done, the products are flushed to flat files and a database.

Problems:

  • It's a giant class, with way too many fields.
  • There is very little behavior encoded in the class; it's mostly a holder for data fields.
  • Maintaining the build() method is extremely cumbersome.
  • It feels clumsy to manually maintain a "state machine" abstraction merely for the purpose of ensuring that a large number of dependent modeling components are properly populating a data object, but it has saved us a lot of frustration as the model evolves.
  • There is a lot of duplication, particularly when the records described above are aggregated into very similar "rollups" which amount to rolling sums/averages or other statistical products of the above structure in time series.
  • While some of the fields could be clumped together, they are all logically "peers" of one another, and any breakdown we've tried has resulted in having behavior/logic artificially split and needing to reach two levels deep in indirection.

Out-of-the-box ideas are welcome, but this is something we need to evolve incrementally. Before anyone else says it, I'll note that one could suggest that our mathematical model is insufficiently crisp if the data representation for that model is this hard to get hold of. Fair point, and we're working on that, but I think that's a side-effect of an R&D environment with a lot of contributors, and a lot of concurrent hypotheses in play.

(Not that it matters, but this is implemented in Java. We use HSQLDB or Postgres for output products. We don't use any persistence framework, partly out of a lack of familiarity, partly because we have enough performance trouble with just the database alone and hand-coded storage routines... we're skeptical of moving towards additional abstraction.)


Solution

  • I had much of the same problem you did.

    At least I think I did, sounds like I did. Representation was different, but at 10,000 feet, sounds pretty much the same. Crapload of discrete, "arbitrary" variables and a bunch of ad hoc relationships among them (essentially business driven), subject to change at a moment's notice.

    You also have another issue, which you sorta mentioned, and that was the performance requirement. Sounds like faster is better, and likely a slow perfect solution would be tossed out for the fast lousy one, simply because the slower one can't meet a baseline performance requirement, no matter how good it is.

    To put it simply, what I did was I designed a simple domain specific rule language for my system.

    The entire point of the DSL was to implicitly express relationships and package them up into modules.

    Very crude, contrived example:

    D = 7
    C = A + B
    B = A / 5
    A = 10
    RULE 1: IF (C < 10) ALERT "C is less than 10"
    RULE 2: IF (C > 5) ALERT "C is greater than 5"
    RULE 3: IF (D > 10) ALERT "D is greater than 10"
    MODULE 1: RULE 1
    MODULE 2: RULE 3
    MODULE 3: RULE 1, RULE 2
    

    First, this is not representative of my syntax.

    But you can see from the modules that it boils down to three simple rules.

    The key though, is that it's obvious from this that Rule 1 depends on C, which depends on A and B, and B depends on A. Those relationships are implied.

    So, for that module, all of those dependencies "come with it". You can see if I generated code for Module 1 it might look something like:

    public void module_1() {
        int a = 10;
        int b = a / 5;
        int c = a + b;
        if (c < 10) {
            alert("C is less than 10");
        }
    }
    

    Whereas if I created Module 2, all I would get is:

    public void module_2() {
        int d = 7;
        if (d > 10) {
            alert("D is greater than 10");
        }
    }
    

    In Module 3 you see the "free" reuse:

    public void module_3() {
        int a = 10;
        int b = a / 5;
        int c = a + b;
        if (c < 10) {
            alert("C is less than 10");
        }
        if (c > 5) {
            alert("C is greater than 5");
        }
    }
    

    So, even though I have one "soup" of rules, the Modules root the base of the dependencies, and thus filter out the stuff it doesn't care about. Grab a module, shake the tree and keep what's left hanging.

    My system used the DSL to generate source code, but you can easily have it create a mini runtime interpreter as well.
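
    A minimal sketch of what a small runtime evaluator could look like, as opposed to generated source. Everything here is an assumption for illustration: the class name MiniInterpreter, the use of Java lambdas in place of parsed rule text, and the int-only values. It evaluates a variable on demand and caches the result, so only the variables a rule actually reaches are ever computed. It does not detect circular definitions.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.ToIntFunction;

// Hypothetical on-demand evaluator: formulas are registered by name and
// evaluated lazily, pulling in their dependencies as they are read.
final class MiniInterpreter {
    private final Map<String, ToIntFunction<MiniInterpreter>> definitions = new HashMap<>();
    private final Map<String, Integer> cache = new HashMap<>();

    // Registers or replaces a formula; cached values may be stale, so drop them.
    void define(String name, ToIntFunction<MiniInterpreter> formula) {
        definitions.put(name, formula);
        cache.clear();
    }

    // Returns the value of a variable, computing it (and its inputs) if needed.
    int get(String name) {
        Integer cached = cache.get(name);
        if (cached != null) {
            return cached;
        }
        ToIntFunction<MiniInterpreter> formula = definitions.get(name);
        if (formula == null) {
            throw new IllegalArgumentException("Undefined variable " + name);
        }
        int value = formula.applyAsInt(this);
        cache.put(name, value);
        return value;
    }

    public static void main(String[] args) {
        MiniInterpreter m = new MiniInterpreter();
        // Same contrived rule base as above; definition order does not matter.
        m.define("D", x -> 7);
        m.define("C", x -> x.get("A") + x.get("B"));
        m.define("B", x -> x.get("A") / 5);
        m.define("A", x -> 10);
        System.out.println(m.get("C")); // prints 12
    }
}
```

    The trade-off against code generation is a map lookup per variable read at runtime, in exchange for being able to change a formula without regenerating and recompiling anything.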

    Simple topological sorting handled the dependency graph for me.
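
    A minimal sketch of that sorting step, assuming the rule base has already been parsed into a map from each name to the names its definition reads. The class DependencySorter and the map layout are illustrative assumptions, not the original system's code. Starting from a module's rules, a depth-first walk emits each dependency before the thing that uses it, and anything the module never reaches is simply not emitted.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper: orders the definitions a module needs so that every
// name appears after the names it depends on.
final class DependencySorter {

    // deps maps a name to the names its definition reads,
    // e.g. "C" -> ["A", "B"] for C = A + B. Rules are treated like variables.
    static List<String> orderFor(List<String> roots, Map<String, List<String>> deps) {
        List<String> ordered = new ArrayList<>();
        Set<String> done = new HashSet<>();
        Set<String> inProgress = new HashSet<>();
        for (String root : roots) {
            visit(root, deps, done, inProgress, ordered);
        }
        return ordered;
    }

    // Depth-first post-order walk: dependencies are appended before dependents.
    private static void visit(String name, Map<String, List<String>> deps,
                              Set<String> done, Set<String> inProgress,
                              List<String> ordered) {
        if (done.contains(name)) {
            return;
        }
        if (!inProgress.add(name)) {
            throw new IllegalStateException("Circular dependency involving " + name);
        }
        for (String dep : deps.getOrDefault(name, List.of())) {
            visit(dep, deps, done, inProgress, ordered);
        }
        inProgress.remove(name);
        done.add(name);
        ordered.add(name);
    }

    public static void main(String[] args) {
        Map<String, List<String>> deps = Map.of(
                "A", List.of(),
                "B", List.of("A"),
                "C", List.of("A", "B"),
                "D", List.of(),
                "RULE 1", List.of("C"),
                "RULE 2", List.of("C"),
                "RULE 3", List.of("D"));
        System.out.println(orderFor(List.of("RULE 1"), deps));           // Module 1
        System.out.println(orderFor(List.of("RULE 3"), deps));           // Module 2
        System.out.println(orderFor(List.of("RULE 1", "RULE 2"), deps)); // Module 3
    }
}
```

    For Module 1 this yields A, B, C, RULE 1, which is the same order as the statements in the generated module_1() above; D never shows up because nothing in that module reaches it.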

    So, the nice thing about this is that while there was inevitable duplication in the final, generated logic, at least across modules, there wasn't any duplication in the rule base. What you as a developer/knowledge worker maintain is the rule base.

    What is also nice is that you can change an equation and not worry so much about the side effects. For example, if I change the definition of C to C = A / 2, then, suddenly, B drops out completely. But the rule for IF (C < 10) doesn't change at all.

    With a few simple tools, you can show the entire dependency graph, you can find orphaned variables (like B), etc.
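
    One such tool can be sketched in a few lines, assuming the same kind of dependency map (name to the names it reads). The class OrphanFinder and its signature are hypothetical. It marks everything reachable from the rules and reports the variables that were never reached, using the rule base after C has been changed to C = A / 2.

```java
import java.util.ArrayDeque;
import java.util.Collection;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical tool: finds variables that no rule uses, directly or indirectly.
final class OrphanFinder {

    static Set<String> orphans(Set<String> variables, Collection<String> rules,
                               Map<String, List<String>> deps) {
        // Walk the graph from every rule and record each name that is reached.
        Set<String> reachable = new HashSet<>();
        Deque<String> pending = new ArrayDeque<>(rules);
        while (!pending.isEmpty()) {
            String name = pending.pop();
            if (reachable.add(name)) {
                pending.addAll(deps.getOrDefault(name, List.of()));
            }
        }
        // Whatever was declared but never reached is an orphan (sorted for display).
        Set<String> result = new TreeSet<>(variables);
        result.removeAll(reachable);
        return result;
    }

    public static void main(String[] args) {
        // Rule base after changing C = A + B to C = A / 2.
        Map<String, List<String>> deps = Map.of(
                "A", List.of(),
                "B", List.of("A"),
                "C", List.of("A"),
                "D", List.of(),
                "RULE 1", List.of("C"),
                "RULE 2", List.of("C"),
                "RULE 3", List.of("D"));
        Set<String> variables = Set.of("A", "B", "C", "D");
        List<String> rules = List.of("RULE 1", "RULE 2", "RULE 3");
        System.out.println(orphans(variables, rules, deps)); // prints [B]
    }
}
```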

    Because the output is ordinary generated source code, it gets compiled like any hand-written code and there is no rule-engine or interpreter overhead at runtime.

    In my case, it was interesting to see a rule drop a single variable and see 500 lines of source code vanish from the resulting module. That's 500 lines I didn't have to crawl through by hand and remove during maintenance and development. All I had to do was change a single rule in my rule base and let "magic" happen.

    I was even able to do some simple peephole optimization and eliminate variables.

    It's not that hard to do. Your rule language can be XML, or a simple expression parser. No reason to go full boat Yacc or ANTLR on it if you don't want to. I'll put a plug in for S-Expressions, no grammar needed, brain dead parsing.
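
    To give a sense of how little parsing S-expressions need, here is a minimal reader sketch. The class name SExpr and the rule spelling (define C (+ A B)) are made up for illustration, and it assumes atoms contain no whitespace or parentheses (so no quoted strings). Atoms come back as String and lists as nested List<Object>.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical minimal S-expression reader.
final class SExpr {

    // Parses exactly one expression from the source text.
    static Object parse(String source) {
        List<String> tokens = tokenize(source);
        int[] pos = {0};
        Object result = read(tokens, pos);
        if (pos[0] != tokens.size()) {
            throw new IllegalArgumentException("Unexpected trailing input");
        }
        return result;
    }

    // Pads parentheses with spaces, then splits on whitespace.
    private static List<String> tokenize(String source) {
        List<String> tokens = new ArrayList<>();
        String spaced = source.replace("(", " ( ").replace(")", " ) ");
        for (String token : spaced.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    // Recursive descent: "(" starts a list, anything else is an atom.
    private static Object read(List<String> tokens, int[] pos) {
        if (pos[0] >= tokens.size()) {
            throw new IllegalArgumentException("Unexpected end of input");
        }
        String token = tokens.get(pos[0]++);
        if (token.equals(")")) {
            throw new IllegalArgumentException("Unexpected )");
        }
        if (!token.equals("(")) {
            return token;
        }
        List<Object> list = new ArrayList<>();
        while (pos[0] < tokens.size() && !tokens.get(pos[0]).equals(")")) {
            list.add(read(tokens, pos));
        }
        if (pos[0] >= tokens.size()) {
            throw new IllegalArgumentException("Missing )");
        }
        pos[0]++; // consume the closing ")"
        return list;
    }

    public static void main(String[] args) {
        System.out.println(parse("(define C (+ A B))")); // prints [define, C, [+, A, B]]
    }
}
```

    The nested lists fall straight out of the reader, so collecting the names a formula reads is just a walk over the list.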

    Spreadsheets also make a great input tool, actually. Just be strict on the formatting. Kind of sucks for merging in SVN (so, Don't Do That), but end users love it.

    You may well be able to get away with an actual rule based system. My system wasn't dynamic at runtime, and didn't really need sophisticated goal seeking and inference, so I didn't need the overhead of such a system. But if one works for you out of the box, then happy day.

    Oh, and as an implementation note: for those who don't believe you can hit the 64 KB bytecode limit on a single Java method, well, I can assure you it can be done :).