
Train & score regressions many times with different Dev-Val splits


I created a project that extracts a dataset, splits it into Dev and Val sets, and then uses these for some 12 alternative model candidates: each candidate is trained on Dev, and performance statistics are calculated on both the scored Dev and Val datasets.

Now I want to run this process flow many times with different splits, so I can see how stable each model is. I split the data by adding a ranuni(seed) column.

What I would like to do is run a loop that does the split with a different seed each time and then executes the process flow that trains the model and scores both the Dev and Val datasets.

Is there any way to do such a loop in EG? Or do I need to create Stored Processes so I can execute them from SAS Base code? The problem I run into is that the SAS Base code where I can loop runs on the server and has no 'knowledge' of the EG project on the client.

Has anyone tried this before? Any tips would be welcome.


Solution

  • It sounds like you want to do something that's not too far from bootstrap simulation. Not identical, but the concept is there: repeatedly re-sample the same data.

    As such, I'd do it with the same technique. (Note, I'm not commenting on whether you should actually do what you're asking. There are a lot of reasons not to: regression and machine learning maxim #1 is never to mix the training, test, and validation sets, because you end up with false confidence.)

    PROC SURVEYSELECT is pretty good at doing this. The seminal paper on the technique is "Don't Be Loopy" by David Cassell, and the approach goes like this.

    Let's say you are running your regression on sashelp.class. You are randomizing without regard to any stratification.

    proc surveyselect data=sashelp.class
      out=class_sample
      seed=12345
      method=SRS /* This is different from bootstrapping: without replacement */
      n=15 /* you want 15 in your training set and 4 in your validation set */
      rep=100 /* you want 100 replicates, ie, you want to test it 100 times */
      outall /* this outputs all 19 to the dataset */
    ;
    run;
    

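    That makes a dataset of 19*100 = 1,900 rows: 100 replicates, each containing all 19 observations. The Replicate variable numbers the replicates, and within each one Selected=1 marks a training set row and Selected=0 a validation set row.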
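
    A quick way to confirm that layout is to count rows per replicate and split. This is only a sanity-check sketch; it assumes the class_sample dataset from the step above and relies on the Replicate and Selected variables that PROC SURVEYSELECT writes when rep= and outall are used.

    ```sas
    /* Cross-tabulate replicate number against the selection flag.      */
    /* With n=15 on the 19-row sashelp.class, each replicate should     */
    /* show 15 rows with Selected=1 and 4 rows with Selected=0.         */
    proc freq data=class_sample;
      tables Replicate*Selected / norow nocol nopercent;
    run;
    ```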

    Now, when you do your regression and your analysis, you just add by replicate; to everything you do. This runs faster than the alternative (a macro loop or something like that) and takes very little code.
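
    As a rough sketch of what that BY-group processing could look like: the model below (weight on height and age) and the dataset names class_fit, class_scored, class_err, est_by_rep and mse_by_rep are made up for illustration and are not from the question. It uses the Replicate and Selected variables that PROC SURVEYSELECT creates with outall, and one common trick: the response is set to missing on validation rows, so PROC REG leaves them out of the fit but still writes predictions for them.

    ```sas
    /* Copy the response only for training rows (Selected=1).           */
    /* weight_dev stays missing on validation rows, so they are         */
    /* excluded from estimation but still scored by the OUTPUT statement. */
    data class_fit;
      set class_sample;
      if Selected = 1 then weight_dev = weight;
    run;

    /* One regression per replicate; the SURVEYSELECT output is already */
    /* ordered by Replicate, so no sort is needed before the BY.        */
    proc reg data=class_fit noprint outest=est_by_rep;
      by Replicate;
      model weight_dev = height age;   /* hypothetical model */
      output out=class_scored p=pred;
    run;
    quit;

    /* Squared error against the true response for every row. */
    data class_err;
      set class_scored;
      sq_err = (weight - pred)**2;
    run;

    /* Mean squared error per replicate, split into Dev (Selected=1)    */
    /* and Val (Selected=0).                                            */
    proc means data=class_err noprint nway;
      class Replicate Selected;
      var sq_err;
      output out=mse_by_rep mean=mse;
    run;
    ```

    The spread of mse across replicates in mse_by_rep, compared between Selected=1 and Selected=0, gives a picture of how stable the model is across splits. For the 12 candidate models, the same pattern would be repeated with a different model statement each time.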

    You should be able to integrate this into your process flow without much change, and it shouldn't matter that you're using EG. This primarily just changes the initial input data step.