My document contains the following method to normalize my SmallShuffledTrainingData CSV with Encog.
static void Normalize()
{
    Console.WriteLine("Normalizing...");

    // Analyze the source CSV and build a normalization script.
    var analyst = new EncogAnalyst();
    var wizard = new AnalystWizard(analyst);
    wizard.Wizard(SmallShuffledTrainingData, true, AnalystFileFormat.DecpntComma);

    // Pass the customer id column through without normalizing it.
    analyst.Script.Normalize.NormalizedFields[0].Action =
        Encog.Util.Arrayutil.NormalizationAction.PassThrough;

    // Apply the script to produce the normalized CSV.
    var norm = new AnalystNormalizeCSV();
    norm.Analyze(SmallShuffledTrainingData, true, CSVFormat.English, analyst);
    norm.ProduceOutputHeaders = true;
    norm.Normalize(SmallShuffledTrainingDataNormalized);

    analyst.Save(AnalystFile);
}
Since the process takes such a long time, I am only trying to normalize a single column. My document has 332k rows and 25 columns.
Is there any way to speed up the normalization process other than breaking the file down into smaller and smaller documents? And if I do break it down, how could I combine the pieces back into one document, given that normalization needs to see all of the records to find the highest and lowest values of a given column?
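The only workaround I can think of is a manual two-pass approach outside of Encog: read every chunk once to collect the global minimum and maximum of each column, then rescale each chunk with those shared bounds and write the results into a single file. Here is a rough sketch of that idea. This is plain CSV handling, not Encog's API; the method and parameter names are made up, and it assumes comma-separated, purely numeric columns after a header row, scaling everything to the range 0 to 1.

using System;
using System.Globalization;
using System.IO;
using System.Linq;

// Hypothetical two-pass min/max normalization across several CSV chunks.
static void NormalizeChunks(string[] chunkPaths, string outputPath)
{
    double[] min = null, max = null;
    string header = null;

    // Pass 1: find each column's global min and max across all chunks.
    foreach (var path in chunkPaths)
    {
        using (var reader = new StreamReader(path))
        {
            header = reader.ReadLine();
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                var values = line.Split(',')
                    .Select(s => double.Parse(s, CultureInfo.InvariantCulture))
                    .ToArray();
                if (min == null)
                {
                    min = (double[])values.Clone();
                    max = (double[])values.Clone();
                    continue;
                }
                for (int i = 0; i < values.Length; i++)
                {
                    if (values[i] < min[i]) min[i] = values[i];
                    if (values[i] > max[i]) max[i] = values[i];
                }
            }
        }
    }

    // Pass 2: rescale every chunk with the shared bounds into one output file.
    using (var writer = new StreamWriter(outputPath))
    {
        writer.WriteLine(header);
        foreach (var path in chunkPaths)
        {
            using (var reader = new StreamReader(path))
            {
                reader.ReadLine(); // skip each chunk's header
                string line;
                while ((line = reader.ReadLine()) != null)
                {
                    var values = line.Split(',')
                        .Select(s => double.Parse(s, CultureInfo.InvariantCulture))
                        .ToArray();
                    var scaled = values.Select((v, i) =>
                        max[i] == min[i] ? 0.0 : (v - min[i]) / (max[i] - min[i]));
                    writer.WriteLine(string.Join(",",
                        scaled.Select(v => v.ToString(CultureInfo.InvariantCulture))));
                }
            }
        }
    }
}

Would something along those lines work, or is there a better way to do this within Encog?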
First of all, thanks! The slowness was a scalability issue in the analysis phase of the CSV wizard, which would show up on particularly large files. I was able to reproduce the issue using your code above, and I just checked a fix for this into GitHub. You can see the commit here:
https://github.com/encog/encog-dotnet-core/commit/4f168c04cfd85d647f18dca5c7a2a77fff50c1e5
This will go into Encog 3.3 (which is not yet released), but you can grab the fix from GitHub now. With this fix, I can normalize a similarly sized file in just a few minutes.
Some other suggestions:

If you add this line:

norm.Report = new ConsoleStatusReportable();

you will get progress updates.
You also need to designate the prediction field, or you will run into an error later on. Something like this:
wizard.TargetFieldName = "field:1";
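Putting both suggestions together, your method would look something like this. This is just a sketch that assumes the GitHub fix above is in your build; "field:1" is a placeholder for whichever column you are actually predicting:

static void Normalize()
{
    Console.WriteLine("Normalizing...");

    var analyst = new EncogAnalyst();
    var wizard = new AnalystWizard(analyst);

    // Designate the prediction (target) field before running the wizard.
    wizard.TargetFieldName = "field:1";
    wizard.Wizard(SmallShuffledTrainingData, true, AnalystFileFormat.DecpntComma);

    // Pass the customer id column through without normalizing it.
    analyst.Script.Normalize.NormalizedFields[0].Action =
        Encog.Util.Arrayutil.NormalizationAction.PassThrough;

    var norm = new AnalystNormalizeCSV();
    norm.Report = new ConsoleStatusReportable(); // progress updates on the console
    norm.Analyze(SmallShuffledTrainingData, true, CSVFormat.English, analyst);
    norm.ProduceOutputHeaders = true;
    norm.Normalize(SmallShuffledTrainingDataNormalized);

    analyst.Save(AnalystFile);
}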