My document contains the following method to normalize my SmallShuffledTrainingData CSV with Encog.
static void Normalize()
{
    Console.WriteLine("Normalizing...");

    // Analyze the source CSV and build a normalization script.
    var analyst = new EncogAnalyst();
    var wizard = new AnalystWizard(analyst);
    wizard.Wizard(SmallShuffledTrainingData, true, AnalystFileFormat.DecpntComma);

    // Pass the customer id column through without normalizing it.
    analyst.Script.Normalize.NormalizedFields[0].Action =
        Encog.Util.Arrayutil.NormalizationAction.PassThrough;

    // Apply the script to produce the normalized CSV.
    var norm = new AnalystNormalizeCSV();
    norm.Analyze(SmallShuffledTrainingData, true, CSVFormat.English, analyst);
    norm.ProduceOutputHeaders = true;
    norm.Normalize(SmallShuffledTrainingDataNormalized);

    analyst.Save(AnalystFile);
}
Since the process takes such a long time, I am only trying to normalize a single column. My document has 332k rows and 25 columns.
Is there any way to speed up the normalization process other than breaking the file down into smaller and smaller documents? And if I do break it down, how could I combine the pieces back into one document, given that normalization needs to see all of the records to find the highest and lowest values of a given column?
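The only workaround I can think of is a manual two-pass approach outside of Encog: read every chunk once to collect the global minimum and maximum of each column, then rescale each chunk with those shared bounds and write the results into a single file. Here is a rough sketch of that idea. This is plain CSV handling, not Encog's API; the method and parameter names are made up, and it assumes comma-separated, purely numeric columns after a header row, scaling everything to the range 0 to 1.

using System;
using System.Globalization;
using System.IO;
using System.Linq;

// Hypothetical two-pass min/max normalization across several CSV chunks.
static void NormalizeChunks(string[] chunkPaths, string outputPath)
{
    double[] min = null, max = null;
    string header = null;

    // Pass 1: find each column's global min and max across all chunks.
    foreach (var path in chunkPaths)
    {
        using (var reader = new StreamReader(path))
        {
            header = reader.ReadLine();
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                var values = line.Split(',')
                    .Select(s => double.Parse(s, CultureInfo.InvariantCulture))
                    .ToArray();
                if (min == null)
                {
                    min = (double[])values.Clone();
                    max = (double[])values.Clone();
                    continue;
                }
                for (int i = 0; i < values.Length; i++)
                {
                    if (values[i] < min[i]) min[i] = values[i];
                    if (values[i] > max[i]) max[i] = values[i];
                }
            }
        }
    }

    // Pass 2: rescale every chunk with the shared bounds into one output file.
    using (var writer = new StreamWriter(outputPath))
    {
        writer.WriteLine(header);
        foreach (var path in chunkPaths)
        {
            using (var reader = new StreamReader(path))
            {
                reader.ReadLine(); // skip each chunk's header
                string line;
                while ((line = reader.ReadLine()) != null)
                {
                    var values = line.Split(',')
                        .Select(s => double.Parse(s, CultureInfo.InvariantCulture))
                        .ToArray();
                    var scaled = values.Select((v, i) =>
                        max[i] == min[i] ? 0.0 : (v - min[i]) / (max[i] - min[i]));
                    writer.WriteLine(string.Join(",",
                        scaled.Select(v => v.ToString(CultureInfo.InvariantCulture))));
                }
            }
        }
    }
}

Would something along those lines work, or is there a better way to do this within Encog?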
First of all, thanks! The slowness was a scalability issue in the analysis phase of the CSV wizard, which would show up on particularly large files. I was able to reproduce the issue using your code above, and I just checked a fix for this into GitHub. You can see the commit here:
https://github.com/encog/encog-dotnet-core/commit/4f168c04cfd85d647f18dca5c7a2a77fff50c1e5
This will go into Encog 3.3 (which is not yet released), but you can grab the fix from GitHub now. With this fix, I can normalize a similarly sized file in just a few minutes.
Some other suggestions:

If you add this line:

norm.Report = new ConsoleStatusReportable();

you will get progress updates.
You also need to designate the prediction field, or you will run into an error later on. Something like this:
wizard.TargetFieldName = "field:1";
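Putting both suggestions together, your method would look something like this. This is just a sketch that assumes the GitHub fix above is in your build; "field:1" is a placeholder for whichever column you are actually predicting:

static void Normalize()
{
    Console.WriteLine("Normalizing...");

    var analyst = new EncogAnalyst();
    var wizard = new AnalystWizard(analyst);

    // Designate the prediction (target) field before running the wizard.
    wizard.TargetFieldName = "field:1";
    wizard.Wizard(SmallShuffledTrainingData, true, AnalystFileFormat.DecpntComma);

    // Pass the customer id column through without normalizing it.
    analyst.Script.Normalize.NormalizedFields[0].Action =
        Encog.Util.Arrayutil.NormalizationAction.PassThrough;

    var norm = new AnalystNormalizeCSV();
    norm.Report = new ConsoleStatusReportable(); // progress updates on the console
    norm.Analyze(SmallShuffledTrainingData, true, CSVFormat.English, analyst);
    norm.ProduceOutputHeaders = true;
    norm.Normalize(SmallShuffledTrainingDataNormalized);

    analyst.Save(AnalystFile);
}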