matlab neural-network deep-learning data-fitting

Confusion regarding Preparation of data for the task of data fitting using NN


I am using a multilayer perceptron to fit a model to data given as input-output pairs, following the tutorial https://www.mathworks.com/help/deeplearning/gs/fit-data-with-a-neural-network.html.

Confusion 1) I am having a tough time understanding where the test set created with net.divideParam.testRatio is actually used. In general, we split the dataset into training, validation and unseen test sets, with the test set used for performance evaluation and for reporting the confusion matrix. This is the usual approach for classification tasks. But for regression and model fitting, especially using an NN, should we not also explicitly have a test set that is unseen during training? Does net.divideParam.testRatio create that unseen test set, which is then never used to test the network? The program code uses all of the inputs in the testing step. It is unclear whether, after training, I should test on an unseen dataset and report the performance on it.

% Create a Fitting Network
hiddenLayerSize = 10;
net = fitnet(hiddenLayerSize);
inputs = houseInputs;
targets = houseTargets;
% Set up Division of Data for Training, Validation, Testing
net.divideParam.trainRatio = 70/100;
net.divideParam.valRatio = 15/100;
net.divideParam.testRatio = 15/100; 
% Train the Network
[net,tr] = train(net,inputs,targets);

% Test the Network
outputs = net(inputs);
errors = gsubtract(outputs,targets);
performance = perform(net,targets,outputs)

Confusion 2) When using the regression model mvregress, do we follow the same approach as in the answer to confusion 1)?

Please help. I am unable to find the correct practices and approach for these initial steps, and I believe that getting them right has a large impact on the results.


Solution

  • I can mostly help with confusion 1). When you train a neural network with this toolbox, the dataset is split into three sets:

    1. Training set, used to train the network (the only set that actually drives the update of the network weights);
    2. Validation set, used to stop the training (this is what the "Validation checks" value in the GUI refers to);
    3. Test set, which plays no part in training and gives an independent measure of the fitter's performance (it is the test curve in the performance plots);

    Of these three, only the training set influences the weight updates. The validation set is used to stop training once the network starts overfitting the training data, that is, when a further improvement on the training data no longer improves the fit on the validation data. The test set has no effect on training and is useful as a first check of the fitter's performance on unseen data. If you inspect net.divideParam, you can see that the network stores the fraction of samples assigned to each set; during training, the inputs and targets are divided randomly according to these three values. This is why the toolbox's performance plot shows separate curves for the training, validation and test sets. Note that outputs = net(inputs) in your code evaluates the network on all of the samples, not only on the test ones. You can also avoid the random division by setting net.divideFcn to 'divideind' and supplying the indices yourself, which is mostly useful if you know your dataset well. When you train the network using

    [net,tr] = train(net,inputs,targets);
    

    tr stores the results of the training, including the indices of the training (tr.trainInd), validation (tr.valInd) and test (tr.testInd) samples. To retrieve each set, index the inputs and targets with those indices; other quantities, such as the performance of the network on each set, can also be read from tr.
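As a rough sketch of how the test samples can be recovered from tr and scored on their own, the snippet below trains a fitnet with an index-based split ('divideind') and then evaluates only the columns listed in tr.testInd. The synthetic data, the seed and the particular index ranges are assumptions made for illustration; they stand in for houseInputs and houseTargets from the question.

```matlab
% Synthetic regression data (an assumption, standing in for houseInputs/houseTargets).
rng(0);                                  % fix the seed so the run is repeatable
inputs  = rand(3, 200);                  % 3 features x 200 samples
targets = sum(inputs.^2, 1);             % 1 x 200 targets

net = fitnet(10);
net.trainParam.showWindow = false;       % suppress the training GUI

% Fixed, index-based split instead of the default random one.
net.divideFcn = 'divideind';
net.divideParam.trainInd = 1:140;
net.divideParam.valInd   = 141:170;
net.divideParam.testInd  = 171:200;

[net, tr] = train(net, inputs, targets);

% Evaluate only on the samples that influenced neither the weights nor early stopping.
testInputs  = inputs(:, tr.testInd);
testTargets = targets(:, tr.testInd);
testOutputs = net(testInputs);
testPerformance = perform(net, testTargets, testOutputs);   % MSE on the test set

fprintf('Test MSE: %g (training record reports %g)\n', testPerformance, tr.best_tperf);
```

With the default random division the last block works unchanged, since tr.testInd always holds whichever samples ended up in the test set. For a report on data the toolbox never touched at all, you could also set aside some columns before calling train and score them the same way.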

    Regarding confusion 2), I think mvregress works differently: it just estimates the regression coefficients from the data you pass in, without splitting the dataset into three parts. It is up to you to evaluate the regression, for example by holding some points out of the inputs and checking the fit on them.
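A minimal sketch of such a manual hold-out with mvregress might look like this. The synthetic data, the 80/20 split and the variable names are assumptions chosen for illustration, and the example covers only the single-response case, where the design matrix is n-by-p and includes an explicit column of ones for the intercept.

```matlab
% Synthetic linear data (an assumption for illustration only).
rng(0);
n = 200;
X = [ones(n,1), rand(n,3)];                  % design matrix with intercept column
Y = X * [1; 2; -1; 0.5] + 0.1*randn(n,1);    % single response with Gaussian noise

% Manual 80/20 split: mvregress does not divide the data for you.
idx      = randperm(n);
trainIdx = idx(1:160);
testIdx  = idx(161:end);

% Estimate the coefficients from the training rows only.
beta = mvregress(X(trainIdx,:), Y(trainIdx));

% Score the fit on the held-out rows.
Ypred   = X(testIdx,:) * beta;
testMSE = mean((Y(testIdx) - Ypred).^2);
fprintf('Held-out MSE: %g\n', testMSE);
```

Because there is no early stopping here, a separate validation set is not needed unless you are also choosing between candidate models.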