Tags: machine-learning, neural-network, backpropagation

How to judge a neural network?


I wrote a neural network. It's mostly based (with bug fixes) on the neural nets from James McCaffrey: https://visualstudiomagazine.com/articles/2015/04/01/back-propagation-using-c.aspx. I came across various Git projects and books using his code, and as he worked for MS Research I assumed his work would be good; maybe not top of the line (it's not running on top of CUDA or anything), but it's code that I can read, although I'm not into the science side of it. His sample worked on a data set much like my problem.

My goal was to solve an image classification problem (a data set based on pixel info). The problem wasn't easy to recreate, but I managed to build a data set of 50 good scenarios and 50 bad scenarios. When I plotted the measurements in a scatter diagram, the two sets overlapped a lot around a fuzzy boundary. I myself was unable to make something out of it; it was too fuzzy for me. As I had 5 inputs per sample, I wondered if a neural net might be able to find the inner relations and solve my fuzzy classification problem.

And well, so it did... I guess.
Depending on the seeding of the weights (I got to 80%), the number of nodes and the time spent learning, I get training scores of around 85% to 90%, and lately 95%.

First I played with the random initialization of the weights. Then I played with the number of nodes. Then I played with the learn rate, momentum and weight decay. They went from (scoring 85 to 90%):

// as in the example code I used
int maxEpochs = 100000;
double learnRate = 0.05;
double momentum = 0.01;
double weightDecay = 0.0001;

to (scoring 95%):

int maxEpochs = 100000;
double learnRate = 0.02;  //had a huge effect
double momentum = 0.01;
double weightDecay = 0.001; //had a huge effect
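
For context, this is roughly how those three constants enter a single weight update in plain back-propagation; the article's code may arrange it slightly differently (e.g. sign conventions or where the decay is applied), so treat it as a sketch:

using System;

// Rough sketch of one back-prop weight update, just to show where the three
// constants act; the real code updates every weight like this on every pass.
double weight = 0.5;        // one weight of the network
double prevDelta = 0.0;     // previous update, reused by the momentum term
double gradient = 0.12;     // dError/dWeight from back-propagation (made-up value)

double learnRate = 0.02;    // scales the size of each step
double momentum = 0.01;     // re-applies a fraction of the previous step
double weightDecay = 0.001; // shrinks every weight a little toward zero

double delta = -learnRate * gradient;     // step against the error gradient
weight += delta + momentum * prevDelta;   // momentum smooths successive steps
weight -= weightDecay * weight;           // decay keeps weights from growing
prevDelta = delta;

Console.WriteLine($"updated weight: {weight:F5}");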

I'm a bit surprised that the number of nodes had less effect than changing the random initialization of the net and changing the above constants.

However, it makes me wonder:

  • As a general rule of thumb, is 95% a high score? (I'm not sure where the limits are, but I think it also depends on the data set.) While I am amazed by 95%, I wonder if it would be possible to tweak it to 97%.
  • Should I try to minimize the number of hidden nodes? Currently it's a 5:9:3 network, but I once had a similar score with a 5:6:3 network.
  • Is it normal for a neural network's score to be influenced this much by the initial random weights (a different start seed)? I thought the training would overcome the starting situation.

Solution

  • First, sorry if I didn't understand correctly, but it looks like you have 100 training examples and no validation / test set. This is rather small for a training set, which makes it easy for the NN to overfit on it. You also seem to have chosen a small NN, so maybe you actually don't overfit. The best way to check would be to have a test set (see the hold-out split sketch after this answer).

    As to your questions:

    • What a "good score" is depends entirely on your problem. For instance, on MNIST (a widely used digit recognition data set) this would be considered quite bad; the best scores are above 99.7% (and it's not too hard to get 99% with a ConvNet), but on ImageNet, for instance, it would be awesome. A good way to know whether you're doing well is to compare to human performance somehow. Reaching it is usually hard, so being a bit below it is good, above it is very good, and far below it is bad. Again, this is subjective and depends on your problem.

    • You should definitely try to minimize the number of hidden nodes, following Occam's razor: among several models, the simplest is the best one. It has two main advantages: it will run faster, and it will generalize better (if two models perform similarly on your training set, the simplest one is most likely to work better on a new test set). A small sweep over node counts, like the grid-search sketch after this answer, is an easy way to compare them.

    • The initialization is known to change the result a lot. However, the big differences are between the different initialization methods: constant / simple random (widely used, usually a (truncated) normal distribution) / smarter random (Xavier initialization, for instance; see the sketch after this answer) / "cleverer" initializations (pre-computed features, etc., which are harder to use). Between two random initializations generated exactly the same way, the difference in performance should not be that big. My guess is that in some cases you just did not train long enough (the time needed to train properly can change a lot depending on the initialization). My other guess is that the small size of your data set and network makes the evaluation more dependent on the initial weights than it usually is.

    It is normal that the learning rate and weight decay change the result a lot; however, finding good values for them efficiently can be hard. A small grid search over both, as in the last sketch below, is a simple starting point.
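
To make the test-set point concrete, here is a minimal hold-out split sketch in C#; the rows are random filler just so the snippet runs, and you would substitute your 100 real rows (5 inputs plus the encoded class):

using System;
using System.Linq;

// Hold-out split sketch: keep part of the data completely out of training and
// score the trained net on it. The rows here are random filler just so the
// snippet runs; substitute your 100 real rows (5 inputs + encoded class).
var rnd = new Random(0);                            // fixed seed => reproducible split
double[][] allData = Enumerable.Range(0, 100)
    .Select(_ => Enumerable.Range(0, 8).Select(c => rnd.NextDouble()).ToArray())
    .ToArray();

double[][] shuffled = allData.OrderBy(_ => rnd.Next()).ToArray();
int testCount = shuffled.Length / 5;                // hold out ~20% for testing
double[][] testData = shuffled.Take(testCount).ToArray();
double[][] trainData = shuffled.Skip(testCount).ToArray();

Console.WriteLine($"train: {trainData.Length} rows, test: {testData.Length} rows");
// Train only on trainData, then compute accuracy on testData as well; a large
// gap between training accuracy and test accuracy is the classic sign of overfitting.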
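
And a small sketch of Xavier/Glorot-style uniform initialization for the weight matrices of a 5:9:3 network, written generically rather than against the article's classes:

using System;

var rnd = new Random(42);
double[][] inputToHidden = XavierInit(5, 9, rnd);    // 5 inputs -> 9 hidden nodes
double[][] hiddenToOutput = XavierInit(9, 3, rnd);   // 9 hidden -> 3 outputs
Console.WriteLine($"first weight: {inputToHidden[0][0]:F4}");

// Xavier/Glorot-style uniform initialization: each weight is drawn from
// U(-limit, +limit) with limit = sqrt(6 / (fanIn + fanOut)), which keeps the
// variance of the activations roughly constant from layer to layer.
static double[][] XavierInit(int fanIn, int fanOut, Random rnd)
{
    double limit = Math.Sqrt(6.0 / (fanIn + fanOut));
    var w = new double[fanIn][];
    for (int i = 0; i < fanIn; i++)
    {
        w[i] = new double[fanOut];
        for (int j = 0; j < fanOut; j++)
            w[i][j] = (rnd.NextDouble() * 2.0 - 1.0) * limit;  // uniform in [-limit, +limit]
    }
    return w;
}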
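
Finally, a sketch of a small grid search over hidden-node count, learn rate and weight decay. Evaluate is a placeholder: in your case it would train a fresh net with those settings and return its accuracy on the held-out test set:

using System;

// Small grid search over hidden-node count, learn rate and weight decay.
// Evaluate is a placeholder: in real use it would train a fresh net with the
// given settings and return its accuracy on the held-out test set.
var rnd = new Random(0);
double Evaluate(int hiddenNodes, double learnRate, double weightDecay)
    => rnd.NextDouble();   // placeholder score, stands in for "train + test accuracy"

int[] hiddenCounts = { 4, 6, 9, 12 };
double[] learnRates = { 0.005, 0.01, 0.02, 0.05 };
double[] decays = { 0.0001, 0.001, 0.01 };

(double score, int h, double lr, double wd) best = (double.MinValue, 0, 0, 0);
foreach (int h in hiddenCounts)
    foreach (double lr in learnRates)
        foreach (double wd in decays)
        {
            double score = Evaluate(h, lr, wd);
            if (score > best.score)
                best = (score, h, lr, wd);
        }

Console.WriteLine(
    $"best: {best.h} hidden nodes, learnRate={best.lr}, weightDecay={best.wd}, accuracy={best.score:P1}");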