I wrote a neural network; it's mostly based (with bug fixes) on the neural nets from James McCaffrey: https://visualstudiomagazine.com/articles/2015/04/01/back-propagation-using-c.aspx. I came across various Git projects and books using his code, and since he worked for MS Research I assumed his work would be good. Maybe not top of the bill (it's not running on top of CUDA or anything), but it's code that I can read, even though I'm not into the science side of it. His sample worked on a dataset much like my problem.
My goal was to solve an image classification problem (a dataset based on pixel info). The problem wasn't easy to recreate, but I managed to build a dataset of 50 good scenarios and 50 bad scenarios. When I plotted the measurements in a scatter diagram, both sets overlapped a lot along a fuzzy boundary. I myself was unable to make anything out of it; it was too fuzzy for me. Since I had 5 inputs per sample, I wondered if a neural net might be able to find the inner relations and solve my fuzzy data classification problem.
And well, so it did... well, I kind of guess. Depending on the seeding of the weights (with some seeds I only got to 80%), the number of nodes, and how long I let it learn, I get training scores of around 85% to 90%, and lately 95%.
First I played with the random initialization of the weights. Then I played with the number of nodes. Then I played with the learn rate, momentum, and weight decay. Those constants went from (scoring 85% to 90%):
// as in the example code I used
int maxEpochs = 100000;
double learnRate = 0.05;
double momentum = 0.01;
double weightDecay = 0.0001;
to (scoring 95%):
int maxEpochs = 100000;
double learnRate = 0.02; //had a huge effect
double momentum = 0.01;
double weightDecay = 0.001; //had a huge effect
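If I read the code right, those constants enter the per-weight update roughly like this (my own paraphrase of the usual back-prop update with momentum and weight decay, not his exact code), which would explain why learnRate and weightDecay touch every weight on every pass:

static double UpdateWeight(double weight, double gradient, double input,
                           ref double prevDelta,
                           double learnRate, double momentum, double weightDecay)
{
    double delta = learnRate * gradient * input; // step along the gradient
    weight += delta;
    weight += momentum * prevDelta;              // carry over part of the previous step
    weight -= weightDecay * weight;              // shrink every weight a little each update
    prevDelta = delta;
    return weight;
}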
I'm a bit surprised that the number of nodes had less effect than changing the random initialization of the net and changing the above constants did.
However, it makes me wonder.
First, sorry if I didn't understand correctly, but it looks like you have 100 training examples and no validation / test set. This is rather small for a training set, which makes it easy for the NN to overtrain on it. You also seem to have chosen a small NN, so maybe you don't actually overfit. The best way to check would be to have a test set.
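For example, you could shuffle your 100 samples and hold a part out before training. A minimal sketch, assuming your data is a double[][] whose rows are the inputs followed by the encoded class label, as in the McCaffrey demo code:

using System;
using System.Linq;

static class DataSplit
{
    // Shuffle the rows, then keep a fraction for training and the rest for testing.
    public static void Split(double[][] allData, double trainPct, int seed,
                             out double[][] trainData, out double[][] testData)
    {
        Random rnd = new Random(seed);
        double[][] shuffled = allData.OrderBy(_ => rnd.Next()).ToArray();
        int nTrain = (int)(allData.Length * trainPct);
        trainData = shuffled.Take(nTrain).ToArray();
        testData = shuffled.Skip(nTrain).ToArray();
    }
}

Train only on trainData and report accuracy on testData; a big gap between the two scores is the sign of overfitting.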
As to your questions:
what a "good score" is depends entirely on your problem. For instance, on MNIST (widely used digit recognition dataset) this would be considered quite bad, the best scores are above 99.7% (and it's not too hard to get 99% with a ConvNet), but on ImageNet for instance that would be awesome. A good way to know if you're good or not is to compare to human performance somehow. Reaching it is usually hard, so being a bit below it is good, above is very good, and far below it is bad. Again this is subjective, and depends on your problem.
You should definitely try to minimize the number of hidden nodes, following Occam's razor: among several models, the simplest is the best one. That has 2 main advantages: it will run faster, and it will generalize better (if two models perform similarly on your training set, the simpler one is most likely to work better on a new test set).
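In practice that means sweeping the hidden-layer size and scoring each candidate on held-out data. A rough sketch, assuming a McCaffrey-style NeuralNetwork class with a (numInput, numHidden, numOutput) constructor and InitializeWeights / Train / Accuracy methods like in the linked article, where Train takes (trainData, maxEpochs, learnRate, momentum, weightDecay):

int[] candidates = { 2, 4, 6, 8, 12 };
foreach (int numHidden in candidates)
{
    var nn = new NeuralNetwork(5, numHidden, 2); // 5 inputs, 2 classes as in the question
    nn.InitializeWeights();
    nn.Train(trainData, maxEpochs, learnRate, momentum, weightDecay);
    Console.WriteLine("hidden=" + numHidden +
                      "  train=" + nn.Accuracy(trainData).ToString("F3") +
                      "  test=" + nn.Accuracy(testData).ToString("F3"));
}

Then keep the smallest numHidden whose test accuracy is close to the best one.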
It is normal that the learning rate and weight decay change the result a lot; however, finding the optimal values for them efficiently can be hard.
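A simple (if brute-force) way to look for them is a small grid search over both, scored on held-out data (ideally a separate validation set rather than your final test set). Same assumed API as the sketch above:

double[] learnRates = { 0.005, 0.01, 0.02, 0.05 };
double[] decays = { 0.0, 0.0001, 0.001 };
double bestAcc = 0.0;
foreach (double lr in learnRates)
{
    foreach (double wd in decays)
    {
        var nn = new NeuralNetwork(5, numHidden, 2);
        nn.InitializeWeights();
        nn.Train(trainData, maxEpochs, lr, momentum, wd);
        double acc = nn.Accuracy(testData);
        if (acc > bestAcc)
        {
            bestAcc = acc;
            Console.WriteLine("new best: learnRate=" + lr + ", weightDecay=" + wd +
                              ", accuracy=" + acc.ToString("F3"));
        }
    }
}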