machine-learning scikit-learn artificial-intelligence random-forest

Why does the model fail to learn this game of filling up integers

Why does my model fail to learn to play this game of just producing an array of unique elements from 1 to 5 from a partially filled array?

===

I am trying to train a model to perform this task:

Given a fixed array of 5 elements consisting of at most ONE of each element from (1, 2, 3, 4, 5) and ONE OR MORE (0), replace the 0s with appropriate values so that the final array has exactly ONE of each (1, 2, 3, 4, 5).

So, here is how it should be played:

[1, 2, 3, 4, 0] => [1, 2, 3, 4, 5]
[4, 3, 0, 5, 1] => [4, 3, 2, 5, 1]
[0, 3, 5, 4, 0] => [2, 3, 5, 4, 1] OR [1, 3, 5, 4, 2]
...

This is not a complicated game (in human sense), but I want to see if a model can identify the rules (replace 0s with 1 to 5, so that final array has only exactly one element from (1, 2, 3, 4, 5)).

The way I did this is:

Generate N examples of combinations configurations with elements of [1, 2, 3, 4, 5] as answers, and randomly replace some of the elements as 0s.
For instance, one training example is [(0, 3, 5, 4, 0), (2, 3, 5, 4, 1)].
There can be multiple same input mapping to different output, i.e. [(0, 3, 5, 4, 0), (2, 3, 5, 4, 1)] and [(0, 3, 5, 4, 0), (1, 3, 5, 4, 2)] can be both present as two separate training instances.
Separate the training data set 10 fold, shuffled, and train using a RandomForestClassifier from Scikit-Learn.
A correct output is defined as the final configuration array has exactly ONE element from (1, 2, 3, 4, 5). So (2, 4, 4, 5, 1) is not valid.

===

Surprisingly, using 1000, 10000, 50000, and even 100000 training examples still results in the model only getting ~70% of the test cases right - meaning the model did not learn how to play the game with increasing training examples.

One thing I was thinking is that RandomForestClassifier is just not used for this type of problem, called structured machine learning, where the output is not a single category or a real-valued output, but a vector of output.

More questions:

Why does the model fail to learn this game?
Is this the right way to model this problem?
Is the data not enough to learn this task? But increasing data from 1000 to 100000 does not seem to help at all.

Solution

lejlot's answer is excellent, but I thought I'd add a bit of intuition as to why random forest fails in this case.

You have to keep in mind that Machine Learning isn't some magic way to impart intelligence to computers; it's simply a way of fitting a particular model to your data and using that model to make generalizations. As the old adage goes, "all models are wrong, but some are useful". You've hit on a case where the model is wrong as usual, but also happens to be useless!

The output space: Random forests at their core are basically a clever and generalizable way of mapping inputs to outputs. Your output space has 5^5 = 3125 possible unique outputs, and only 5! = 120 of these are valid (i.e. outputs with one of each number). The only way for a random forest to know whether an output is valid is if it has seen it: so in order to work correctly, your training set will have to include examples with all of those 120 outputs.
The input space: when a random forest encounters an input it has seen before, it will map that directly to an output that it has seen before. But what if it encounters an input it has not seen? For example, what if you ask for the answer to [0, 2, 3, 4, 1] and this is not in the training set? In terms of Euclidean distance (a useful way to think about how things are grouped) the closest result will probably be something like [0, 2, 3, 4, 0], which might map to [1, 2, 3, 4, 5], which is wrong. Thus we see that in order for random forests to work correctly, your training set will have to have all possible inputs. Some quick combinatorics show that your training set will have to be of size at least 5!*32 = 3840, with no duplicates.
The forest itself: even if you have a complete input space, the random forest does not consist of a simple dictionary mapping of inputs to outputs. Depending on the parameters of the model, the mapping is typically from groups of nearby results to a single answer, so that, for example, {[1, 2, 3, 4, 5], [1, 0, 3, 4, 5], [0, 1, 3, 4, 5]...} will all map to [1, 2, 3, 4, 5]. This sort of generalization is useful in most cases, but is not useful for your particular problem. The only way for the random forest to work in your case would be to push the max_depth and min_samples parameters to their extreme values, so that the forest is essentially a one-to-one mapping of inputs to their correct outputs: in other words your classifier would be just an extremely complicated way of building a dictionary.

To summarize: Machine Learning is just a model applied to data, which is useful in certain cases. In your case, the model is not all that useful: in order for Random Forests to work on your problem, you'd need to over-fit a comprehensive set of inputs and outputs. At that point, you might as well just construct a dictionary and call it a day.