machine-learning, random-forest, supervised-learning, unsupervised-learning

Random forest algorithms able to switch between data sets


I'm curious as to whether research has been done into random forests that combine unsupervised and supervised learning in a way that allows a single algorithm to find patterns in, and work with, multiple different data sets. I have googled every possible way to find research on this and have come up empty. Can anyone point me in the right direction?

Note: I have already asked this question on the Data Science forum, but it's basically dead, so I came here.


Solution

  • (Please also read the comments; I will incorporate their content into my answer.)

    Reading between the lines, it seems that you want to use deep networks in a transfer-learning setting. However, this would not be based on decision trees; see http://jmlr.csail.mit.edu/proceedings/papers/v27/mesnil12a/mesnil12a.pdf

    There are many elements in your question:

    1.) Machine learning algorithms, in general, don't care about the source of your data set, so you can feed a learning algorithm 20 different data sets and it will use all of them. However, the data should share the same underlying concept (except in the transfer-learning case; see below). This means: combining cats/dogs data with bills data will not work, or will at least make things much harder for the algorithm. In particular, all input features need to be identical (exceptions exist); e.g., it is hard to combine images with text.
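
    A minimal sketch of that idea (synthetic stand-in data; the point is only that both sets share the same feature layout):

    ```python
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    rng = np.random.default_rng(0)

    # Two data sets with the SAME feature layout (here: 5 numeric features).
    X_a, y_a = rng.normal(size=(100, 5)), rng.integers(0, 2, 100)
    X_b, y_b = rng.normal(size=(80, 5)), rng.integers(0, 2, 80)

    # The learner does not care where the rows came from; it sees one matrix.
    X = np.vstack([X_a, X_b])
    y = np.concatenate([y_a, y_b])

    clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
    ```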

    2.) labeled/unlabeled: two important terms. A data set is a set of data points with a fixed number of dimensions. Data point i might be described as {X_i1, ..., X_in}, where each component X_ij might, for example, be a pixel. A label Y_i comes from another domain, e.g., cats and dogs.
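
    In code, the same notation might look like this (toy values, purely illustrative):

    ```python
    import numpy as np

    # A data set: 3 data points, each with n = 4 dimensions (e.g. pixel values).
    X = np.array([[0.1, 0.9, 0.3, 0.5],   # data point {X_11, ..., X_1n}
                  [0.7, 0.2, 0.8, 0.4],
                  [0.0, 0.6, 0.1, 0.9]])

    # Labels come from another domain, e.g. "cat" vs. "dog".
    y = np.array(["cat", "dog", "cat"])
    ```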

    3.) unsupervised learning: data without any labels. (I have a gut feeling that this is not what you want.)
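
    For completeness, a sketch of what that looks like, with k-means clustering standing in for "find structure without labels":

    ```python
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    # Unlabeled data: we only have X, no y.
    X, _ = make_blobs(n_samples=200, centers=3, random_state=0)

    # An unsupervised learner looks for structure (here: 3 clusters) on its own.
    cluster_ids = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
    ```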

    4.) semi-supervised learning: the idea is that you combine data with labels and data without labels. You have a set of images labeled as cats and dogs, {X_i1, ..., X_in, Y_i}, and a second set of cats/dogs images without labels, {X_j1, ..., X_jn}. The algorithm can use this to build better classifiers, as the unlabeled data provide information on what images look like in general.
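
    One way to sketch this with a random forest: scikit-learn marks unlabeled points with -1, and its SelfTrainingClassifier (one of several semi-supervised approaches) lets the forest exploit them:

    ```python
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.semi_supervised import SelfTrainingClassifier

    X, y = make_classification(n_samples=300, random_state=0)

    # Pretend most labels are unknown: -1 means "unlabeled" in scikit-learn.
    y_partial = y.copy()
    y_partial[50:] = -1

    # Self-training: the forest labels confident unlabeled points and re-trains.
    clf = SelfTrainingClassifier(RandomForestClassifier(random_state=0))
    clf.fit(X, y_partial)
    ```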

    5.) transfer learning (I think this comes closest to what you want): the idea is that you provide a data set of cats and dogs and learn a classifier. Afterwards, you want to train a classifier on images of cats/dogs/hamsters. The training does not need to start from scratch, but can use the cats/dogs classifier to converge much faster.
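
    Random forests have no standard transfer-learning mechanism, but scikit-learn's warm_start flag gives a rough analogue of "not starting from scratch": keep the trees grown on the old data and add new trees trained on the new data. Note that the set of classes must stay fixed between fits, so this sketch does not cover actually adding the hamster class:

    ```python
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    # Stand-ins for the old task and the new, related task.
    X_old, y_old = make_classification(n_samples=200, random_state=0)
    X_new, y_new = make_classification(n_samples=200, random_state=1)

    clf = RandomForestClassifier(n_estimators=100, warm_start=True, random_state=0)
    clf.fit(X_old, y_old)    # 100 trees trained on the old data

    clf.n_estimators += 50   # keep the old trees ...
    clf.fit(X_new, y_new)    # ... and grow 50 new ones on the new data
    ```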

    6.) feature generation / feature construction: the idea is that the algorithm learns features like "eyes", which are then used in the next step to learn the classifier. I'm mainly aware of this in the context of deep learning, where the algorithm learns, in a first step, concepts like edges and then constructs increasingly complex features (faces, cats) until it can describe things like "the man on the elephant". This, combined with transfer learning, is probably what you want. However, deep learning is based on neural networks, with a few exceptions.
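
    Deep networks learn such features themselves; as a shallow stand-in for the two-step idea (construct features first, then classify on top), here is a pipeline with PCA playing the role of the feature learner:

    ```python
    from sklearn.datasets import load_digits
    from sklearn.decomposition import PCA
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.pipeline import make_pipeline

    X, y = load_digits(return_X_y=True)

    # Step 1 constructs features without looking at labels; step 2 classifies.
    clf = make_pipeline(PCA(n_components=20),
                        RandomForestClassifier(random_state=0))
    clf.fit(X, y)
    ```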

    7.) outlier detection: you provide a data set of cats/dogs as known images. When you then present a hamster image, the classifier tells you that it has never seen anything like a hamster before.
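
    This one even has a well-known tree-based implementation: an isolation forest fitted on the known cats/dogs data flags points that look like nothing it has seen. A sketch with synthetic feature vectors standing in for images:

    ```python
    import numpy as np
    from sklearn.ensemble import IsolationForest

    rng = np.random.default_rng(0)

    # "Known" data: feature vectors of cats/dogs images (synthetic stand-ins).
    X_known = rng.normal(size=(200, 8))

    detector = IsolationForest(random_state=0).fit(X_known)

    # A familiar-looking point vs. a far-away "hamster": +1 = inlier, -1 = outlier.
    X_new = np.vstack([rng.normal(size=(1, 8)), np.full((1, 8), 8.0)])
    print(detector.predict(X_new))   # e.g. [ 1 -1 ]
    ```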

    8.) active learning: the idea is that you don't provide labels for all examples (data points) beforehand, but that the algorithm asks you to label certain data points. This way you need to label much less data.
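
    A minimal uncertainty-sampling loop with a random forest (synthetic data; in reality the "oracle" would be a human labeling each queried point):

    ```python
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    X, y = make_classification(n_samples=500, random_state=0)

    labeled = list(range(10))    # start with only 10 labeled points
    pool = [i for i in range(len(X)) if i not in labeled]

    for _ in range(5):           # 5 query rounds
        clf = RandomForestClassifier(random_state=0).fit(X[labeled], y[labeled])
        proba = clf.predict_proba(X[pool])
        # Query the pool point the forest is least confident about ...
        query = pool[int(np.argmin(proba.max(axis=1)))]
        # ... and "ask the oracle" for its label (here we just look it up in y).
        labeled.append(query)
        pool.remove(query)
    ```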