I have several features which can vote on whether a certain data item is worthy of showing to my users. You can think of each of them as a number between 0 and 1, where 1 means it is good and 0 means it is not worthy of showing to my users. I had just been doing the pretty standard thing of picking a weight for each feature and performing a weighted sum to get a single indicator for making the decision (much like a single perceptron unit).
However, sometimes different properties will overpower each other and give bad results. I think the basic problem is that the true optimal function is rather non-linear and of course the only rules that these weighted sums will give are linear by definition. To try to combat this, on one of the features which was getting "overpowered" in the weighted sum, I used it to multiply the whole single indicator. This allows this important feature to act as a "gatekeeper" -- if this one feature is too low it alone can keep data from going out.
To achieve a similar effect with the standard weighted sum, I would have to make the weight on that feature so high that the other features would have almost no say. It comes back to the non-linearity of the best rule: this feature can be very important in some ranges but not in others.
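To make the comparison concrete, here is a minimal sketch of the two scoring rules described above. The feature values, weights, and function names are made up for illustration; they are not taken from my real system.

```python
def weighted_sum(features, weights):
    """Linear score: weighted average of the features, so it stays in [0, 1]."""
    return sum(w * f for w, f in zip(weights, features)) / sum(weights)


def gated_score(gate, features, weights):
    """Weighted average of the other features, multiplied by the gatekeeper."""
    return gate * weighted_sum(features, weights)


# Hypothetical item: the gatekeeper feature is low, the other features are high.
gate = 0.1
others = [0.9, 0.8, 0.95]
equal = [1.0, 1.0, 1.0]

# Plain weighted sum with the gatekeeper as just another feature:
# the three high features outvote it.
linear = weighted_sum([gate] + others, [1.0] + equal)

# Multiplying by the gatekeeper: the low value alone holds the item back.
gated = gated_score(gate, others, equal)

# Giving the gatekeeper a huge weight also holds the item back ...
heavy_low_gate = weighted_sum([gate] + others, [20.0] + equal)

# ... but then an item with a high gatekeeper and useless other features
# gets through, because the other features no longer have a say.
heavy_bad_others = weighted_sum([0.9, 0.0, 0.0, 0.0], [20.0] + equal)

print(linear, gated, heavy_low_gate, heavy_bad_others)
```

With these made-up numbers the plain sum stays well above 0.5, the gated score drops below 0.1, and the heavily weighted sum only "fixes" the first case by letting the gatekeeper decide everything.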
I was wondering what is known about using a feature to multiply the whole result like this. Is there a specific reason that weighted sums are the most commonly used approach (other than simplicity)?
PS. Once I have much more data I will probably use a standard machine learning technique to actually learn the rule, but for now I am hand training it on sample data sets. I am going for simplicity right now while still trying to make it work well.
Your question is really good.
What you mention is an important problem, both from a theoretical and a practical standpoint: how should I use my features to get the best results?
Let me give you an example. For part-of-speech tagging, the origin of the document is usually not useful, because most words are used in the same way whether the article came from the WSJ or from Wired. So a feature like article origin gets "overpowered", to use your lingo. But sometimes you get a word like "monitor", for which knowing where it appeared almost tells you how to tag it (in the WSJ: verb; in Wired: noun).
The document origin feature is not a useful feature at first sight but it is a useful meta-feature about the word we're trying to tag. In the lingo of domain adaptation it characterizes the domain.
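One common way to let such a meta-feature have a say is to pair it with the feature it qualifies, so that the model sees the combination rather than each part alone. The sketch below is only an illustration of that idea: the feature names, the hand-set weights, and the two-tag setup are all hypothetical.

```python
def token_features(word, origin):
    """Feature names for one token: the word, the origin, and their conjunction."""
    return [
        f"word={word}",
        f"origin={origin}",
        f"word={word}&origin={origin}",  # conjunction of word and domain
    ]


# Hypothetical hand-set weights: only the conjunctions carry any signal here.
WEIGHTS = {
    "verb": {"word=monitor&origin=WSJ": 1.0},
    "noun": {"word=monitor&origin=Wired": 1.0},
}


def best_tag(word, origin):
    """Pick the tag whose weights give the highest linear score."""
    feats = token_features(word, origin)
    scores = {
        tag: sum(weights.get(f, 0.0) for f in feats)
        for tag, weights in WEIGHTS.items()
    }
    return max(scores, key=scores.get)


print(best_tag("monitor", "WSJ"))
print(best_tag("monitor", "Wired"))
```

The origin feature on its own gets no weight at all, yet the conjunction decides the tag for "monitor". The scorer is still a plain weighted sum; the non-linearity lives in the features.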
Some keywords you want to look at for this type of problem are:
Another useful bit of information is that linear classifiers are particularly bad at capturing these interactions, which you have already characterized as non-linear. If possible, you should use at least a quadratic or RBF kernel, or something more sophisticated, that has a hope of capturing them.
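As a small illustration of why quadratic terms help, note that your "gatekeeper" rule is itself a weighted sum over products of features. The sketch below checks this numerically; the values, weights, and function names are assumptions made up for the example.

```python
def linear_score(features, weights):
    """Plain weighted sum of the features."""
    return sum(w * f for w, f in zip(weights, features))


def product_features(gate, features):
    """Degree-2 interaction features: the gatekeeper times each other feature."""
    return [gate * f for f in features]


# Hypothetical values: one gatekeeper feature and three ordinary features.
gate = 0.1
others = [0.9, 0.8, 0.95]
weights = [0.5, 0.3, 0.2]

# The multiplicative rule from the question ...
gated = gate * linear_score(others, weights)

# ... equals an ordinary weighted sum over the product features.
expanded = linear_score(product_features(gate, others), weights)

print(gated, expanded)
```

So a model that is linear in the original features cannot express the gate, but one that is linear in the pairwise products (which is what a quadratic kernel gives you implicitly) can.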