r classification svm random-forest naivebayes

Can I use a variable as an explanatory variable if it was used to derive the dependent variable?


I am trying to create 3 classification models: Naive Bayes, Random Forest and SVM.

The variable I am trying to predict is Film Verdict, with categories 'hit' and 'flop'. I derived its values from the formula Revenue/Budget: if the ratio was 1 or more, the film was classified as a hit; otherwise it was classified as a flop.
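To make the labelling rule concrete, here is a minimal sketch in Python (the question is tagged R, but the logic carries over unchanged). The function name `film_verdict` and the sample figures are hypothetical, and the sketch assumes that "1+" means a ratio of 1 or greater.

```python
def film_verdict(revenue: float, budget: float) -> str:
    """Return 'hit' when Revenue/Budget is at least 1, otherwise 'flop'."""
    if budget <= 0:
        # Assumption: a non-positive budget is treated as invalid input.
        raise ValueError("budget must be positive")
    return "hit" if revenue / budget >= 1 else "flop"


# Made-up figures, purely for illustration.
for revenue, budget in [(150.0, 100.0), (100.0, 100.0), (40.0, 100.0)]:
    print(revenue, budget, film_verdict(revenue, budget))
```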

My question is: Since I have used Revenue and Budget to create the Film Verdict variable, can I use those two as part of the explanatory/independent variables in my models?

Clarification: I have several other variables such as ActorRating, Tweet Polarity etc. used as input variables as well.


Solution

  • Yes, you can: any variable that will be available to you at prediction time can be used. However, in your example the model would be trivial, because the output variable can be derived directly from the input variables (Film Verdict is just Revenue/Budget compared against 1).

    A few topics you may want to read more about:

    • Data Leakage: Using information during training that will not be available at prediction time, such as data from the test set
    • Heteroscedasticity: When some sub-populations have different variability from others
    • Collinearity: High correlation between independent variables
    • Overfitting: When the model performs well on the training data but poorly on the test data

    Some algorithms are more prone to some of these problems than others, so knowing this will help you find the best one.
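The point that the output can be derived directly from the inputs can be sketched as follows. This is an illustrative Python example on synthetic data, not a real experiment: the names `make_films`, `leaky_model` and `baseline_model` are hypothetical, the budgets and revenues are randomly generated, and "1+" is assumed to mean a ratio of 1 or greater. A "model" that sees both Revenue and Budget only has to repeat the division, so it reproduces the label exactly, while a baseline that sees neither cannot.

```python
import random
from collections import Counter


def film_verdict(revenue, budget):
    """Label rule from the question: 'hit' if Revenue/Budget >= 1, else 'flop'."""
    return "hit" if revenue / budget >= 1 else "flop"


def make_films(n, seed=0):
    """Generate synthetic films (made-up numbers, for illustration only)."""
    rng = random.Random(seed)
    films = []
    for _ in range(n):
        budget = rng.uniform(1.0, 200.0)          # hypothetical budget
        revenue = budget * rng.uniform(0.2, 3.0)  # hypothetical return multiple
        films.append({
            "budget": budget,
            "revenue": revenue,
            "verdict": film_verdict(revenue, budget),
        })
    return films


def leaky_model(film):
    # Given both Revenue and Budget, the "model" simply recomputes the label.
    return "hit" if film["revenue"] / film["budget"] >= 1 else "flop"


def accuracy(predict, films):
    """Fraction of films whose predicted verdict matches the stored label."""
    return sum(predict(f) == f["verdict"] for f in films) / len(films)


films = make_films(1000)

# Baseline that ignores Revenue and Budget: always predict the majority class.
majority = Counter(f["verdict"] for f in films).most_common(1)[0][0]


def baseline_model(film):
    return majority


leaky_accuracy = accuracy(leaky_model, films)
baseline_accuracy = accuracy(baseline_model, films)
print("with Revenue and Budget:", leaky_accuracy)
print("majority-class baseline:", baseline_accuracy)
```

The perfect score of the first model says nothing about the other inputs (ActorRating, Tweet Polarity and so on); it only shows that the label is a deterministic function of two of the features.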