machine-learning, statistics, logistic-regression

Model building methodology


I happen to have a dataset of 4000 rows, where the target variable has 3999 1's and only one 0.

The data is quarterly, and I'm supposed to estimate the probability of success in the next quarter. Is it feasible to apply logistic regression here?

Or can somebody suggest a better alternative?


Solution

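  • I agree that the dataset is too unbalanced. A single negative example cannot support any statistically meaningful estimate of what separates the two classes. You also can't do cross-validation: however you split the data, the one negative example ends up either in the training part or in the test part, never in both, so you can't even validate your model.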

    You can try to visualize the data in a lower dimension to check whether the negative example is clearly an outlier. Look up the topic of 'anomaly detection' to find out more.

    However, this will not tell you whether the 1's will occur in the next quarter, because the data is not suited to that question. With data like this, if you had more negative examples, you could predict the label of a new sample from its features. That is not the same as the probability of a similar dataset occurring in the next quarter.
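
To make the cross-validation point concrete, here is a minimal sketch in plain Python. It rebuilds only the label column from the question (3999 ones and a single zero) and counts the folds in which both the training part and the test part contain both classes. The helper names `k_fold_indices` and `usable_folds` are made up for this illustration and do not come from any library.

```python
# Labels as described in the question: 3999 ones and a single zero.
labels = [1] * 3999 + [0]


def k_fold_indices(n, k):
    """Yield (train_idx, test_idx) pairs for k contiguous folds (no shuffling)."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n))
        yield train_idx, test_idx
        start += size


def usable_folds(labels, k):
    """Count folds where training and test parts each contain both classes."""
    usable = 0
    for train_idx, test_idx in k_fold_indices(len(labels), k):
        train_classes = {labels[i] for i in train_idx}
        test_classes = {labels[i] for i in test_idx}
        if len(train_classes) == 2 and len(test_classes) == 2:
            usable += 1
    return usable


for k in (2, 5, 10):
    print(k, usable_folds(labels, k))
```

The count is zero for every `k`, and shuffling the rows first would not change that: a single row can only be on one side of any split, so either the model never sees a negative example during training or it is never tested on one.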
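
As a rough companion to the suggestion of checking whether the negative example is an outlier, the sketch below scores every row by how far it sits from the bulk of the data. It is only an illustration under loud assumptions: the features are synthetic (two hypothetical columns drawn from a standard normal, with one row planted far away to stand in for the negative example), and the score is a simple per-feature z-score distance, not a dimensionality reduction or a full anomaly detection method. On real data the negative row may well not stand out like this.

```python
import random
import statistics

random.seed(0)  # fixed seed so the synthetic data is reproducible

# Synthetic stand-in for the real features (an assumption, not the asker's data):
# 3999 "positive" rows in two hypothetical features, plus one planted "negative" row.
rows = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(3999)]
rows.append((8.0, 8.0))
negative_index = len(rows) - 1


def z_score_distance(rows):
    """Return, for each row, the Euclidean norm of its per-feature z-scores."""
    columns = list(zip(*rows))
    means = [statistics.fmean(c) for c in columns]
    stdevs = [statistics.pstdev(c) for c in columns]
    return [
        sum(((x - m) / s) ** 2 for x, m, s in zip(row, means, stdevs)) ** 0.5
        for row in rows
    ]


scores = z_score_distance(rows)

# Index of the row that lies farthest from the bulk of the data.
most_anomalous = max(range(len(rows)), key=scores.__getitem__)
print(most_anomalous == negative_index)
```

If the single negative row gets the highest score by a wide margin on the real features, that supports treating the problem as anomaly detection. It still says nothing about the probability of success in the next quarter.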