Tags: python-2.7, machine-learning, statistics, logistic-regression

Computational Logistic Regression With Python, Different Sample Sizes


Currently, I am trying to implement a basic logistic regression algorithm in Python to differentiate between class A and class B.

For my training and test data, I have ~50,000 samples of A vs. ~1,000 samples of B. Is it a problem if I use half of each class to train the algorithm and the other half as test data (25,000 A and 500 B for training, and likewise for measuring test accuracy)?
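For concreteness, the split I have in mind looks roughly like this (a sketch with placeholder data; `X` and `y` stand in for my real features and labels):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: ~50,000 A samples (label 0) and ~1,000 B samples (label 1).
X = np.random.rand(51000, 10)
y = np.array([0] * 50000 + [1] * 1000)

# A stratified 50/50 split preserves the A:B ratio in both halves,
# i.e. ~25,000 A / ~500 B for training and the same for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=0)
```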

If so, how can I overcome this problem? Should I consider resampling, or doing some other "fancy stuff"?


Solution

  • How much of a problem it is depends on the nature of your data. The bigger issue is the sheer class imbalance: roughly 50 As for every B. If you end up getting good classification accuracy anyway, then fine - nothing to do. Otherwise, what to do next depends on your data, the nature of the problem, and what is acceptable in a solution. There really isn't a one-size-fits-all "do this" answer to this question, though a couple of common starting points are sketched below.
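
As an illustration only, here is a minimal sketch of two common first attempts at handling the imbalance, assuming scikit-learn and the `X_train`/`y_train`/`X_test`/`y_test` arrays from the split above:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Option 1: reweight classes inversely to their frequency, so the
# ~500 B training samples carry as much total weight as the ~25,000 As.
clf = LogisticRegression(class_weight='balanced')
clf.fit(X_train, y_train)

# With a 50:1 imbalance, raw accuracy is misleading (always predicting
# A already scores ~98%), so inspect per-class precision and recall.
print(classification_report(y_test, clf.predict(X_test)))

# Option 2: randomly undersample the As so the training set is balanced.
# Simple to try, but it discards most of the A data.
a_idx = np.where(y_train == 0)[0]
b_idx = np.where(y_train == 1)[0]
keep = np.concatenate(
    [np.random.choice(a_idx, size=len(b_idx), replace=False), b_idx])
clf2 = LogisticRegression()
clf2.fit(X_train[keep], y_train[keep])
print(classification_report(y_test, clf2.predict(X_test)))
```

Whether either of these actually helps depends on your data; the point is to compare per-class metrics, not overall accuracy, when deciding.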