Tags: machine-learning, statistics, computer-vision, mahout, logistic-regression

Logistic regression classifier training count


I want to train a gender classifier but am confused by a problem.

I have about 100,000 labeled examples (25,000 male, 75,000 female). I will divide this dataset into local-train (60%) and local-test (40%):

List<LabeledPoint> males = getMales();     // <-- 25,000
List<LabeledPoint> females = getFemales(); // <-- 75,000

List<LabeledPoint> local = new ArrayList<>(males);
local.addAll(females); // <-- union of both sets

List<LabeledPoint>[] splits = randomSplit(local, new double[] {0.6, 0.4});
List<LabeledPoint> trainingData = splits[0];
List<LabeledPoint> testData = splits[1];

LogisticRegressionModel model = new LogisticRegressionWithLBFGS()
  .setNumClasses(2) // <-- binary problem: male vs. female
  .run(trainingData);
List<LabeledPoint> predictedList = model.predict(testData);

int fErrorCount = 0, mErrorCount = 0;
for (LabeledPoint predict : predictedList) {
    if (predict.label().equals("f") && !predict.predictLabel().equals("f")) {
        fErrorCount++; // female misclassified as male
    }
    if (predict.label().equals("m") && !predict.predictLabel().equals("m")) {
        mErrorCount++; // male misclassified as female
    }
}

The prediction result for the local-test data was:

#1 (all numbers are item counts)
Total predictions: 36,152, errors: 6,619, error ratio 0.18 (18%)
F: 27,747, errors: 2,916, error ratio 0.10 (10%)
M: 8,405, errors: 3,703, error ratio 0.44 (44%)

As you can see, prediction for females is very good, but for males it is far too poor. I expected the same error ratio for both females and males. This classifier seems good for female targeting but useless for male targeting.

So I pre-sampled to balance females and males equally. Now I have 50,000 labeled examples (25,000 male, 25,000 female):

List<LabeledPoint> males = getMales();                       // <-- 25,000
List<LabeledPoint> females = getFemales().subList(0, 25000); // <-- hard-resized to 25,000
// from here everything is the same as above

The prediction result was:

#2
Total predictions: 16,814, errors: 4,369, error ratio 0.26 (26%)
F: 8,407, errors: 2,225, error ratio 0.265 (26.5%)
M: 8,407, errors: 2,144, error ratio 0.255 (25.5%)

The unlabeled data this model will predict in production will probably skew toward more females than males (like #1, f:m = 75:25), but the ratio could change in the future, e.g. to (f:m = 30:70) or (f:m = 80:20).

In that case:

  1. How do I build the most adaptable model?
  2. Is there a way to build a model that guarantees a stable error ratio for both females and males?
  3. Is there a way to build a model that guarantees a stable error ratio even if the female:male ratio changes?
  4. Does try #2 make any sense?

Thanks.


Solution

  • Your problem here is less about ratios and more about diagnosing whether your model suffers from high bias (underfitting) or high variance (overfitting).

    For your first run:

    Your male:female ratio was 1:3, with 25,000 labelled observations for males and 75,000 labelled observations for females.

    It seems your algorithm had a high error for males on your test split (40%). Find out what your error is on your training split (60%); a minimal sketch of this diagnostic is shown below. Once you have those numbers, proceed with the two cases that follow it:
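
    A hypothetical sketch of that diagnostic, assuming Spark MLlib with a JavaRDD<LabeledPoint> training split, a trained model, and numeric labels (1.0 = female, 0.0 = male; these encodings are my assumption, not from the post):

        // Per-class training error (assumes labels 1.0 = female, 0.0 = male)
        long fTotal  = trainingData.filter(p -> p.label() == 1.0).count();
        long fErrors = trainingData.filter(p ->
                p.label() == 1.0 && model.predict(p.features()) != 1.0).count();
        long mTotal  = trainingData.filter(p -> p.label() == 0.0).count();
        long mErrors = trainingData.filter(p ->
                p.label() == 0.0 && model.predict(p.features()) != 0.0).count();
        System.out.printf("train F error: %.3f, train M error: %.3f%n",
                (double) fErrors / fTotal, (double) mErrors / mTotal);

    Compare these training-split errors with the corresponding test-split errors from your loop above.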

    Case 1 (likely): If your training-set error for males is significantly lower than your test-run error (which I suspect is the case), your model suffers from high variance (overfitting). In other words, your model fits the training data well for males but fails to generalize to new examples (test data). One way to fix this is simply to add more data, which I assume may be tough since you only have 25,000 male examples. Another way is via regularization; a sketch is shown below. In a nutshell, regularization penalizes your cost function for thetas (parameters) that are too large, since very large theta values tend to result in overfitting.
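
    With Spark MLlib's LogisticRegressionWithLBFGS, L2 regularization can be switched on through the underlying optimizer. A minimal sketch (the regParam value 0.1 is an arbitrary starting point, not a recommendation):

        import org.apache.spark.mllib.classification.LogisticRegressionModel;
        import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS;
        import org.apache.spark.mllib.optimization.SquaredL2Updater;

        LogisticRegressionWithLBFGS lr = new LogisticRegressionWithLBFGS();
        lr.setNumClasses(2); // binary problem: male vs. female
        lr.optimizer()
          .setUpdater(new SquaredL2Updater()) // L2 penalty on the weights
          .setRegParam(0.1);                  // penalty strength; tune this
        // trainingData is assumed to be a JavaRDD<LabeledPoint>
        LogisticRegressionModel model = lr.run(trainingData.rdd());

    Sweep a few regParam values (e.g. 0.01, 0.1, 1.0) and keep the one that gives the best male error on the test split.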

    Case 2: If your training-set error for males is also high (near the level of the test-run error), you most likely have a high-bias (underfitting) problem. One way to fix this is to increase the complexity of your model: perhaps add more features, or make your model a higher-order polynomial function than it currently is (a simple expansion is sketched below). But be careful: you don't want your female classification to become overfitted as a result.
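
    As an illustration of the higher-order option, a hypothetical sketch (the helper addSquaredTerms is my name, not an MLlib API) that appends quadratic terms to each feature vector before training:

        import org.apache.spark.mllib.linalg.Vectors;
        import org.apache.spark.mllib.regression.LabeledPoint;

        // Append x_i^2 for every feature x_i, giving the otherwise linear
        // model a second-order decision boundary.
        static LabeledPoint addSquaredTerms(LabeledPoint p) {
            double[] x = p.features().toArray();
            double[] expanded = new double[x.length * 2];
            for (int i = 0; i < x.length; i++) {
                expanded[i] = x[i];
                expanded[x.length + i] = x[i] * x[i]; // quadratic term
            }
            return new LabeledPoint(p.label(), Vectors.dense(expanded));
        }

        // Usage, assuming trainingData is a JavaRDD<LabeledPoint>:
        // JavaRDD<LabeledPoint> expanded = trainingData.map(MyClass::addSquaredTerms);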

    Comments about your second run: Making the ratio 50:50 by decreasing the female observations from 75,000 to 25,000 will seldom make a positive difference. In fact, it can even be detrimental, as you experienced. Playing with ratios, in this case, is not the answer. Once again, diagnose whether your model suffers from high variance or high bias, and proceed accordingly.