machine-learning · classification · random-forest · decision-tree · ensemble-learning

How to deal with this unbalanced-class skewed data-set?


I have to deal with a class-imbalance problem: I need to do binary classification on an input test data-set, and in the training data-set the majority class-label is 1 (the other class-label is 0).

For example, here is a small part of the training data:

93.65034,94.50283,94.6677,94.20174,94.93986,95.21071,1
94.13783,94.61797,94.50526,95.66091,95.99478,95.12608,1
94.0238,93.95445,94.77115,94.65469,95.08566,94.97906,1
94.36343,94.32839,95.33167,95.24738,94.57213,95.05634,1
94.5774,93.92291,94.96261,95.40926,95.97659,95.17691,0
93.76617,94.27253,94.38002,94.28448,94.19957,94.98924,0

where the last column is the class-label: 0 or 1. The actual data-set is very skewed, with roughly a 10:1 class ratio: around 700 samples have 0 as their class-label, while the remaining 6800 have 1.

The rows above are only a handful of the samples in the given data-set; overall about 90% of the samples carry class-label 1 and the rest carry 0, even though more or less all the samples look very similar.

Which classifier would be best suited for handling this kind of data-set?

I have already tried logistic regression as well as an SVM with the class-weight parameter set to "balanced", but got no significant improvement in accuracy.
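
For reference, this is roughly the setup I used (a minimal sketch; "train.csv" is a placeholder for the actual file):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# "train.csv" stands in for the real file; the last column is the class-label
data = np.loadtxt("train.csv", delimiter=",")
X, y = data[:, :-1], data[:, -1]

# class_weight="balanced" re-weights classes inversely to their frequency
log_reg = LogisticRegression(class_weight="balanced").fit(X, y)
svm = SVC(class_weight="balanced").fit(X, y)

print(log_reg.score(X, y))  # .score() reports plain accuracy
print(svm.score(X, y))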


Solution

  • but got no significant improvement in accuracy.

    Accuracy isn't the way to go here (e.g. see the Accuracy paradox). With a 10:1 class ratio you can easily get about 90% accuracy just by always predicting class-label 1.

    Some good starting points are:

    • try a different performance metric, e.g. the F1-score or the Matthews correlation coefficient (a quick sketch follows this list)

    • "resample" the dataset: add examples from the under-represented class (over-sampling) / delete instances from the over-represented class (under-sampling; you should have a lot of data)

    • a different point of view: treating the task as anomaly detection is worth a try on an imbalanced dataset

    • a different algorithm is another possibility, but not a silver bullet. You could start with decision trees, which often perform well on imbalanced datasets.
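
    To illustrate the first bullet, here is a minimal sketch; y_true and y_pred are placeholders standing in for your test labels and predictions:

    from sklearn.metrics import confusion_matrix, f1_score, matthews_corrcoef

    # placeholders: a 9:1 skewed ground truth and an "always predict 1" baseline
    y_true = [1, 1, 1, 1, 1, 1, 1, 1, 1, 0]
    y_pred = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

    print(f1_score(y_true, y_pred))           # ~0.95: still flattering to the majority class
    print(matthews_corrcoef(y_true, y_pred))  # 0.0: exposes that nothing was learned
    print(confusion_matrix(y_true, y_pred))   # [[0 1], [0 9]]: every true 0 was missed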
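
    And a sketch of the over-sampling idea from the second bullet, using sklearn.utils.resample; random arrays stand in for the real features, and the imbalanced-learn package offers fancier schemes such as SMOTE:

    import numpy as np
    from sklearn.utils import resample

    # random stand-in with the ~10:1 skew of the actual data-set
    rng = np.random.RandomState(0)
    X = rng.rand(7500, 6)
    y = np.array([1] * 6800 + [0] * 700)

    X_maj, X_min = X[y == 1], X[y == 0]

    # draw minority samples with replacement until the classes are even
    X_min_up = resample(X_min, replace=True, n_samples=len(X_maj), random_state=0)

    X_bal = np.vstack([X_maj, X_min_up])
    y_bal = np.array([1] * len(X_maj) + [0] * len(X_min_up))

    Note that in practice you should resample only the training split; over-sampling before the train/test split leaks duplicated minority samples into the test set and makes the evaluation over-optimistic.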


    EDIT (now knowing you're using scikit-learn)

    The weights from the class_weight parameter (scikit-learn) are used to train the classifier (so "balanced" is fine), but accuracy is a poor choice for judging how well it performs.

    The sklearn.metrics module implements several loss, score and utility functions to measure classification performance. Also take a look at How to compute precision, recall, accuracy and f1-score for the multiclass case with scikit learn?.
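
    Putting the two together, you can hand one of those metrics to cross-validation via the scoring parameter. A sketch, with synthetic data standing in for the real set:

    import numpy as np
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    # synthetic stand-in with the same ~10:1 skew
    rng = np.random.RandomState(0)
    X = rng.rand(7500, 6)
    y = np.array([1] * 6800 + [0] * 700)

    clf = DecisionTreeClassifier(class_weight="balanced", random_state=0)

    # scoring="f1" evaluates with the F1-score (label 1 as positive) instead of accuracy
    print(cross_val_score(clf, X, y, cv=5, scoring="f1").mean())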