Tags: python, machine-learning, scikit-learn, nlp, precision-recall

How to improve Precision and Recall on Imbalanced Dataset in Python


I built a supervised model to classify medical text data (the output predicts the positive or negative occurrence of a disease). The data is very imbalanced (130 positive cases versus 1,600 negative cases, which is understandable since the disease is rare). I first cleaned the data (removed unnecessary words, lemmatized, etc.) and then applied POS tagging. I then applied a TfidfVectorizer to the cleaned data (TfidfVectorizer already includes the TfidfTransformer step). For classification, I tried both SVM and Random Forest, but achieved only 56% precision and 58% recall on the positive class, even after tuning their parameters with GridSearchCV (I also set class_weight='balanced'). Does anyone have advice on how to improve this low precision and recall? Thank you very much.

Here is my current Pipeline (I only use one of the classifiers per run, but I show both to display their parameters).

pipeline = Pipeline([
    ('vectors', TfidfVectorizer(ngram_range=(2, 3), norm='l1',
                                token_pattern=r"\w+\b\|\w+",
                                min_df=2, max_features=1000)),
    # Note: do not call .fit() here -- the Pipeline fits each step itself,
    # and step names must be unique, so keep only one 'classifier' active.
    ('classifier', RandomForestClassifier(n_estimators=51, min_samples_split=8,
                                          min_samples_leaf=2, max_depth=14,
                                          class_weight='balanced')),
    # ('classifier', SVC(C=1000, gamma=1, class_weight='balanced',
    #                    kernel='linear')),
])

Solution

  • First, have a look at the data that your classifiers are seeing. Measure the correlation between features and the class (Pearson correlation is fine) and check whether you have irrelevant features. For example, the word patient is not usually considered a stopword, but in a medical database it will most likely be one.

    Also consider using more complex features, like bigrams or trigrams, or even adding word embeddings (e.g., take a pretrained model such as word2vec or GloVe, then average the word vectors of each text to get a document vector).

    N.B.: These days text classification is mostly done with neural networks and word embeddings. That said, your dataset isn't very big, so it may not be worth it to change methods (or maybe you don't want to, for some reason).