
Oversampling of image data for keras


I am working on a Kaggle competition and trying to solve a multilabel classification problem with Keras.

My dataset is highly imbalanced. I am familiar with oversampling and have applied it to simple machine learning datasets, but I am not sure how to handle a dataset that combines images and CSV data.

There are a couple of related questions, but they did not help me:

Use SMOTE to oversample image data

How to oversample image dataset using Python?

Class
No finding            25462
Aortic enlargement     5738
Cardiomegaly           4345
Pleural thickening     3866
Pulmonary fibrosis     3726
Nodule/Mass            2085
Pleural effusion       1970
Lung Opacity           1949
Other lesion           1771
Infiltration            997
ILD                     792
Calcification           775
Consolidation           441
Atelectasis             229
Pneumothorax            185

I want to oversample, but I am not sure how to approach it. I have 15,000 PNG images and a train.csv file that looks like this:

   image_id                          class_name          class_id  rad_id  x_min   y_min   x_max   y_max   width  height
0  50a418190bc3fb1ef1633bf9678929b3  No finding          14        R11     0.0     0.0     0.0     0.0     2332   2580
1  21a10246a5ec7af151081d0cd6d65dc9  No finding          14        R7      0.0     0.0     0.0     0.0     2954   3159
2  9a5094b2563a1ef3ff50dc5c7ff71345  Cardiomegaly        3         R10     691.0   1375.0  1653.0  1831.0  2080   2336
3  051132a778e61a86eb147c7c6f564dfe  Aortic enlargement  0         R10     1264.0  743.0   1611.0  1019.0  2304   2880
4  063319de25ce7edb9b1c6b8881290140  No finding          14        R10     0.0     0.0     0.0     0.0     2540   3072

How should I approach this problem when the data consists of images plus a CSV file?

After I converted the data, it looks like this:

                               Images               Class
56     d106ec9b305178f3da060efe3191499a.png         Nodule/Mass
38694  081d1700020b6bf0099f1e4d8aeec0f3.png        Lung Opacity
50141  ff8ef73390f04480aba0be7810ef94cf.png          No finding
233    253d35b7096d0957bd79cfb4b1c954e1.png          No finding
2166   1951e0eba7c68aa1fbd6d723f19ee7c4.png  Pleural thickening

I use an image generator:

# Create a train generator
train_generator = train_dataGen.flow_from_dataframe(dataframe = train,
                                                directory = 'my_directory', 
                                                x_col = 'Images',
                                                y_col = 'Class',
                                                class_mode = 'categorical',
                                                # target_size = (256, 256),
                                                batch_size = 32)

I tried something naive, which obviously did not work:

# Create an instance
oversample = SMOTE()

# Oversample
train_ovsm, valid_ovsm = oversample.fit_resample(train_ovsm, valid_ovsm)

It gives me an error:

ValueError: could not convert string to float: '954984f75efe6890cfa45d0784a3a1e6.png'

I would appreciate any tips or good tutorials; I have not found anything so far.


Solution

  • I'm not sure whether this answer will satisfy you, but here are my thoughts. If I were you, I wouldn't try to balance the data the way you are attempting now; in my opinion, that is not the right approach. Your main concern is that the VinBigData dataset is highly imbalanced and you are not sure how to address that properly.

    Here are some approaches that most participants would try first to address this issue in this competition.
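For context on the `ValueError` in the question: SMOTE builds synthetic samples by interpolating between numeric feature vectors, so it cannot operate on a column of filename strings. If row-level resampling is still wanted, one option is plain random oversampling of the dataframe rows, letting augmentation make the repeated files look different at training time. The following is only a minimal sketch under assumptions: it uses the two-column `Images`/`Class` layout from the question, the helper name `oversample_rows` is hypothetical, and the toy rows are made up for illustration.

```python
import pandas as pd

def oversample_rows(df, label_col="Class", target=None, seed=0):
    """Resample every class with replacement up to `target` rows.

    By default `target` is the size of the largest class.
    Hypothetical helper, not part of Keras or imbalanced-learn.
    """
    counts = df[label_col].value_counts()
    if target is None:
        target = int(counts.max())
    parts = []
    for _, group in df.groupby(label_col):
        if len(group) < target:
            # Draw the missing rows (filenames) with replacement.
            extra = group.sample(n=target - len(group), replace=True,
                                 random_state=seed)
            group = pd.concat([group, extra])
        parts.append(group)
    # Shuffle so that batches are not ordered by class.
    return (pd.concat(parts)
              .sample(frac=1.0, random_state=seed)
              .reset_index(drop=True))

# Toy stand-in for the training frame (made-up filenames).
train = pd.DataFrame({
    "Images": [f"img_{i}.png" for i in range(9)],
    "Class": ["No finding"] * 6 + ["Cardiomegaly"] * 2 + ["Pneumothorax"],
})
balanced = oversample_rows(train)
print(balanced["Class"].value_counts())
```

The resampled frame can then be passed to `flow_from_dataframe` in place of the original one. The train/validation split should be done before resampling, otherwise copies of the same image can end up on both sides of the split.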

    - External dataset 
    - Heavy and meaningful augmentation
    - Modified loss function
    

    External Datasets

    • NIH Chest X-rays
    • SIIM-ACR Pneumothorax Segmentation
    • OSIC Pulmonary Fibrosis Progression
    • RSNA Pneumonia Detection Challenge
    • Chest X-Ray Images (Pneumonia)

    What you need to do is collect all usable external samples from these datasets and combine them with the competition data into a new dataset. It may take time, but it is worth it.
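Combining external samples mostly comes down to mapping each external label onto a competition class name and appending the rows. The sketch below assumes the two-column `Images`/`Class` frame from the question. The external column names (`file`, `finding`), the label mapping, and every row are hypothetical placeholders, and each real dataset's label scheme has to be checked by hand.

```python
import pandas as pd

# Toy stand-in for the competition frame (made-up filenames).
train = pd.DataFrame({
    "Images": ["vin_0001.png", "vin_0002.png"],
    "Class": ["No finding", "Pneumothorax"],
})

# Toy stand-in for an external label file; column names are hypothetical.
external = pd.DataFrame({
    "file": ["ext_0001.png", "ext_0002.png", "ext_0003.png"],
    "finding": ["Pneumothorax", "Effusion", "Hernia"],
})

# Illustrative mapping from external label names to competition class names.
# Labels without a counterpart in the competition are left out.
label_map = {
    "Pneumothorax": "Pneumothorax",
    "Effusion": "Pleural effusion",
}

def merge_external(train_df, external_df, mapping):
    """Append external rows whose label maps onto a competition class."""
    ext = external_df.rename(columns={"file": "Images", "finding": "Class"})
    ext = ext[ext["Class"].isin(mapping)].copy()
    ext["Class"] = ext["Class"].map(mapping)
    ext["source"] = "external"          # keep provenance for later analysis
    base = train_df.assign(source="competition")
    return pd.concat([base, ext], ignore_index=True)

combined = merge_external(train, external, label_map)
print(combined)
```

Since the external images usually live in other folders, one option is to store absolute paths in `Images` and pass `directory=None` to `flow_from_dataframe`.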

    Medical Image Augmentation

    Augmentation is one of the key strategies for training deep learning models, but it is important to choose the right augmentations. Here are some demonstrations. The main idea is to avoid destroying diagnostically sensitive information in the image, so be careful with that.
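As an illustration of "gentle" augmentation, the sketch below applies only a small translation and mild brightness/contrast jitter using NumPy. The function name `gentle_augment` and all parameter values are assumptions chosen for illustration, not settings recommended by the competition. Flips and large rotations are deliberately left out here on the assumption that left/right orientation can matter on chest X-rays.

```python
import numpy as np

def gentle_augment(image, rng, max_shift=0.03, brightness=0.1, contrast=0.1):
    """Apply a small random shift and mild brightness/contrast jitter.

    image: array of shape (H, W, C) with values in [0, 255].
    Returns an array with the same shape and dtype.
    Hypothetical helper; the default ranges are illustrative guesses.
    """
    h, w = image.shape[:2]
    dy = int(round(max_shift * h))
    dx = int(round(max_shift * w))
    # Pad with edge values, then crop at a random offset: this translates
    # the image by at most (dy, dx) pixels without wrap-around.
    padded = np.pad(image, ((dy, dy), (dx, dx), (0, 0)), mode="edge")
    top = int(rng.integers(0, 2 * dy + 1))
    left = int(rng.integers(0, 2 * dx + 1))
    shifted = padded[top:top + h, left:left + w, :].astype(np.float64)
    # Mild contrast (gain around the mean) and brightness (additive bias).
    gain = 1.0 + rng.uniform(-contrast, contrast)
    bias = 255.0 * rng.uniform(-brightness, brightness)
    mean = shifted.mean()
    out = (shifted - mean) * gain + mean + bias
    return np.clip(out, 0.0, 255.0).astype(image.dtype)

rng = np.random.default_rng(0)
image = rng.uniform(0, 255, size=(64, 64, 3)).astype(np.float32)  # fake image
augmented = gentle_augment(image, rng)
print(augmented.shape, augmented.dtype)
```

A function like this could be plugged into the generator from the question through the `preprocessing_function` argument of `ImageDataGenerator`, for example `preprocessing_function=lambda img: gentle_augment(img, rng)`.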

    Class Loss Weighting

    You can modify the loss function so that each class's contribution is weighted, for example by giving rare classes a larger weight. Here is a detailed explanation of this topic.
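One common weighting scheme is inverse class frequency, `total / (n_classes * count)`, which can be passed to Keras through the `class_weight` argument of `model.fit`. The sketch below computes it from the class counts listed in the question. The index mapping built with `sorted` is an assumption standing in for `train_generator.class_indices`, which should be used instead when the generator is available.

```python
# Class counts copied from the question.
counts = {
    "No finding": 25462, "Aortic enlargement": 5738, "Cardiomegaly": 4345,
    "Pleural thickening": 3866, "Pulmonary fibrosis": 3726,
    "Nodule/Mass": 2085, "Pleural effusion": 1970, "Lung Opacity": 1949,
    "Other lesion": 1771, "Infiltration": 997, "ILD": 792,
    "Calcification": 775, "Consolidation": 441, "Atelectasis": 229,
    "Pneumothorax": 185,
}

def balanced_class_weights(class_counts):
    """Inverse-frequency weights: total / (n_classes * count)."""
    total = sum(class_counts.values())
    n_classes = len(class_counts)
    return {name: total / (n_classes * c) for name, c in class_counts.items()}

weights_by_name = balanced_class_weights(counts)

# Keras expects integer class indices as keys. With a real generator, use
# train_generator.class_indices; sorted names are only a stand-in here.
class_indices = {name: i for i, name in enumerate(sorted(counts))}
class_weight = {class_indices[n]: w for n, w in weights_by_name.items()}

for name in ("No finding", "Pneumothorax"):
    print(f"{name}: {weights_by_name[name]:.3f}")

# Usage (not executed here):
# model.fit(train_generator, class_weight=class_weight, epochs=...)
```

With these weights every class contributes the same total weight to the loss, so the rare classes are no longer swamped by "No finding". Very large weights on tiny classes can make training noisy, so capping or smoothing them is a reasonable variation to try.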