Tags: python, machine-learning, xgboost, lightgbm

Upweighting (adding weights to) the downsampled examples


Hi, I have downsampled my dataset and I need help upweighting (adding weights to) the downsampled examples. See the code below.

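import pandas as pd
from sklearn.utils import resample

# `data` is assumed to be an already-loaded DataFrame with a binary Collected_ind column

# Separate majority and minority classes
df_majority = data[data.Collected_ind == 1]
df_minority = data[data.Collected_ind == 0]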

# Downsample majority class
df_majority_downsampled = resample(df_majority, 
                                 replace=False,    # sample without replacement
                                 n_samples=152664,     # to match minority class
                                 random_state=1) # reproducible results

# Combining minority class with downsampled majority class
df_downsampled = pd.concat([df_majority_downsampled, df_minority])

# Display new class counts and proportions
print(df_downsampled['Collected_ind'].value_counts())
print(df_downsampled['Collected_ind'].value_counts(normalize=True))

# Randomly shuffle the rows
df_downsampled = df_downsampled.sample(frac=1, random_state=1)

df_downsampled.to_csv("Sampled_Data.csv", index=False)

# Generate a train and test dataset (80/20 split)
train = df_downsampled.sample(frac=0.8, random_state=1)
test = df_downsampled.drop(train.index)

train.to_csv("trainNew.csv", index=False)
test.to_csv("testNew.csv", index=False)

Solution

  • Your question actually helped me answer my own, because I was looking for this syntax. Since I'm here anyway, I'll show you what I do. I don't know if your definition of weight is the same as mine, but here's the one we use:

    class_weight = (original_class_count/original_row_count) / (new_class_count/new_row_count)
    

    To adapt your code, I would replace the hard-coded n_samples=152664 with n_samples=len(df_minority), then add the formula above as a column in your dataframe, computing each count from the lengths of the relevant dataframes instead of hard-coding it.

    Perhaps something like

    import numpy as np

    # weight = (class share before sampling) / (class share after sampling)
    df_downsampled['weight'] = np.where(
        df_downsampled['Collected_ind'] == 1,
        (len(df_majority) / len(data)) / (len(df_majority_downsampled) / len(df_downsampled)),
        (len(df_minority) / len(data)) / (len(df_minority) / len(df_downsampled)),
    )
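As a sanity check on the weight formula above, here is a minimal sketch on a made-up dataset. The 900/100 class split and the `feature` column are assumptions for illustration only, and it uses pandas' own `DataFrame.sample` in place of `sklearn.utils.resample` to keep the example small. With these counts the majority class should get weight 0.9 / 0.5 = 1.8 and the minority class 0.1 / 0.5 = 0.2.

```python
import numpy as np
import pandas as pd

# Toy stand-in for `data` (hypothetical): 900 rows of class 1, 100 rows of class 0.
data = pd.DataFrame({
    "feature": np.arange(1000),
    "Collected_ind": [1] * 900 + [0] * 100,
})

df_majority = data[data.Collected_ind == 1]
df_minority = data[data.Collected_ind == 0]

# Downsample the majority class, without replacement, to the minority class size.
df_majority_downsampled = df_majority.sample(n=len(df_minority), random_state=1)
df_downsampled = pd.concat([df_majority_downsampled, df_minority])

# Share of each class before and after downsampling, indexed by class label.
original_share = data["Collected_ind"].value_counts(normalize=True)
new_share = df_downsampled["Collected_ind"].value_counts(normalize=True)

# class_weight = original share / new share, computed for both classes at once.
weights = original_share / new_share

# Attach the per-class weight to every row.
df_downsampled["weight"] = df_downsampled["Collected_ind"].map(weights)

# Check: the weighted class shares should match the original class shares.
weighted_share = (
    df_downsampled.groupby("Collected_ind")["weight"].sum()
    / df_downsampled["weight"].sum()
)
print(weights)
print(weighted_share)
```

If the weights are right, the weighted class shares come back to the original 90/10 split, which is the point of upweighting after downsampling. The `weight` column would then typically be passed as `sample_weight` when fitting (the scikit-learn style `fit` methods of XGBoost and LightGBM accept that argument), and it should be excluded from the feature columns.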