How do I avoid the dummy variable trap with one-hot encoding?

Here is my code for CSV data extraction and transformation:

Schema schema = new Schema.Builder()
    TransformProcess transformProcess = new TransformProcess.Builder(schema)
    RecordReader reader = new CSVRecordReader(1,',');
    reader.initialize(new FileSplit(new ClassPathResource("Churn_Modelling.csv").getFile()));
    TransformProcessRecordReader transformProcessRecordReader = new TransformProcessRecordReader(reader,transformProcess);
    System.out.println("args = " + transformProcessRecordReader.next() + "");

I just tried printing the first record:

args = [619, 1, 0, 0, 1, 42, 2, 0, 1, 1, 1, 101348.88, 1]

For example, the three values after 619 (1, 0, 0) are the one-hot columns of a single categorical variable. I would like to drop the first of them, so that 619 is followed by only 0, 0.

Basically, I would like to treat the first category as the base category and encode the others relative to it, to avoid multicollinearity (the dummy variable trap).

How do I do that? Can anyone advise on this?
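
To make the goal concrete, here is a minimal plain-Java sketch of the encoding I am after. It does not use DataVec at all: the class DropFirstOneHot, its methods, and the category names "A", "B", "C" are hypothetical placeholders, and the category order is assumed to match whatever order the schema defines.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only; not part of DataVec.
class DropFirstOneHot {

    // Full one-hot encoding: one 0/1 column per category.
    static int[] oneHot(List<String> categories, String value) {
        int index = categories.indexOf(value);
        if (index < 0) {
            throw new IllegalArgumentException("Unknown category: " + value);
        }
        int[] encoded = new int[categories.size()];
        encoded[index] = 1;
        return encoded;
    }

    // Drop-first encoding: the first category becomes the base level
    // and is represented by all zeros in the remaining columns.
    static int[] oneHotDropFirst(List<String> categories, String value) {
        int[] full = oneHot(categories, value);
        return Arrays.copyOfRange(full, 1, full.length);
    }

    public static void main(String[] args) {
        // Placeholder categories; the real ones come from the CSV column.
        List<String> categories = Arrays.asList("A", "B", "C");
        for (String value : categories) {
            System.out.println(value
                    + " full=" + Arrays.toString(oneHot(categories, value))
                    + " dropFirst=" + Arrays.toString(oneHotDropFirst(categories, value)));
        }
    }
}
```

With three categories, the base category "A" goes from [1, 0, 0] to [0, 0], which is the shape I want after 619 in the record above.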


  • You could check the final transformation schema with transformProcess.getFinalSchema(), and then remove the corresponding second column (the first one-hot column, right after 619) with:

    TransformProcess transformProcess = ... same as before...