
Multiple values per row in one hot encoding - is this recommended?


I'm using the one_hot function in mltools to convert a 2 variable molten data frame into a wide data frame where each variable (apart from an index) is a factor level.

There are 25,000 rows in the molten frame and only 2 variables - one a factor with 800 levels and one an index so I can merge back at a later point.

I'm going to use a variety of machine learning packages and hence need to represent the 800 factor levels in an acceptable way.

However, when I use one_hot I get a frame with 801 columns, which is correct (800 factor levels + 1 index), but it still has 25,000 rows. The index contains only 1,000 unique values, i.e. there are 1,000 original observations.

So, my question is: is it best practice for one-hot variables to have only one positive value per row? Is there a disadvantage to collapsing the frame so that each row is a single observation?

Thanks.


Solution

  • I will answer the question based on the information you gave.

    You basically have 25,000 indexed observations (represented by id) of one variable (the factor variable with 800 levels, represented by val). What you can do is:

    1. Group by the factor variable (e.g., via group_by())
    2. Add a frequency count (e.g., via freq = n())
    3. One-hot the factor variable (the mltools package is really great for that)

    This will leave you without your index, but with an 800 x 801 table: one row per factor level, with the one-hot indicators in columns 1:800 and the frequency of each level in column 801. Many frameworks can work with data like this very well, but it is impossible to give a more specific answer without further information.

    > str(result)
    Classes ‘data.table’ and 'data.frame':  800 obs. of  801 variables:
     $ val_AAL5 : int  1 0 0 0 0 0 0 0 0 0 ...
     $ val_ABP14: int  0 1 0 0 0 0 0 0 0 0 ...
     $ val_ACQ8 : int  0 0 1 0 0 0 0 0 0 0 ...
     $ val_ADU8 : int  0 0 0 1 0 0 0 0 0 0 ...
     $ val_AEB16: int  0 0 0 0 1 0 0 0 0 0 ...
     $ val_AEX17: int  0 0 0 0 0 1 0 0 0 0 ...
     $ val_AGQ4 : int  0 0 0 0 0 0 1 0 0 0 ...
     $ val_AHS8 : int  0 0 0 0 0 0 0 1 0 0 ...
     $ val_AHV2 : int  0 0 0 0 0 0 0 0 1 0 ...
     $ val_AHX16: int  0 0 0 0 0 0 0 0 0 1 ...
     $ val_AIV19: int  0 0 0 0 0 0 0 0 0 0 ...
    ...
    

    Code

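    # Count how often each level of val occurs (one row per level)
    df <- df %>%
        group_by(val) %>%
        summarise(freq = n())
    dt <- as.data.table(df)
    # one_hot() only encodes factor columns; in R >= 4.0 val is a character column by default
    dt[, val := as.factor(val)]
    result <- one_hot(dt)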
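
    A quick sanity check for a frequency table like this is that every row has exactly one positive indicator and that the frequencies add up to the number of molten rows. The sketch below uses hypothetical toy data (six rows, three levels) and a data.table grouping in place of group_by(); the names counts and hot_cols are illustrative only.

    ```r
    library(data.table)
    library(mltools)

    # Hypothetical toy data: 6 molten rows, 3 factor levels
    dt <- data.table(val = factor(c("A", "B", "A", "C", "A", "B")))

    # data.table equivalent of group_by(val) + summarise(freq = n())
    counts <- dt[, .(freq = .N), by = val]

    # One indicator column per level, plus the freq column
    result <- one_hot(counts, cols = "val")

    # Every row should have exactly one positive indicator ...
    hot_cols <- setdiff(names(result), "freq")
    one_per_row <- all(rowSums(result[, ..hot_cols]) == 1)

    # ... and the frequencies should add up to the number of molten rows
    freq_matches <- sum(result$freq) == nrow(dt)
    ```

    If either check fails on real data, the grouping column probably contains NA values or was not a factor when one_hot() was called.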
    

    Data

    library(dplyr)
    library(data.table)
    library(mltools)
    set.seed(1701)
    df <- data.frame(
        id = 1:25000,
        val = sample(paste0(sample(LETTERS[1:26], 800, replace = TRUE),
                sample(LETTERS[1:26], 800, replace = TRUE),
                sample(LETTERS[1:26], 800, replace = TRUE),
                sample(1:20, 20, replace = TRUE)),
            25000, replace = TRUE))
    
    > head(df)
      id   val
    1  1 CXC15
    2  2 IPH16
    3  3  ICK1
    4  4  OPJ2
    5  5  XSA8
    6  6 JKS19
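
    If the index has to be kept and repeats across rows (as in the question, where 1,000 unique ids span 25,000 rows), one possible way to get one row per observation is to one-hot first and then aggregate the indicator columns by id. This is only a sketch on hypothetical toy data; whether indicators (max) or counts (sum) are the better representation depends on the model you feed it to.

    ```r
    library(data.table)
    library(mltools)

    # Hypothetical toy data: 3 observations (id) spread over 6 molten rows
    dt <- data.table(
        id  = c(1, 1, 2, 3, 3, 3),
        val = factor(c("A", "B", "A", "B", "C", "C"))
    )

    # One row per molten row, one indicator column per factor level
    wide <- one_hot(dt, cols = "val")

    # Collapse to one row per id: a level is 1 if it occurs at least once for that id
    multi_hot <- wide[, lapply(.SD, max), by = id]

    # Alternative: keep how often each level occurs for that id
    level_counts <- wide[, lapply(.SD, sum), by = id]
    ```

    Here multi_hot has several positive values in a row whenever an id is associated with more than one level, which is what the collapsed frame in the question would look like.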