I'm using the one_hot function in mltools to convert a 2-variable molten data frame into a wide data frame where each variable (apart from an index) is a factor level.
There are 25,000 rows in the molten frame and only 2 variables - one a factor with 800 levels and one an index so I can merge back at a later point.
I'm going to use a variety of machine learning packages and hence need to represent the 800 factor levels in an acceptable way.
However, when I use one_hot I get a frame with 801 columns, which is correct (800 factor levels + 1 index), but I still have 25,000 rows. The number of original observations, represented as unique values in the index, is 1,000.
So, my question is: is it best practice for one-hot variables to have only one positive value per row? Is there any disadvantage to collapsing this down so that each row is a single observation?
Thanks.
I will answer the question based on the information you gave.
You basically have 25,000 indexed observations (represented by id) of one variable (the factor variable with 800 levels, represented by val). What you can do is:
1. Group the data by the factor variable (group_by())
2. Count how often each level occurs (freq = n())
3. One-hot encode the result (the mltools package is really great for that)

This will leave you without your index, but with an 800 x 801 table that contains each one-hotted level (columns 1:800) and its frequency (column 801). Many frameworks can work with data like this very well, but it is impossible to answer specifically without further information.
library(dplyr)
library(data.table)
library(mltools)

set.seed(1701)

# Example data: 25,000 indexed rows of one factor variable with 800 levels
df <- data.frame(
  id = 1:25000,
  val = sample(paste0(sample(LETTERS[1:26], 800, replace = TRUE),
                      sample(LETTERS[1:26], 800, replace = TRUE),
                      sample(LETTERS[1:26], 800, replace = TRUE),
                      sample(1:20, 20, replace = TRUE)),
               25000, replace = TRUE),
  stringsAsFactors = TRUE)  # one_hot() only encodes factor columns

> head(df)
  id   val
1  1 CXC15
2  2 IPH16
3  3  ICK1
4  4  OPJ2
5  5  XSA8
6  6 JKS19

# Count how often each factor level occurs (this drops the id column)
df <- df %>%
  group_by(val) %>%
  summarise(freq = n())

# one_hot() expects a data.table
dt <- as.data.table(df)
result <- one_hot(dt)

> str(result)
Classes ‘data.table’ and 'data.frame': 800 obs. of 801 variables:
 $ val_AAL5 : int 1 0 0 0 0 0 0 0 0 0 ...
 $ val_ABP14: int 0 1 0 0 0 0 0 0 0 0 ...
 $ val_ACQ8 : int 0 0 1 0 0 0 0 0 0 0 ...
 $ val_ADU8 : int 0 0 0 1 0 0 0 0 0 0 ...
 $ val_AEB16: int 0 0 0 0 1 0 0 0 0 0 ...
 $ val_AEX17: int 0 0 0 0 0 1 0 0 0 0 ...
 $ val_AGQ4 : int 0 0 0 0 0 0 1 0 0 0 ...
 $ val_AHS8 : int 0 0 0 0 0 0 0 1 0 0 ...
 $ val_AHV2 : int 0 0 0 0 0 0 0 0 1 0 ...
 $ val_AHX16: int 0 0 0 0 0 0 0 0 0 1 ...
 $ val_AIV19: int 0 0 0 0 0 0 0 0 0 0 ...
 ...
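
If you would rather keep the index and end up with one row per observation, as the question asks, one option is to one-hot encode first and then aggregate by id. Here is a minimal sketch on made-up toy data. The column names id and val follow the example data; the 5 ids and 4 levels are hypothetical stand-ins for the 1,000 observations and 800 levels, and the object names wide and collapsed are only for illustration.

```r
library(data.table)
library(mltools)

set.seed(1)

# Hypothetical toy data: several rows per id, and a 4-level factor
# standing in for the real 800-level factor.
dt <- data.table(
  id  = sample(1:5, 20, replace = TRUE),
  val = factor(sample(c("A", "B", "C", "D"), 20, replace = TRUE),
               levels = c("A", "B", "C", "D"))
)

# One-hot encode: still one row per original row, one column per level.
wide <- one_hot(dt, cols = "val")

# Collapse to one row per id. max() yields a 0/1 presence indicator;
# replace max with sum to get per-level counts instead.
collapsed <- wide[, lapply(.SD, max), by = id]
```

Strictly speaking, the collapsed table is a multi-hot (indicator) encoding rather than a one-hot one, because a row can now have several positive values. Using sum() instead of max() keeps the information about how often a level occurs for an observation. Whether either form suits a particular model depends on the framework, so treat this as a starting point rather than a recommendation.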