R large data frame with factors won't shrink when subset


I have a large-ish data frame (100k rows x 50 columns) with several factor variables. I want a small subset (about 100 rows) to do some prototyping with. The problem is that when I type:

train <- train[1:100,]

the number of rows shrinks (checked with dim()), but each factor still appears to store all the levels from the original data frame (I'm measuring memory size using lsos() found here).

Is there a way to get around this? So far the only way I've found is to convert the factor variables to character strings, subset, and then convert them back to factors. I feel like there has to be a better way to do this.

Any suggestions?


Solution

  • Use the droplevels function to get rid of the levels that no longer occur in the new data.frame; see ?droplevels for more info.

    Example:

    > DF <- data.frame(num=1:15, letter=rep(letters[1:5], each=3), random=rnorm(15), stringsAsFactors=TRUE)
    > levels(DF[, 2]) # all levels
    [1] "a" "b" "c" "d" "e"
    > 
    > DF2 <- DF[1:10, ] # subsetting
    > levels(DF2[, 2]) # all levels again
    [1] "a" "b" "c" "d" "e"
    > DF2[, 2] <- droplevels(DF2[, 2])
    > levels(DF2[, 2]) # only the levels contained in DF2
    [1] "a" "b" "c" "d"
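
droplevels also has a data.frame method, so for a frame with many factor columns it may be simpler to call it once on the whole subset instead of column by column. The following is only a minimal sketch on made-up data: the names `big` and `small` and the column layout are assumptions for illustration, not the asker's actual `train` data.

```r
# Illustrative data only; `big`, `small`, and the columns are hypothetical.
set.seed(1)
big <- data.frame(
  id  = 1:1000,
  key = factor(paste0("level_", 1:1000)),              # one distinct level per row
  grp = factor(sample(letters, 1000, replace = TRUE))
)

small <- big[1:100, ]
nlevels(small$key)                 # subsetting alone keeps every original level
size_before <- object.size(small)

small <- droplevels(small)         # data.frame method: handles all factor columns
nlevels(small$key)                 # only the levels still present in `small`
size_after <- object.size(small)

as.numeric(size_after) < as.numeric(size_before)
```

Applied to the question, this would amount to something like `train <- droplevels(train[1:100, ])`, which avoids the round trip through character strings.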