
R variable types change after transposing a data frame


I've always been confused by the variable types in R, and now I've run into a problem after transposing a data frame.

For example, I'm using table() to get a count of each factor level in a vector:

data(iris)

count <- as.data.frame(table(iris$Species))
typeof(count$Var1)
# [1] "integer"

typeof(count$Freq)
# [1] "integer"

My 1st question would be, why is count$Var1 "integer"? Can strings be "integer" too? This does not matter much, because I can change the type with count$Var1 <- as.character(count$Var1), after which typeof(count$Var1) becomes "character".

Now I transpose this data frame with transposed_count <- as.data.frame(t(count)). But I get confused because:

typeof(transposed_count[1,])
# [1] "list"

typeof(transposed_count[2,])
# [1] "list"

transposed_count[2,]
#      V1 V2 V3
# Freq 50 50 50

For subsequent use, I need transposed_count[2,] to be a numeric vector like:

transposed_count[2,]
# [1] 50 50 50

How can I do that? And why did they become "list" after t()? Sorry if it's a stupid question. Thanks!


Solution

  • My 1st question would be, why is count$Var1 "integer"?

    Because factors have integer storage type

    > is.factor(count$Var1)
    [1] TRUE
    

    and the "strings" in the iris data.frame are stored as factors: iris$Species is a factor, and as.data.frame() on a table returns the categories as a factor by default.
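As a minimal sketch of what "integer storage" means here (the variable names `codes` and `species_labels` are only illustrative), a factor keeps integer codes plus a levels attribute holding the labels:

```r
data(iris)
count <- as.data.frame(table(iris$Species))

class(count$Var1)    # the class is "factor"
typeof(count$Var1)   # the underlying storage is "integer"
levels(count$Var1)   # the labels live in the levels attribute

# Integer codes that index into the levels
codes <- as.integer(count$Var1)

# Labels recovered from the codes
species_labels <- as.character(count$Var1)
```

So typeof() reports how the values are stored, while class() reports how R treats them.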

    And why did they become "list" after t()?

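    When you transpose you get a matrix, and every entry of a matrix must have the same storage type. What you actually get first is a character matrix, because the integer counts are coerced to strings to match the species names. Then, when you convert that matrix to a data.frame, the character columns are by default coerced to (new) factors in R versions before 4.0.0, as in the output below; from R 4.0.0 on they stay character unless stringsAsFactors = TRUE is set.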

    > t(count)
         [,1]     [,2]         [,3]       
    Var1 "setosa" "versicolor" "virginica"
    Freq "50"     "50"         "50" 
    
    > transposed_count <- as.data.frame(t(count))
    
    > transposed_count[2,1]
    Freq 
      50 
    Levels: 50 setosa
    > as.numeric(transposed_count[2,1])
    [1] 1
    

    So what was a count of 50 now is a factor with a numeric value of 1! Not what you want.
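The coercion and the factor-code pitfall can be reproduced in isolation. This is a small sketch; the toy factor `f` is an assumption built by hand to mimic one transposed column, so it does not depend on the stringsAsFactors default of your R version:

```r
data(iris)
count <- as.data.frame(table(iris$Species))

# t() on a data frame goes through as.matrix(), so mixed columns become character
m <- t(count)
is.matrix(m)
typeof(m)   # "character"

# A hand-built factor mimicking one transposed column (illustrative only)
f <- factor(c("50", "setosa"))

# as.numeric() on a factor returns the level code, not the label
wrong <- as.numeric(f[1])

# Going through as.character() first recovers the number
right <- as.numeric(as.character(f[1]))
```

Here `wrong` is the level code 1, while `right` is the intended value 50.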

    As to why typeof(transposed_count[1,]) is "list": a horizontal slice of a data.frame is itself a data.frame.

    > is.data.frame(transposed_count[2,])
    [1] TRUE
    

    And data.frames are just lists with class information.
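If a plain numeric vector is all that is needed, one possible approach (a sketch, not the only way) is to convert each cell of the row to character and then to numeric, which should work whether the columns ended up as factors or as character. The names `row_chr`, `freq_from_row` and `freq_direct` are illustrative:

```r
data(iris)
count <- as.data.frame(table(iris$Species))
transposed_count <- as.data.frame(t(count))

# Convert each one-element column of the row to character first,
# so factor codes are never mistaken for values
row_chr <- vapply(transposed_count[2, ], as.character, character(1))
freq_from_row <- as.numeric(row_chr)

# Often simpler: skip the transpose and take the original column
freq_direct <- count$Freq
```

Taking `count$Freq` directly avoids the round trip through a character matrix altogether.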

    But how can I get a "transposed" data frame then?

    It sounds like you may want

    > library(reshape2)
    > dcast(melt(count), variable~Var1)
    Using Var1 as id variables
      variable setosa versicolor virginica
    1     Freq     50         50        50
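For comparison, a base R sketch that builds a similar one-row wide data frame without reshape2, assuming the species names are acceptable as column names (`freq` and `wide` are illustrative names):

```r
data(iris)
count <- as.data.frame(table(iris$Species))

# Name each count by its species label
freq <- setNames(count$Freq, as.character(count$Var1))

# t() turns the named vector into a 1-row matrix; the names become column names
wide <- as.data.frame(t(freq))
wide
```

Because the matrix holds only numbers, the columns stay numeric and no factor conversion takes place.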
    

    after I read all samples in, I'm gonna rbind all data frame

    You'll have to ensure the columns line up appropriately. Depending on the analysis to come it may be more natural to rbind as is with another column indicating the source.

    > count2 <- count
    > count$source <- "file1"
    > count2$source <- "file2"
    > (mcount <- rbind(count,count2))
            Var1 Freq source
    1     setosa   50  file1
    2 versicolor   50  file1
    3  virginica   50  file1
    4     setosa   50  file2
    5 versicolor   50  file2
    6  virginica   50  file2
    

    Now you don't have to worry about alignment if you do want to reshape later:

    > dcast(melt(mcount), ...~Var1)
    Using Var1, source as id variables
      source variable setosa versicolor virginica
    1  file1     Freq     50         50        50
    2  file2     Freq     50         50        50
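A base R alternative for the same reshape, sketched with xtabs() under the assumption that each source/species pair appears once (`tab` and `wide` are illustrative names):

```r
data(iris)
count <- as.data.frame(table(iris$Species))
count2 <- count
count$source <- "file1"
count2$source <- "file2"
mcount <- rbind(count, count2)

# Sum Freq for each combination of source (rows) and species (columns)
tab <- xtabs(Freq ~ source + Var1, data = mcount)

# Convert the 2-d table to a wide data frame, one row per source
wide <- as.data.frame.matrix(tab)
wide
```

Species missing from one source would show up as 0 in that row, so the columns line up without manual alignment.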