
(R) Question regarding type coercion when converting a data frame to a matrix in R


Apologies for the rather rudimentary questions, but I haven't been able to find answers easily, and I would like some solid confirmation.

I have a data frame which contains numeric, factor and ordered factor variables. When I converted it to a matrix using as.matrix, I noticed that the elements of the matrix were all character strings. From this experience, I have two questions:

First, am I right in saying that vectors and matrices can only contain one data type, and this is why coercion occurs?

Secondly, and more importantly, which combinations of data types in a data frame lead to a character matrix, a numeric matrix, and so on? For example, if I had just logical, integer and numeric types in my data frame, I imagine I would get a numeric matrix. Is this correct? And is it only the inclusion of factors, ordered factors and/or characters in the data frame that causes every element to be coerced to character when it is converted to a matrix?

Thanks so much for reading, any help is appreciated :]


Solution

  • Answer to your first question: yes and no.

    Actually, a matrix is a vector with a dim attribute.

    A vector usually holds elements of a single data type. A list is the exception: it is a vector of mode "list" whose elements may each have a different type, and a list may also have a dim attribute.

    For instance:

    > is.vector(list(1, "a", T))
    [1] TRUE
    
    > mode(list(1, "a", T))
    [1] "list"
    
    > a <- structure(list(1, "a", T, 1+2i), dim = c(2, 2))
    > is.matrix(a)
    [1] TRUE
    
    > a
         [,1] [,2]
    [1,] 1    TRUE
    [2,] "a"  1+2i
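
    For the atomic (non-list) case, here is a minimal sketch of the same idea. The variable names v, w and x are made up for illustration. It shows c() coercing mixed inputs to one common type, and a plain vector becoming a matrix once it is given a dim attribute:

    ```r
    # Mixing types in an atomic vector forces a single common type.
    v <- c(1L, 2.5, TRUE)   # logical and integer are promoted to double
    typeof(v)               # "double"

    w <- c(1, "a", TRUE)    # anything mixed with a string becomes character
    typeof(w)               # "character"

    # A matrix is just a vector that carries a dim attribute.
    x <- 1:6
    is.matrix(x)            # FALSE
    dim(x) <- c(2, 3)
    is.matrix(x)            # TRUE
    typeof(x)               # "integer"
    ```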
    

    But it's still probably the reason as.matrix is doing coercion: it's much easier to convert everything to a single type and deal with a matrix with elements of a single type.

    However, that is a choice made by as.matrix. It would be possible, though I think inadvisable, to convert a data.frame to a list-matrix while keeping all data types intact.
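
    A rough sketch of what such a conversion could look like, reusing the structure() approach from the example above. The data frame df and its columns x and f are hypothetical, and this is only one way to do it, not something as.matrix provides:

    ```r
    # Hypothetical two-column data frame: one double column, one factor column.
    df <- data.frame(x = c(1.5, 2.5), f = factor(c("a", "b")))

    # Turn each column into a list of length-one elements, then concatenate.
    # Cells are laid out column by column, as in any R matrix.
    cells <- c(as.list(df$x), as.list(df$f))

    # Adding a dim attribute turns the list into a 2 x 2 list-matrix.
    m <- structure(cells, dim = c(nrow(df), ncol(df)),
                   dimnames = list(NULL, names(df)))

    is.matrix(m)            # TRUE
    typeof(m)               # "list"
    is.numeric(m[[1, 1]])   # TRUE: the double is kept as a double
    is.factor(m[[1, 2]])    # TRUE: the factor is kept as a factor
    ```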

    Such a list-matrix would be inefficient. Atomic vectors can be stored in contiguous memory locations, which means that (1) no memory is wasted storing a data type for each element, (2) vectorized code runs faster, and (3) external C or Fortran code, which expects contiguous data of a single type, can work on them directly, whereas dealing with lists there would be cumbersome. I have never seen a list-matrix actually used, though I guess it might help in some circumstances.


    The answer to your second question is in the documentation of as.matrix:

    as.matrix is a generic function. The method for data frames will return a character matrix if there is only atomic columns and any non-(numeric/logical/complex) column, applying as.vector to factors and format to other non-character columns. Otherwise, the usual coercion hierarchy (logical < integer < double < complex) will be used, e.g., all-logical data frames will be coerced to a logical matrix, mixed logical-integer will give a integer matrix, etc.
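
    To make the rule in that passage concrete, here is a small sketch. The data frames num_df and fac_df and their column names are invented for illustration:

    ```r
    # Only logical, integer and double columns: the usual hierarchy applies.
    num_df <- data.frame(l = c(TRUE, FALSE), i = 1:2, d = c(1.5, 2.5))
    typeof(as.matrix(num_df))               # "double"
    typeof(as.matrix(num_df[c("l", "i")]))  # "integer"
    typeof(as.matrix(num_df["l"]))          # "logical"

    # Adding a factor or an ordered factor makes the whole matrix character.
    fac_df <- data.frame(
      d = c(1.5, 2.5),
      f = factor(c("a", "b")),
      o = factor(c("lo", "hi"), levels = c("lo", "hi"), ordered = TRUE)
    )
    typeof(as.matrix(fac_df))               # "character"
    typeof(as.matrix(fac_df[c("d", "o")]))  # "character"
    ```

    So a data frame with only logical, integer and numeric columns does give a numeric (or integer, or logical) matrix, while a single factor, ordered factor or character column is enough to turn every element into a character string.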

    You may also have a look at the source code of as.matrix.data.frame.