
Understanding coercion of factors into characters in an R dataframe


I'm trying to figure out how coercion of factors and data frames works in R. I am trying to plot boxplots for a subset of a data frame. Let's go step by step.

x = rnorm(30, 1, 1)

This creates a vector x of 30 draws from a normal distribution (mean 1, standard deviation 1).

c = c(rep("x1",10), rep("x2",10), rep("x3",10))

This creates a character vector to use later as a factor for plotting boxplots for x1, x2, x3.

df = data.frame(x,c)

This combines x and c into a data frame. So now we would expect class(df) to be "data.frame", df$x to be numeric, and df$c to be a factor (because we passed c into a data frame), and is.data.frame and is.list applied to df should both give us TRUE. (I assume that all data frames are lists as well, which would explain why both checks return TRUE?)

And that's what happens below. All good so far.

class(df)
#[1] "data.frame"
is.data.frame(df)
#[1] TRUE
is.list(df)
#[1] TRUE
class(df$x)
#[1] "numeric"
class(df$c)
#[1] "factor"

Now I plot the spread of x grouped by the levels of c, so the first argument is x ~ c. But I want boxplots for just two levels, x1 and x2, so I used the subset argument of the boxplot function.

boxplot(x ~ c, subset=c %in% c("x1", "x2"), data=df)

This is the plot we get. Notice that because x3 is still a level of the factor, it is still plotted: we get 3 categories on the x-axis of the boxplot in spite of subsetting to 2 categories.

So, one solution I found was to change the class of the df variables to numeric and character:

class(df)<- c("numeric", "character")

boxplot(x ~ c, subset=c %in% c("x1", "x2"), data=df)

New boxplot. This is what we wanted, so it worked! We plotted boxes for just x1 and x2 and got rid of x3.

But if we run the same checks that we ran before this coercion, on all variables, we get these outputs.

Anything funny?

class(df)
#[1] "numeric"   "character"
is.data.frame(df)
#[1] FALSE
is.list(df)
#[1] TRUE
class(df$x)
#[1] "numeric"
class(df$c)
#[1] "factor"

Notice that df$c (the second variable, containing the categories x1, x2, x3) is still a factor!

And df stopped being a list (so was it ever a list?)

And what exactly did we do with class(df) <- c("numeric", "character"), if not change the data type of df$c?

So, to sum up, my questions (tl;dr version):

  • Are all data frames also lists in R?

  • Why did our boxplot drop x3 in the 2nd case (when we coerced class(df) into numeric and character)?

  • If we did coerce the factor into characters by doing the above steps, why does it still show that the variable's class is factor?

  • And why did df stop being a data frame after we did the above steps?


Solution

  • The answers make more sense if we take your questions in a different order.

    Are all data frames also lists in R?

    Yes. A data frame is a list of vectors (the columns).
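
    A minimal sketch of that claim, reusing the question's x, c and df. The stringsAsFactors = TRUE argument is an assumption added here so that the column is a factor as in the question, whatever the R version:

    ```r
    # Same construction as in the question. stringsAsFactors = TRUE is set
    # explicitly (an assumption) so that c becomes a factor on any R version.
    x <- rnorm(30, 1, 1)
    c <- c(rep("x1", 10), rep("x2", 10), rep("x3", 10))
    df <- data.frame(x, c, stringsAsFactors = TRUE)

    typeof(df)               # "list": the underlying storage is a list
    is.list(df)              # TRUE
    length(df)               # 2: one list element per column
    names(df)                # "x" "c"
    identical(df[["x"]], x)  # TRUE: each list element is a column vector
    ```

    The "data.frame" class sits on top of that list, which is why both is.data.frame(df) and is.list(df) return TRUE.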

    And why did df stop being a list after we did the above steps?

    It didn't. It stopped being a data frame, because you changed the class with class(df) <- c("numeric", "character"). is.list(df) still returns TRUE.

    If we did coerce the factor into characters by doing the above steps, why does it still show that the variable's class is factor?

    class(df) operates on the df object itself, not the columns. Look at str(df). The factor column is still a factor. The assignment class(df) <- c("numeric", "character") set the class attribute of the data frame object itself to that character vector; it never touched the columns.
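
    The difference between the class of the container and the classes of its columns can be sketched as follows. The name broken is made up for this illustration, and stringsAsFactors = TRUE is an assumption added so the column is a factor as in the question:

    ```r
    x <- rnorm(30, 1, 1)
    c <- c(rep("x1", 10), rep("x2", 10), rep("x3", 10))
    df <- data.frame(x, c, stringsAsFactors = TRUE)

    class(df)          # "data.frame": the class of the container
    sapply(df, class)  # x is "numeric", c is "factor": the column classes

    # Overwriting the container's class (done on a copy here) leaves the
    # columns as they were; 'broken' is a hypothetical name.
    broken <- df
    class(broken) <- c("numeric", "character")
    is.data.frame(broken)  # FALSE
    class(broken$c)        # "factor"

    # Converting a column means assigning to that column.
    df$c <- as.character(df$c)
    sapply(df, class)  # x is "numeric", c is "character"
    class(df)          # still "data.frame"
    ```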

    Why did our boxplot drop x3 in the 2nd case (when we coerced class(df) into numeric and character)?

    You've messed up your data frame object by explicitly setting the class attribute of the object to a vector c("numeric", "character"). It's hard to predict the full effects of this. My best guess is that boxplot or the functions that draw the axes accessed the class attribute of the data frame somehow.
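
    For contrast with the first plot: a factor keeps all of its levels after subsetting, which is consistent with the empty x3 category still showing up there. A small sketch with made-up vectors f, kept and ch:

    ```r
    # f, kept and ch are hypothetical names used only for this illustration.
    f <- factor(c(rep("x1", 10), rep("x2", 10), rep("x3", 10)))
    kept <- f[f %in% c("x1", "x2")]

    levels(kept)  # "x1" "x2" "x3": the unused level is still there
    table(kept)   # counts are 10, 10 and 0

    # A character vector has no levels, so nothing is left over.
    ch <- as.character(f)
    unique(ch[ch %in% c("x1", "x2")])  # "x1" "x2"
    ```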

    To do what you really wanted:

    x = rnorm(30, 1, 1)
    c = c(rep("x1",10), rep("x2",10), rep("x3",10))
    df = data.frame(x,c)
    df$c <- as.character(df$c)  # convert the factor column to character
    

    or

    x = rnorm(30, 1, 1)
    c = c(rep("x1",10), rep("x2",10), rep("x3",10))
    # keep strings as character (this is the default since R 4.0.0)
    df = data.frame(x,c, stringsAsFactors=FALSE)
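
    If the grouping column should stay a factor, another option (not part of the answer above) is to subset the rows first and then drop the unused levels with droplevels(). This is only a sketch: stringsAsFactors = TRUE is spelled out as an assumption so that c is a factor on any R version, df_sub and bp are names made up here, and plot = FALSE is used so that boxplot() returns its statistics without drawing.

    ```r
    x <- rnorm(30, 1, 1)
    c <- c(rep("x1", 10), rep("x2", 10), rep("x3", 10))
    df <- data.frame(x, c, stringsAsFactors = TRUE)

    # Subset the rows, then drop factor levels that no longer occur.
    df_sub <- droplevels(df[df$c %in% c("x1", "x2"), ])
    levels(df_sub$c)  # "x1" "x2"

    # plot = FALSE returns the box statistics without opening a plot.
    bp <- boxplot(x ~ c, data = df_sub, plot = FALSE)
    bp$names          # "x1" "x2"
    ```

    Dropping plot = FALSE draws the two boxes.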