I'm trying to figure out how coercion of factors and data frames works in R. I want to plot boxplots for a subset of a data frame. Let's go step by step.
x = rnorm(30, 1, 1)
This creates a vector x of 30 draws from a normal distribution with mean 1 and standard deviation 1.
c = c(rep("x1",10), rep("x2",10), rep("x3",10))
This creates a character vector to use later as a factor for plotting boxplots for x1, x2, x3.
df = data.frame(x,c)
This combines x and c into a data frame. So now we would expect the class of df to be data.frame, of df$x to be numeric, and of df$c to be factor (because we passed c into a data frame), and is.data.frame and is.list applied to df should both give TRUE. (I assumed that all data frames are lists as well, and that's why we get TRUE for both checks.)
And that's what happens below. All good so far.
class(df)
#[1] "data.frame"
is.data.frame(df)
#[1] TRUE
is.list(df)
#[1] TRUE
class(df$x)
#[1] "numeric"
class(df$c)
#[1] "factor"
Now I plot the spread of x grouped by the levels of the factor c, so the first argument is x ~ c. But I want boxplots for just two levels, x1 and x2, so I used the subset argument of the boxplot function.
boxplot(x ~ c, subset=c %in% c("x1", "x2"), data=df)
This is the plot we get. Notice that because x3 is still a level of the factor, it is still plotted, i.e. we still get 3 categories on the x-axis of the boxplot in spite of subsetting to 2 categories.
So, one solution I found was to change the class of the df variables to numeric and character:
class(df)<- c("numeric", "character")
boxplot(x ~ c, subset=c %in% c("x1", "x2"), data=df)
But if we rerun the same checks as before the coercion, on all variables, we get the outputs below. Anything funny?
class(df)
#[1] "numeric" "character"
is.data.frame(df)
#[1] FALSE
is.list(df)
#[1] TRUE
class(df$x)
#[1] "numeric"
class(df$c)
#[1] "factor"
Notice that df$c (the second variable, containing the categories x1, x2, x3) is still a factor!
And df stopped being a list (so was it ever a list?).
And what exactly did class(df) <- c("numeric", "character") do, if it didn't change the data type of df$c?
My questions, for the tl;dr version:
Are all data frames also lists in R?
Why did our boxplot drop x3 in the second case (when we coerced class(df) into numeric and character)?
If we did coerce the factor into characters by doing the above steps, why does it still show that the variable's class is factor?
And why did df stop being a data frame after we did the above steps?
The answers make more sense if we take your questions in a different order.
Are all data frames also lists in R?
Yes. A data frame is a list of vectors (the columns).
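A small sketch of what "a data frame is a list" means in practice. The example values are made up for illustration, and stringsAsFactors = TRUE is set explicitly as an assumption: since R 4.0.0, data.frame() no longer converts strings to factors by default, so this is needed to match the output shown in the question.

```r
# Illustrative data, not the question's random sample
df <- data.frame(x = c(0.5, 1.2, 2.1),
                 c = c("x1", "x2", "x3"),
                 stringsAsFactors = TRUE)  # explicit: the default changed in R 4.0.0

typeof(df)                  # the underlying storage type is "list"
is.list(df)                 # TRUE
inherits(df, "data.frame")  # TRUE: the class attribute marks it as a data frame

# List tools work column by column
length(df)                  # number of columns
col_classes <- sapply(df, class)
col_classes                 # class of each column
```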
And why did df stop being a list after we did the above steps?
It didn't. It stopped being a data frame, because you changed the class with class(df) <- c("numeric", "character"). is.list(df) still returns TRUE.
If we did coerce the factor into characters by doing the above steps, why does it still show that the variable's class is factor?
class(df) operates on the df object itself, not on its columns. Look at str(df). The factor column is still a factor. The assignment class(df) <- c("numeric", "character") set the class attribute of the data frame object itself to that vector.
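The effect can be checked on a copy. This is a minimal sketch with made-up data; the names broken and repaired are introduced here for illustration, and stringsAsFactors = TRUE is an assumption needed to get a factor column on R 4.0.0 or later.

```r
df <- data.frame(x = c(0.5, 1.2, 2.1),
                 c = c("x1", "x2", "x3"),
                 stringsAsFactors = TRUE)

broken <- df                                # work on a copy
class(broken) <- c("numeric", "character")  # overwrites only the object's class attribute

names(attributes(broken))  # names and row.names are still there alongside class
is.data.frame(broken)      # FALSE: the "data.frame" class is gone
is.list(broken)            # TRUE: the underlying list is unchanged
class(broken$c)            # "factor": the column was never touched

# Putting the class attribute back makes it a data frame again
repaired <- broken
class(repaired) <- "data.frame"
is.data.frame(repaired)    # TRUE
```

Changing a column's type requires assigning to the column itself, as in the as.character() call on df$c shown further down.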
Why did our boxplot drop x3 in the second case (when we coerced class(df) into numeric and character)?
You've messed up your data frame object by explicitly setting its class attribute to the vector c("numeric", "character"). It's hard to predict the full effects of this. My best guess is that boxplot, or the functions that draw the axes, accessed the class attribute of the data frame somehow.
To do what you really wanted:
x = rnorm(30, 1, 1)
c = c(rep("x1",10), rep("x2",10), rep("x3",10))
df = data.frame(x,c)
df$c <- as.character(df$c)
or
x = rnorm(30, 1, 1)
c = c(rep("x1",10), rep("x2",10), rep("x3",10))
df = data.frame(x,c, stringsAsFactors=FALSE)
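If the goal is only the two-box plot, another option (not part of the answer above, offered as a sketch) is to keep the factor and drop its unused levels after subsetting. The set.seed call and the names kept and two_groups are additions for illustration, and stringsAsFactors = TRUE is an assumption to reproduce the factor column on R 4.0.0 or later.

```r
set.seed(1)  # only so the illustrative data are reproducible
df <- data.frame(x = rnorm(30, 1, 1),
                 c = rep(c("x1", "x2", "x3"), each = 10),
                 stringsAsFactors = TRUE)

# Subsetting rows does not shrink a factor's set of levels
kept <- df[df$c %in% c("x1", "x2"), ]
levels(kept$c)   # "x1" "x2" "x3"
table(kept$c)    # x3 has a count of 0 but is still a level

# droplevels() removes levels that no longer occur in the data
two_groups <- droplevels(kept)
levels(two_groups$c)   # "x1" "x2"

boxplot(x ~ c, data = two_groups)
```

This explains the empty x3 slot in the first plot: the subset removed the x3 rows, but the level itself stayed attached to the factor.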