r dplyr subset

In R, "$ operator is invalid for atomic vectors" error after subsetting


I have a dataframe which I subset into three. In the original dataframe I can split the data further on a variable, but once I have subset it I can no longer do this; I get the error "$ operator is invalid for atomic vectors". I am unclear why this is the case. Does anyone have any ideas?

I cannot really provide a minimal reproducible example but below is the code used.

# Original dataset = CT_variable_Biom
library(dplyr)  # provides %>%, mutate() and case_when()

##First splitting into three categories
CT_variable_Biom <- CT_variable_Biom %>%
  mutate(
    level_of_risk = case_when(
      high_risk == 1 ~ "high",
      medium_risk == 1 ~ "medium",
      low_risk == 1 ~ "low",
      TRUE ~ NA_character_  
    )
  )

medium_risk <- subset(CT_variable_Biom, CT_variable_Biom$level_of_risk=="medium")
high_risk <- subset(CT_variable_Biom, CT_variable_Biom$level_of_risk=="high")
low_risk <- subset(CT_variable_Biom, CT_variable_Biom$level_of_risk=="low")

#Split based on level

#This one works as normal
False_Negatives_overall <- subset(CT_variable_Biom , CT_variable_Biom$Biomarker<0.25)
#This one returns $ operator is invalid for atomic vectors
False_Negatives <- subset(medium_risk, medium_risk$Biomarker<0.25)

I assume that something in my subsetting into the three categories is causing this, but I am not sure what.

Many thanks.


Solution

  • The problem is that your dataframe has a column named medium_risk. When you evaluate

    False_Negatives <- subset(medium_risk, medium_risk$Biomarker<0.25)
    

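    the subset() function needs to evaluate medium_risk$Biomarker. It searches the columns of the dataframe before looking for the global variable medium_risk, and it finds the column. That column is an atomic vector, not a dataframe, so applying $ to it gives the error. The overall version works because CT_variable_Biom is not a column name, so the lookup falls through to the global dataframe. In this case the simplest fix is the one suggested by @clp, i.e. just use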

    False_Negatives <- subset(medium_risk, Biomarker<0.25)
    

    In this expression subset() will be looking for Biomarker, and it finds that column.

    This is an example of why the ?subset documentation says "This is a convenience function intended for use interactively. For programming it is better to use the standard subsetting functions like [, and in particular the non-standard evaluation of argument subset can have unanticipated consequences."

    The standard way to do this would be

    False_Negatives <- medium_risk[medium_risk$Biomarker<0.25, ]
    

    and that is unambiguous, because only standard evaluation is used.

    Personally I prefer using subset(), but I try to avoid using any variables that aren't columns of the dataframe. It's not always possible to do that; when things are complicated or when I don't have control of the column names, it's better to follow the advice from the documentation.
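
For anyone who wants to reproduce this, here is a minimal sketch of the name clash. The data frame df and its values are made up for illustration and only stand in for CT_variable_Biom; base R's ifelse() is used instead of case_when() so that the snippet needs no packages.

```r
# Hypothetical toy data standing in for CT_variable_Biom (values are invented)
df <- data.frame(
  high_risk   = c(1, 0, 0, 0),
  medium_risk = c(0, 1, 1, 0),
  low_risk    = c(0, 0, 0, 1),
  Biomarker   = c(0.10, 0.20, 0.40, 0.30)
)
df$level_of_risk <- ifelse(df$high_risk == 1, "high",
                    ifelse(df$medium_risk == 1, "medium",
                    ifelse(df$low_risk == 1, "low", NA_character_)))

# A global variable that shares its name with a column of the data frame
medium_risk <- subset(df, level_of_risk == "medium")

# Inside subset(), medium_risk resolves to the numeric column first,
# so medium_risk$Biomarker applies `$` to an atomic vector and fails
res <- try(subset(medium_risk, medium_risk$Biomarker < 0.25), silent = TRUE)
failed <- inherits(res, "try-error")

# Fix 1: refer to the column by its bare name
fix1 <- subset(medium_risk, Biomarker < 0.25)

# Fix 2: standard evaluation with `[`, where medium_risk is always the global object
fix2 <- medium_risk[medium_risk$Biomarker < 0.25, ]
```

Here failed should be TRUE, while fix1 and fix2 should both contain the single medium-risk row with Biomarker below 0.25. Giving the subsets names that are not also column names (for example medium_risk_df, a hypothetical name) would avoid the clash as well.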