Tags: r, dplyr, filtering, data-cleaning, summarization

How can I mutate a column to return a value if all observations of a certain level have the same value?


I previously asked how to create a column identifying rows whose levels consist only of new data in a data frame/tibble using the dplyr package in R (How can I create a column identifying rows consisting only of new data in a summarized data frame?). The answers there were very helpful. However, when I went to plot the data, I found a problem with my approach. In the raw data, several rows contain errors in the data reported from the parent data set, which I fixed by going back to the raw data recordings. To record these fixes, I added new rows with the corrected data from the original data sheets and kept the old rows, flagging them with the identifier column drop.

The problem is that I set up a filter to drop the erroneous rows before I summarise the data. As a result, when I create the new column inside summarise, a species that is not new, but whose summary values no longer include any unaltered observations, is returned as new. This makes it look as if far more completely new species are being added to the data set than there really are.

Here is a short reproducible example:

library(dplyr)

df <- data.frame(
  species = c("Species1", "Species1", "Species2", "Species2", "Species3", "Species3", "Species4", "Species4"),
  new = c(TRUE, TRUE, TRUE, FALSE, FALSE, TRUE, FALSE, FALSE),
  var = c(1, 1, 2, 2, 8, 3, 4, 4),
  drop = c(FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE)
)
df2 <- df %>%
  filter(drop != TRUE) %>%       # remove the superseded (erroneous) rows
  group_by(species) %>%
  summarise(var = mean(var))     # one summary row per species
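
To make the problem concrete, here is a minimal sketch of what happens when new is summarised after the filter. The name df_problem is introduced only for this illustration, and the data are the same as in the example above.

```r
library(dplyr)

# Same example data as above
df <- data.frame(
  species = c("Species1", "Species1", "Species2", "Species2",
              "Species3", "Species3", "Species4", "Species4"),
  new  = c(TRUE, TRUE, TRUE, FALSE, FALSE, TRUE, FALSE, FALSE),
  var  = c(1, 1, 2, 2, 8, 3, 4, 4),
  drop = c(FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE)
)

# Filtering first removes the only non-new row of Species3,
# so all(new) only sees the corrected row for that species.
df_problem <- df %>%
  filter(!drop) %>%
  group_by(species) %>%
  summarise(new = all(new), var = mean(var))

df_problem$new
# Species3 is flagged TRUE here, although it has an original non-new row
```

Species3 has an original observation with new = FALSE, so the value I want for it is FALSE.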

What I want to produce is a dataset like the following.

data.frame(species=c("Species1","Species2","Species3","Species4"),
new=c("TRUE","FALSE","FALSE","FALSE"),
var=c(1,2,3,4))

I tried adding a mutate call before the filter call to try and produce such a column, as shown in the code below.

df2<-df%>%
  group_by(species)%>%
  mutate(new2=(all(new)==TRUE))%>%
  filter(drop!=TRUE)%>%
  group_by(species)%>%
  summarise(var=mean(var),new=all(new2))

However, the first group_by call doesn't seem to work in this case, and I get the same results as in the code above. I am trying to figure out how to return a certain value in a column when all observations within a level share the same value. I think that if I can get that, I can then use the solution from the previous question to produce the column for the summarized data frame.

I realize that this is similar to my previous question. However, this issue is about creating the column without using summarise, rather than about how to select values within the summarise function, and adding this new information would invalidate the answers to my previous question, so I submitted it as a separate question.


Solution

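  • We can compute all(new) by group with mutate before the filter, so that the dropped rows still count toward the flag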

    library(dplyr)
    df %>%
        group_by(species) %>%
        mutate(new = all(new)) %>% 
        filter(!drop) %>%
        summarise(new = all(new), var = mean(var))
    

    -output

    # A tibble: 4 x 3
    #  species  new     var
    #  <chr>    <lgl> <dbl>
    #1 Species1 TRUE      1
    #2 Species2 FALSE     2
    #3 Species3 FALSE     3
    #4 Species4 FALSE     4
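
A possible alternative, offered only as a sketch and not as part of the original answer: since summarise() evaluates its expressions within each group, the flag can be computed over every row while the mean is taken only over the rows that are kept, which avoids the separate mutate and filter steps. The name df_alt is hypothetical. This assumes every species keeps at least one row with drop == FALSE; a species whose rows are all dropped would get NaN for var instead of disappearing from the result.

```r
library(dplyr)

# Same example data as in the question
df <- data.frame(
  species = c("Species1", "Species1", "Species2", "Species2",
              "Species3", "Species3", "Species4", "Species4"),
  new  = c(TRUE, TRUE, TRUE, FALSE, FALSE, TRUE, FALSE, FALSE),
  var  = c(1, 1, 2, 2, 8, 3, 4, 4),
  drop = c(FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE)
)

df_alt <- df %>%
  group_by(species) %>%
  summarise(
    new = all(new),          # uses every row, including the dropped ones
    var = mean(var[!drop])   # uses only the rows that are kept
  )

df_alt
```

On the example data this gives the same new and var values as the output shown above.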