Search code examples
rdplyrtidyversedata-cleaningmutate

R - data cleaning and mutate errors


I have reviewed Error: Problem with mutate() column (...) must be size 15 or 1, not 17192, How to drop columns with column names that contain specific string?, Remove columns that contain a specific word, and associated error troubleshooting.

I have a large dataset with viral data for different species in different areas - sample data below

Country    ..2  Area    Site    ID      Species Sample    Original Sample/Specimen #
<chr>     <lgl> <chr>   <chr>   <chr>   <chr>   <chr>    <chr>
Tanzania    NA  UMNP    UMNPhq  AATPH   PG     Feces    AATPHF2 
Tanzania    NA  UMNP    UMNPhq  AATPI   PG     Feces    AATPIF2 
Tanzania    NA  UMNP    UMNPhq  AATPJ   PG     Feces    AATPJF2 
Tanzania    NA  UMNP    UMNPhq  ATTPK   PG     Feces    ATTPKF2 
Tanzania    NA  UMNP    UMNPhq  AATPL   PG     Feces    AATPLF2 

Filovirus (MOD) PCR  Date (Filo MOD)
<chr>                <date>
Indeterminant        2015-03-16
Indeterminant        2015-03-16
Indeterminant        2015-03-16
Indeterminant        2015-03-16
Negative             2015-03-16

I am trying to recode a viral status, positive or negative, for every sample id (just filovirus here, but there's a lot of them, so please help code more generally)

Code I've tried - first subsetting data to only include a specific area

viral <- subset(data, Area %in% "UMNP")

Here I got rid of unwanted columns and then was able to get infection status, but it converted all other information on the sample to "NA" causing additional error codes when I try to maintain the values.

viralres <- viral %>% 
     dplyr::select(-matches(c('Performed by ()', 'performed by', 'Date of', '1Performed by', 'Performed by', "Date ()", "...2"),)) %>%
    mutate_if(is.character, ~case_when(. == "Indeterminant" ~ "0", 
                                       . == "Negative" ~ "0", 
                                       . == "Positive" ~ "1"))

Dput

structure(list(Country = c("Tanzania", "Tanzania", "Tanzania", 
"Tanzania", "Tanzania"), ...2 = c(NA, NA, NA, NA, NA), Area = c("UMNP", 
"UMNP", "UMNP", "UMNP", "UMNP"), Site = c("UMNPhq", "UMNPhq", 
"UMNPhq", "UMNPhq", "UMNPhq"), `Animal ID` = c("AATPH", "AATPI", 
"AATPJ", "ATTPK", "AATPL"), Species = c("Procolobus gordonorum", 
"Procolobus gordonorum", "Procolobus gordonorum", "Procolobus gordonorum", 
"Procolobus gordonorum"), `Sample Type` = c("Feces", "Feces", 
"Feces", "Feces", "Feces"), `Original Sample/Specimen #` = c("AATPHF2", 
"AATPIF2", "AATPJF2", "ATTPKF2", "AATPLF2"), `Filovirus (MOD) PCR` = c("Indeterminant", 
"Indeterminant", "Indeterminant", "Indeterminant", "Negative"
), `Date (Filo MOD)` = structure(c(16510, 16510, 16510, 16510, 
16510), class = "Date")), row.names = c(NA, -5L), class = c("tbl_df", 
"tbl", "data.frame"))

Solution

  • Using mutate_if(is.character, ...) will change all of your character columns. It looks like the only column you are trying to change is "Filovirus (MOD) PCR". So you could change the command to

    viral %>% 
      dplyr::select(-matches(c('Performed by ()', 'performed by', 'Date of', '1Performed by', 'Performed by', "Date ()", "...2"),)) %>%
      mutate(across(`Filovirus (MOD) PCR`, ~case_when(. == "Indeterminant" ~ "0", 
                                         . == "Negative" ~ "0", 
                                         . == "Positive" ~ "1")))
    

    for the least amount of change. That way you are only changing that column. Alternatively you could more directly mutate that single column using case_match

    viral %>% 
      dplyr::select(-matches(c('Performed by ()', 'performed by', 'Date of', '1Performed by', 'Performed by', "Date ()", "...2"),)) %>%
      mutate(`Filovirus (MOD) PCR` = case_match(`Filovirus (MOD) PCR`,"Indeterminant" ~ "0", 
                                         "Negative" ~ "0", 
                                         "Positive" ~ "1"))
    

    Note that case_match was introduced in dplyr 1.1.0