Search code examples
rdataframedplyrtidyverse

Replacing values of selected columns based on another dataframe with different size


I performed CV calculation over numeric columns of a data frame grouped by a specific column values. Then based on resulting dataframe, I replaced the related column values in first data frame. I did the operation by mixing dplyr functions, merge function, and for loop with if statement. I would like to ask you for a shorter operations of tidyverse functions if any. Here are the steps and example outcome.

I had a data frame- A

Accession Column1 Column2 Column3 Root ID
2000_1    0       0.2     14      2000  1
2000_2    0.01    0.2     17      2000  2
2001_1    0.012   0.22    11      2001  1
2001_2    0.011   0.231   17      2001  2

For this demo I have three numerical columns Column1, Column2, and Column3, but in actual data it is > 500. I subset the data frame to keep only numeric columns (Column1, Column2, Column3, and Root) except ID column. Then I grouped the rows based on column "Root" and calculated the coefficient of variance as a second data frame by using aggregate function/also using tibble functions groupby/summarize_if/ungroup. I replaced the values with NA if they are greater than 30. Here is the resulting dataframe- B

 Column1 Column2 Column3 Root
 NA      0       13.68   2000
 6.14    3.44     NA     2001 

I merged data frame A and B by using merge function based on Root column. Resulting data frame was

Column1.x Column2.x Column3.x Accession Column1.y Column2.y Column3.y Root ID 
    NA       0        13.68     2000_1    0       0.2        14       2000  1
    NA       0        13.68     2000_2    0.01    0.2        17       2000  2
    6.14    3.44      NA        2001_1    0.012   0.22       11       2001  1
    6.14    3.44      NA        2001_2    0.011   0.231      17       2001  2

Then I looped through numeric columns (Column 1, 2, 3). For Root column value 2000, I replaced Column1.y values (0 and 0.01) with Column1.x values are NA. For Root column value 2001, I didn't replace Column1.y values are not NA.

Resulting data frame was

    Accession Column1.y Column2.y Column3.y Root ID 
    2000_1    NA      0.2        14       2000  1
    2000_2    NA      0.2        17       2000  2
    2001_1    0.012   0.22       NA       2001  1
    2001_2    0.011   0.231      NA       2001  2

Solution

  • Data:

    dfa <- read.table(text="Accession Column1 Column2 Column3 Root ID
    2000_1    0       0.2     14      2000  1
    2000_2    0.01    0.2     17      2000  2
    2001_1    0.012   0.22    11      2001  1
    2001_2    0.011   0.231   17      2001  2", header = T)
    

    Libraries and Functions:

    library(tidyverse)
    
    cv <- function(x) 100 * (sd(x) / mean(x))
    

    Solution:

    If we cut to the chase and consider the end result, basically you want to replace the values in Column1:Column3 with NA if CV is greater than 30. Otherwise, you want to preserve the original values. The code below does that.

    dfa %>% 
      group_by(Root) %>% 
      mutate_at(vars(Column1:Column3),
                list(~ if(cv(.) > 30) NA else .)) 
    

    Result:

    #> # A tibble: 4 x 6
    #>   Accession Column1 Column2 Column3  Root    ID
    #>   <fct>       <dbl>   <dbl>   <dbl> <int> <int>
    #> 1 2000_1     NA       0.2        14  2000     1
    #> 2 2000_2     NA       0.2        17  2000     2
    #> 3 2001_1      0.012   0.22       NA  2001     1
    #> 4 2001_2      0.011   0.231      NA  2001     2
    


    More complicated approaches:

    If we want to follow your train of thoughts, then we'll end up with a more complicated code which is illustrated below;

    dfa %>% 
      select_if(function(col) is.numeric(col) & all(col != .$ID)) %>% 
      group_by(Root) %>% 
      summarise_each(list(cv)) %>% 
      mutate_at(vars(Column1:Column3),
                list(~ ifelse(. > 30, NA, 0))) %>% 
      left_join(dfa[,c("Root", "ID")], . , by = "Root") %>% 
      bind_rows(dfa, .) %>% 
      group_by(Root, ID) %>% 
      summarise_each(list(~ if(is.numeric(.)) sum(., na.rm = FALSE) else first(.))) %>% 
      ungroup %>% 
      select(-ID, -Root, everything())
    

    Explanation:

    1. selecting numeric columns except ID.
    2. grouping by Root.
    3. Calculating CV for all the columns.
    4. Replacing CV values greater than 30 with NA and the rest with 0. I am planing to sum these with the original values as it seems that OP is interested in preserving the NAs (i.e. greater than 30) from this CV matrix but keep the other values unchanged in the original dataset. So summing with 0 keep the latter unchanged while those NAs (na.rm = F) will affect the values.
    5. Adding ID column back by joining to make the CV matrix the same size (rowwise) of the original dataset. Moreover, it will be used for grouping later.
    6. Binding the datasets by rows.
    7. grouping by Root and ID.
    8. Summarizing numeric columns (i.e. Column1, Column2, etc.) by summing the values from original dataframe and modified CV matrix and keeping the first value from other columns (since the original dataframe was first in bind_rows that means preserving the original values).
    9. Ungrouping to avoid future conflicts.
    10. Rearranging columns in the order that OP presented.

    Another solution would be very similar to what's above, but instead of joining to get the ID column and expand the CV matrix, one can preserve them from the beginning by summarizing as list column and later unnesting them.

    dfa %>% 
      mutate(ID = as.factor(ID)) %>% 
      group_by(Root) %>% 
      summarise_each(list(~ if(is.numeric(.)) cv(.) else list(.))) %>% 
      mutate_at(vars(Column1:Column3),
                list(~ ifelse(. > 30, NA, 0))) %>% 
      unnest(cols = c(Accession, ID)) %>% 
      mutate(ID = as.integer(ID)) %>% 
      bind_rows(dfa, .) %>% 
      group_by(Root, ID) %>% 
      summarise_each(list(~ if(is.numeric(.)) sum(., na.rm = FALSE) else first(.))) %>% 
      ungroup %>% 
      select(-ID, -Root, everything())