Tags: r, for-loop, dplyr, subset

Filtering or subsetting within a for loop


I am trying to write a for loop to check whether the relative abundances within each group of observations add up to 100. In the simplified example below, I want to check whether all the relative abundance (RelAb) values associated with batch A1 add up to 100.

Batch Reads RelAb
A1 28431 72.94
A1 10549 27.06
B1 19315 85.96
B1 3155 14.04

If I were to check each batch one by one, I would have to repeat the following code and change the batch name each time.

test.batch <- data.batch %>%
  dplyr::filter(Batch == "A1")
sum(test.batch$RelAb)

I was able to get values of 100 for each batch I checked manually, but I didn't want to repeat the same line of code again and again.

So I tried writing a for loop:

Batches <- c("A1", "A2", "A3", "A4", "B1", "B2", "B3", "B4", "B5", "B6", "B7")
for(i in Batches) {
  filtered.batch <- data.batch %>%
     dplyr::filter(Batch %in% Batches)
  print(sum(filtered.batch$RelAb))
}

The loop ran, but the sum printed on each iteration was 1100 rather than 100:

[1] 1100
[1] 1100
[1] 1100
[1] 1100
[1] 1100
[1] 1100
[1] 1100
[1] 1100
[1] 1100
[1] 1100
[1] 1100

Incidentally, the Batches vector has length 11, but I'm not sure how or why the correct result of 100 ended up multiplied by 11.

I also tried base R subsetting instead of dplyr::filter but got the same result as above.

for(i in Batches) {
  filtered.batch <- data.batch[data.batch$Batch %in% Batches, ]
  print(sum(filtered.batch$RelAb))
}

I'm sure a very simple solution would solve this issue (which is not even urgent because repeating a line of code 11 times isn't the biggest problem), but I'm very curious how this could be fixed so I can write correct code in the future. Thanks!


Solution

  • library(tidyverse)
    
    df <- read_table("Batch Reads   RelAb
    A1  28431   72.94
    A1  10549   27.06
    B1  19315   85.96
    B1  3155    14.04")
    
    
    df %>%  
      summarise(sum = sum(RelAb), 
                threshold = sum(RelAb) >= 100, 
                .by = Batch)
    
    # A tibble: 2 x 3
      Batch   sum threshold
      <chr> <dbl> <lgl>         
    1 A1      100 TRUE          
    2 B1      100 TRUE
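
As for why the original loop printed 1100 every time: the filter never uses the loop variable i. Batch %in% Batches is TRUE for every row on every pass, so each iteration sums all 11 batches at once (11 x 100 = 1100). If a loop is still preferred, a minimal base R sketch of the fix could look like the following. The data frame here is a hypothetical stand-in for data.batch built from the four example rows, and batch.sums and sums.ok are names introduced only for illustration. With dplyr, the equivalent change would be dplyr::filter(Batch == i).

```r
# Hypothetical stand-in for data.batch, built from the four rows in the question
data.batch <- data.frame(
  Batch = c("A1", "A1", "B1", "B1"),
  Reads = c(28431, 10549, 19315, 3155),
  RelAb = c(72.94, 27.06, 85.96, 14.04)
)

# Loop over the batches actually present instead of a hand-typed vector
Batches <- unique(data.batch$Batch)

# Named numeric vector that collects one sum per batch
batch.sums <- setNames(numeric(length(Batches)), Batches)

for (i in Batches) {
  # Compare against the loop variable i, not the whole Batches vector
  filtered.batch <- data.batch[data.batch$Batch == i, ]
  batch.sums[i] <- sum(filtered.batch$RelAb)
}

print(batch.sums)

# Compare with a small tolerance, since sums of decimal values may not be exactly 100
sums.ok <- abs(batch.sums - 100) < 1e-6
print(sums.ok)
```

Storing the sums in a named vector rather than only printing them makes it easy to pick out any batch that fails the check, for example with names(batch.sums)[!sums.ok]. The tolerance of 1e-6 is an arbitrary choice for this sketch. If the RelAb values are rounded to two decimals, a looser tolerance may be more appropriate.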