Determining if one value occurs once in a row of columns, but a second value doesn't occur at all

Probably a terrible title, but I have a table of qualifiers stored as "1", "2", and "3". What I'm trying to do is look in each row (approximately 300,000 rows, but variable) and find the rows where a single "3" occurs (if it occurs more than once, I am not interested in it) and the rest of the columns in that row have a "1", then return those rows as a list. (The number of columns and the column names change based on the input files.)

Instinctively, I want to attempt this with nested for loops that index the row count and then the column count, plus some function that looks for one "3" and no "2"s, which likely means the preferred way would be some apply function, correct?

Another thought was to total the number of columns, add 2, and then sum the row while requiring that no 2s appear in the row. But that seemed pretty complicated.
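
To make the arithmetic behind that idea concrete, here is a small sketch using two made-up rows of four columns (the vectors `good` and `bad` are illustrative, not taken from the real data). It suggests that the row total alone cannot tell the rows apart, so a second check on the number of 3s is still needed:

```r
n_cols <- 4
target <- n_cols + 2            # same as 3 + (n_cols - 1): one 3 plus a 1 everywhere else

good <- c(3, 1, 1, 1)           # exactly one 3, all others 1
bad  <- c(2, 2, 1, 1)           # no 3 at all, but the same total

sum_ok_good <- sum(good) == target
sum_ok_bad  <- sum(bad) == target   # also TRUE, so the sum alone is not enough

# Adding the "exactly one 3" condition separates the two rows.
both_good <- sum_ok_good && sum(good == 3) == 1
both_bad  <- sum_ok_bad  && sum(bad == 3) == 1
```

With both conditions in place, the remaining columns can only sum to `n_cols - 1` if every one of them is a 1, which is what excludes the 2s.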

  seq                        loc   Ball   Cat   Square   Water
1 AAAAAACCAGTCCCAGTTCGGATTG  t       3     1      1       1  
2 AAAAAACCAGTCTCAGTTCGGATTG  b       1     1      3       3
3 AAAAAACCAGTCTCAGTTCGGATTG  t       1     3      2       1
4 AAAAAACCGGTCACAGTTCAGATTG  b       1     1      1       2
5 AAAAAACCGGTCACAGTTCAGATTG  t       1     1      3       1

Expected Output:
  seq                        loc     Group   

dput of df1:
structure(list(seq = structure(c(1L, 2L, 2L, 3L, 3L), .Label = 
loc = structure(c(2L, 1L, 2L, 1L, 2L), .Label = c("b", 
"t"), class = "factor"), Ball = c("3", "1", "1", "1", "1"
), Cat = c("1", "1", "3", "1", "1"), Square = c("1", "3", 
"2", "1", "3"), Water = c("1", "3", "1", "2", "1")), row.names = c(NA, 
-5L), class = c("tbl_df", "tbl", "data.frame"))


  • Here's a base R solution without tidyverse; apart from an lapply for the type conversion, it needs no *apply functions. First, let's convert those four columns to integers:

    cols <- 3:6
    df1[cols] <- lapply(df1[cols], as.integer)

    Then filter the rows and add the Group column:

    df <- df1[rowSums(df1[cols]) == (3 + length(cols) - 1) & rowSums(df1[cols] == 3) == 1, ]
    df$Group <- names(df)[cols][which(t(df[cols]) == 3, arr.ind = TRUE)[, 1]]
    df
    # A tibble: 2 x 7
    #   seq                       loc    Ball   Cat Square Water Group 
    #   <fct>                     <fct> <int> <int>  <int> <int> <chr> 
    # 1 AAAAAACCAGTCCCAGTTCGGATTG t         3     1      1     1 Ball  
    # 2 AAAAAACCGGTCACAGTTCAGATTG t         1     1      3     1 Square

    In the first line I select the rows using two conditions: exactly one element in the cols columns must equal 3 (rowSums(df1[cols] == 3) == 1), and the row total must be 3 + length(cols) - 1, that is, one 3 plus a 1 in each of the remaining columns. Together, these two conditions rule out any 2s. Then, in the second line, I check which column holds the 3 and take the corresponding name from names(df) as the value for Group.
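
As a self-contained variant of the same idea, the sketch below counts the "3"s and "1"s directly instead of using the row total, so the columns can stay as character. The data frame is rebuilt by hand from the table in the question (the dput there is truncated), with plain strings instead of factors, and the names `m`, `keep`, and `res` are only illustrative. It also assumes, as in the answer, that the qualifier columns are columns 3 to 6.

```r
# Data typed in from the table in the question (assumption: plain character columns).
df1 <- data.frame(
  seq = c("AAAAAACCAGTCCCAGTTCGGATTG", "AAAAAACCAGTCTCAGTTCGGATTG",
          "AAAAAACCAGTCTCAGTTCGGATTG", "AAAAAACCGGTCACAGTTCAGATTG",
          "AAAAAACCGGTCACAGTTCAGATTG"),
  loc    = c("t", "b", "t", "b", "t"),
  Ball   = c("3", "1", "1", "1", "1"),
  Cat    = c("1", "1", "3", "1", "1"),
  Square = c("1", "3", "2", "1", "3"),
  Water  = c("1", "3", "1", "2", "1"),
  stringsAsFactors = FALSE
)

cols <- 3:6
m <- as.matrix(df1[cols])   # character matrix of the qualifier columns

# Keep rows with exactly one "3" and a "1" in every other qualifier column.
keep <- rowSums(m == "3") == 1 & rowSums(m == "1") == length(cols) - 1

res <- df1[keep, c("seq", "loc")]
# For each kept row, find the column that holds the single "3".
is_three <- m[keep, , drop = FALSE] == "3"
res$Group <- colnames(m)[max.col(is_three, ties.method = "first")]
res
```

On the five example rows this keeps rows 1 and 5, with Group set to Ball and Square, matching the output shown in the answer. Because everything is done with whole-matrix comparisons and rowSums, it should scale to a few hundred thousand rows without explicit loops.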