Search code examples
rdata.tabletidyverse

Classification of rows/individuals based on their column output in an incidence matrix


I wrote an R function to classify rows (individuals) based on the columns output in an incidence matrix M5 for the following requirements:

M5 <- structure(c(1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1), dim = c(5L, 3L), 
                dimnames = list(c("1", "2", "3", "4", "5"), NULL))

Requirements:

Suppose Column 1 and Column 2 are negative (N) and Column 3 is positive (P) in M5. Then:

• Since Column 1 is negative, individuals 1 and 2 are confirmed negative (CN).

• Since Column 2 is negative, individuals 2, 3, and 4 are confirmed negative (CN).

• Since Column 3 is positive, at least one of individuals 4 or 5 is positive. However, individual 4 is already confirmed negative from Column 1, so the positive case must be individual 5 so individual 5 is confirmed positive (CP).

Suppose Column 1 is negative (N) and Column 2 and Column 3 are positive (P) in M5. Then:

• Since Column 1 is negative, individuals 1 and 2 are confirmed negative (CN).

• Since Column 2 is positive, at least one of individuals 2, 3, or 4 is positive. However, individual 2 is already confirmed negative from Column 1, so the positive case must be among individuals 3 and/or 4.

• Since Column 3 is positive, at least one of individuals 4 and 5 is positive.

• Therefore, individuals 1 and 2 are confirmed negative (CN), while individuals 3, 4, and 5 are suspected positive (SP).

library(tidyverse)
library(data.table,  warn.conflicts = FALSE, quietly = TRUE)

GTA <- 
  function(incidence_matrix, test_results) {
  incidence_dt <- as.data.table(incidence_matrix, keep.rownames = "individual")

  determine_status <- function(row, results) {
    if (any(row == 1 & results == "N")) {
      return("Confirm Negative (CN)")
    }
    pos_pools <- which(row == 1 & results == "P")
    neg_pools <- which(row == 1 & results == "N")
  
    if (length(pos_pools) > 0) {
      if (length(neg_pools) == 0) {
        return("Confirm Positive (CP)")
        } else {
          return("Suspected Positive (SP)")
        }
      }
    return("Undetermined")
    }
  
  # Apply the function to each individual
  incidence_dt <- 
    incidence_dt %>% 
    rowwise() %>% 
    mutate(status = determine_status(c_across(starts_with("V")), test_results)) %>% 
    ungroup()
  
  # Identify unique suspected positives in each pool and update their status
  incidence_dt <- 
    incidence_dt %>% 
    pivot_longer(cols = starts_with("V"), names_to = "pool", values_to = "value") %>% 
    group_by(pool) %>% 
    mutate(unique_sp = if_else(value == 1 & status == "Suspected Positive (SP)", 
                               sum(status == "Suspected Positive (SP)") == 1, FALSE)) %>% 
    ungroup() %>% 
    mutate(status = if_else(unique_sp & status == "Suspected Positive (SP)", 
                            "Confirm Positive (CP)", status)) %>% 
    select(individual, status) %>% 
    distinct()
  
  # Update status based on confirmed negatives (CN)
  incidence_dt <- 
    incidence_dt %>%
    rowwise() %>%
    mutate(status = case_when(
      status == "Confirm Positive (CP)" & any(incidence_dt$status == "Confirm Negative (CN)") ~ "Suspected Positive (SP)",
      TRUE ~ status
    )) %>%
    ungroup()
  
  # Convert to data.table
  incidence_dt <- as.data.table(incidence_dt)
  
  # Create columns for CN, CP, SP
  status_dt <- 
    incidence_dt %>% 
    mutate(CN = if_else(status == "Confirm Negative (CN)", 1, 0),
           CP = if_else(status == "Confirm Positive (CP)", 1, 0),
           SP = if_else(status == "Suspected Positive (SP)", 1, 0)) %>% 
    select(-status)
  
  # Return the result
  return(status_dt)
}

# Example usage
M5 <- structure(c(1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1), dim = c(5L, 3L), 
                dimnames = list(c("1", "2", "3", "4", "5"), NULL))



GTA(
    incidence_matrix = M5
  , test_results     = c("N", "N", "P")
  )
#>    individual    CN    CP    SP
#>        <char> <num> <num> <num>
#> 1:          1     1     0     0
#> 2:          2     1     0     0
#> 3:          3     1     0     0
#> 4:          4     1     0     0
#> 5:          5     0     0     1


GTA(
    incidence_matrix = M5
  , test_results     = c("N", "P", "P")
  )
#>    individual    CN    CP    SP
#>        <char> <num> <num> <num>
#> 1:          1     1     0     0
#> 2:          2     1     0     0
#> 3:          3     0     0     1
#> 4:          4     0     0     1
#> 5:          5     0     0     1

Problems

  1. My function GTA in GTA(M5, test_results = c("N", "N", "P")) classify individual 5 as SP whereas it should be CP.
  2. Any hints to improve this function.

Edited

For GTA(M5, test_results = c("P", "N", "P")), the individuals 2, 3, and 4 should be classified as CN, and the individuals 1, and 5 should be classified as CP:

  1. Since Column (Test) 2 is N, individuals 2, 3 and 4 are confirmed negative (CN).
  2. Column (Test) 1 is P, at least one of individuals 1, or 2 is positive. However, individual 2 is already confirmed negative from Column 2, so the positive case must be among individual 1. Thus individual 1 is CP.
  3. Column (Test) 3 is P, at least one of individuals 4, or 5 is positive. However, individual 4 is already confirmed negative from Column 2, so the positive case must be among individual 5. Thus individual 5 is CP.

Solution

  • Edited after clarification

    From my understanding of your function, you set out the following rules:

    1. A negative test ALWAYS results in an individual being confirmed negative
    2. For each positive test, if only one person has a positive test with no negative tests, then they are confirmed positive
    3. For each positive test, if more than one person has a positive test with no negative tests, then all these individuals are suspected positive.
    4. All the rows will contain at least one 1

    If we follow these rules to their logical conclusion, we can get our CN column by just check if each individual had at least one negative test.

    Because we will essentially always discard the positive results of any positive test if an individual has had at least one negative. We will multiply each positive test column by the opposite of CN (getting rid of any positives for individuals with a negative test result).

    From here we will check for each test if only one individual had a positive test without any negatives. If so, then they are confirmed positive. If not, then all those individuals are SP.

    At the end we return our results. The one case where this doesn't work is all negative tests, but we can just return CN = 1 for every individual if that happens.

    Lastly, we can use the gtools permutation function to get all combinations of tests and row apply through those permutations. Get the output for every permutation of N and P.

    M5 <- structure(c(1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1), dim = c(5L, 3L), 
                    dimnames = list(c("1", "2", "3", "4", "5"), NULL))
    
    GTA <- function(incidence_matrix, test_results) {
      num_ind <- nrow(incidence_matrix)
      CP <- rep(0L, num_ind)
      SP <- rep(0L, num_ind)
    
      if (!("P" %in% test_results)) {
        CN <- rep(1L, num_ind)
        return(data.frame(individual = dimnames(incidence_matrix)[[1]], CN, CP, SP))
      }
      CN <- as.integer(rowSums(incidence_matrix[, test_results == "N", drop = FALSE]) != 0)
      pos_tests <- incidence_matrix[, test_results == "P", drop = FALSE] * !CN
    
      for (i in 1:ncol(pos_tests)) {
        if (sum(pos_tests[, i]) == 1) {
          CP[pos_tests[, i] == 1] = 1L
        } else if (sum(pos_tests[, i]) > 1) {
          SP[pos_tests[, i] == 1] = 1L
        }
      }
    
      data.frame(individual = dimnames(incidence_matrix)[[1]], CN, CP, SP)
    }
    
    test_results <- gtools::permutations(c("P", "N"), n = 2, r = 3, repeats.allowed = TRUE)
    test_results
    #>      [,1] [,2] [,3]
    #> [1,] "N"  "N"  "N" 
    #> [2,] "N"  "N"  "P" 
    #> [3,] "N"  "P"  "N" 
    #> [4,] "N"  "P"  "P" 
    #> [5,] "P"  "N"  "N" 
    #> [6,] "P"  "N"  "P" 
    #> [7,] "P"  "P"  "N" 
    #> [8,] "P"  "P"  "P"
    
    apply(test_results, 1, function(test_results) GTA(M5, test_results))
    #> [[1]]
    #>   individual CN CP SP
    #> 1          1  1  0  0
    #> 2          2  1  0  0
    #> 3          3  1  0  0
    #> 4          4  1  0  0
    #> 5          5  1  0  0
    #> 
    #> [[2]]
    #>   individual CN CP SP
    #> 1          1  1  0  0
    #> 2          2  1  0  0
    #> 3          3  1  0  0
    #> 4          4  1  0  0
    #> 5          5  0  1  0
    #> 
    #> [[3]]
    #>   individual CN CP SP
    #> 1          1  1  0  0
    #> 2          2  1  0  0
    #> 3          3  0  1  0
    #> 4          4  1  0  0
    #> 5          5  1  0  0
    #> 
    #> [[4]]
    #>   individual CN CP SP
    #> 1          1  1  0  0
    #> 2          2  1  0  0
    #> 3          3  0  0  1
    #> 4          4  0  0  1
    #> 5          5  0  0  1
    #> 
    #> [[5]]
    #>   individual CN CP SP
    #> 1          1  0  1  0
    #> 2          2  1  0  0
    #> 3          3  1  0  0
    #> 4          4  1  0  0
    #> 5          5  1  0  0
    #> 
    #> [[6]]
    #>   individual CN CP SP
    #> 1          1  0  1  0
    #> 2          2  1  0  0
    #> 3          3  1  0  0
    #> 4          4  1  0  0
    #> 5          5  0  1  0
    #> 
    #> [[7]]
    #>   individual CN CP SP
    #> 1          1  0  0  1
    #> 2          2  0  0  1
    #> 3          3  0  0  1
    #> 4          4  1  0  0
    #> 5          5  1  0  0
    #> 
    #> [[8]]
    #>   individual CN CP SP
    #> 1          1  0  0  1
    #> 2          2  0  0  1
    #> 3          3  0  0  1
    #> 4          4  0  0  1
    #> 5          5  0  0  1
    

    Created on 2024-06-24 with reprex v2.1.0