Search code examples
rdataframeoverlap

Find rows in common between columns belonging to specific groups using R


I have a data frame of 8 columns and 6 rows. All rows are numeric and contains positive and negative values.

I have another file with 8 columns and associated group name (A, B, C or D) for each column. The first 2 columns are group A, the following 2 columns are group B, the next 2 columns are group C, and the last 2 columns are group D.

I want a code in R that allows me to calculate for each column, how many of the rows with values above 0 (X > 0) in a column have also values above 0 (x > 0) in the other columns. And the output should be grouped based on the four groups. Something like this:

Data file:

    S1  S2  S3  S4  S5  S6  S7  S8
R1  2   1   2  -1  -3   5   4  -3 
R2  4  -6   1   2   1   2   1   5
R3  3   2  -3  -9  -5  -1   4   9
R4  4  -4  -4  -6   4  -7   6   6
R5  6  -5   2   2  -7  -6   7  -6
R6  4   4  -3   3  -2   3  -4   2

Group file:

S1        S2       S3       S4       S5      S6      S7     S8
GroupA   GroupA  GroupB   GroupB   GroupC  GroupC  GroupD  GroupD

Expected output file

           S1   S2   S3  S4  S5  S6  S7  S8
Group A    6    3    3   3   2   3   5   4
Group B    4    2    3   3   1   3   3   2
Group C    4    1    2   2   2   3   3   3
Group D    6    3    3   3   2   3   5   5

Explanation for the values obtained in the expected output file:

Example 1: S1 and GroupB

The value obtained is 4, this is because S1 has values greater than 0 in all 6 rows, while R1, R2, R5 and R6 are greater than 0 in at least one of the samples of group B (S3 and S4).

Example 2: S3 and GroupD

The value obtained is 3, this is because S3 has values greater than 0 in R1, R2 and R5, and the rows are also greater than 0 in at least one of the samples of group D (S7 and S8).


Solution

  • I hope I understood correctly what you want to achieve.

    df <- read.table(
      text = "    S1  S2  S3  S4  S5  S6  S7  S8
    R1  2   1   2  -1  -3   5   4  -3 
    R2  4  -6   1   2   1   2   1   5
    R3  3   2  -3  -9  -5  -1   4   9
    R4  4  -4  -4  -6   4  -7   6   6
    R5  6  -5   2   2  -7  -6   7  -6
    R6  4   4  -3   3  -2   3  -4   2"
    )
    
    groupings_new <- matrix(c(rep("Group A", 2), rep("Group B", 2), rep("Group C", 2), rep("Group D", 2)), byrow = TRUE, nrow = 1)
    colnames(groupings_new) <- paste0("S", 1:8)
    
    # get TRUE/FALSE if a value is greater than 0
    info_df <- df > 0
    
    res <- lapply(seq_len(ncol(info_df)), function(i) {
      grouping_info <- groupings_new[, , drop = TRUE]
      
      # check if a value in the matrix and in the column of interest are both greater
      # than 0
      compare_df <- info_df & matrix(rep(info_df[, i], ncol(info_df)), nrow = nrow(info_df))
      
      # split by groups
      res <- lapply(unique(grouping_info), function(one_group) {
        group_index <- grouping_info == one_group
        # check which rows are of interest (values greater than 0) and how many are
        # there
        sum(rowSums(compare_df[, group_index, drop = FALSE]) > 0)
      })
      res_clean <- data.frame(unlist(res))
      colnames(res_clean) <- colnames(info_df[, i, drop = FALSE])
      rownames(res_clean) <- unique(grouping_info)
      res_clean
    })
    
    do.call(cbind, res)
    #>         S1 S2 S3 S4 S5 S6 S7 S8
    #> Group A  6  3  3  3  2  3  5  4
    #> Group B  4  2  3  3  1  3  3  2
    #> Group C  4  2  2  2  2  3  3  3
    #> Group D  6  3  3  3  2  3  5  4
    

    Created on 2023-02-27 by the reprex package (v1.0.0)

    My output is not completely the same as your expected output because I think there are a few mistakes.