Search code examples
rmatrixdiagonal

Identify all cells above or below a diagonal in an asymmetric matrix in R


I have looked around for solutions to this problem and the closest I found was: Get upper triangular matrix from nonsymmetric matrix, but this did not work for me.

I have a matrix in R with 2 columns and over 3000 rows.

head(diag_calc)
        X            Y
[1,] 0.4991733   0.05358506
[2,] 1.1758962   0.70707194
[3,] 0.2197383  -0.00148791
[4,] 0.6389240   0.24411083
[5,] 0.8708275   0.16959840
[6,] 0.9784328   0.10341456

When I plot them against each other they look like this:

enter image description here

I want to identify all the rows containing points on either extreme of the diagonals. I have tried labeling points by being in the 3rd quartile of X and 1st quartile of Y and colored them orange. I did the reverse and colored them purple. However, this metric does not capture the true biological variability in my system and it appears that identifying cells at the extremes off a diagonal (which starts at the inflection point of the quartile that I labeled) would provide a better result.

I have tried using diag, upper.tri, and lower.tri from base R, but these do not work, I think due the the asymmetrical nature of my matrix. Diag does work to calculate the inflection point where each diagonal line passes through though. As such:

diag_calc <- Ad_SF7_fc_scored_NK %>%
      select(one_of("X", "Y")) %>%
      as.matrix(.) 

diag(diag_calc) -> diag_test

diag_test
[1] 0.4991733 0.7070719

I can get the other inflection point by swapping the X and Y variables when generating my matrix.

Does anyone have a solution or advice on potential approaches to use?

Thanks!


Solution

  • Here is a way to proceed making an assumption about how you defined your diagonal lines. First create reproducible data and get the quantiles:

    set.seed(42)
    X <- rnorm(500, 1.5, .5)
    Y <- rnorm(500)
    Xq <- quantile(X)
    Yq <- quantile(Y)
    df <- data.frame(X, Y)
    

    Now plot the data and identify a line that passes through the lower left quantile intersection and the upper right quantile intersection. Then use the slope to identify parallel lines that pass through the upper left and lower right intersections:

    plot(X~Y, df, pch=20)
    abline(v=Yq[2:4], lty=3)
    abline(h=Xq[2:4], lty=3)
    diag <- lm(Xq[c(2, 4)]~Yq[c(2, 4)])
    points(Yq[c(2, 4)], Xq[c(2, 4)], cex=2, col="red", lwd=2)
    abline(diag)
    b <- coef(diag)[2]
    a1 <- Xq[4] - b * Yq[2]
    a2 <- Xq[2] - b * Yq[4]
    abline(a1, b)
    abline(a2, b)
    

    Plot 1

    Now identify the points above and below these two lines:

    res1 <- X - (a1 + b * Y)
    res2 <- (a2 + b * Y) - X
    clr <- c("black", "purple", "darkorange")
    idx <- ifelse(res1 > 0, 3, ifelse(res2 > 0, 2, 1))
    plot(X~Y, pch=20, col=clr[idx])
    abline(a1, b, col="red")
    abline(a2, b, col="red")
    

    Plot 2

    Finally add the identification of the outliers to the data:

    position <- c("inside", "below", "above")
    df$outlier <- position[idx]
    head(df)
    #           X            Y outlier
    # 1  2.185479  1.029140719  inside
    # 2  1.217651  0.914774868   below
    # 3  1.681564 -0.002456267  inside
    # 4  1.816431  0.136009552  inside
    # 5  1.702134 -0.720153545  inside
    # 6  1.446938 -0.198124330  inside
    # 7  2.255761 -1.029208806   above
    # 8  1.452670 -0.966955896  inside
    # 9  2.509212 -1.220813089   above
    # 10 1.468643  0.836207704  inside