Search code examples
rdataframecsvdatamatrix

How would I identify which columns and rows match between two data matrices?


I have two data matrices of different dimensions stored as objects in R (I am using Rstudio with R v4.0.2 in Windows 10):

m1 = 1 column x 44 rows (this is a list of names with no spaces).

m2 = 500,000 columns x 164 rows (this contains a string of characters, the first row being a list of names).

I want to check how many (and which) of the rows of m1 are found in m2 (meaning it will be anywhere between 0 and 44). The end goal is that I have 4000 different matrices that will substitute the place of m2, and I need to see the extent of missing entries (found in m1) in all of the m2s (i.e., I am looking at the extent of missing data of those 44 names).

I am still a beginner to R, so apologies if my description is a bit off.

I tried storing each matrix, saved as CSV files, as such:

m1 <- read.csv("names-file.csv")

m2 <- read.csv("data-file.csv")

and tried to use the row.match function in the prodlim package, and ran row.match(m1, m2) but only got numeric values. I am looking to see just a number of how many of the names from m1 (first column) are found in m2 (first column), which values those are, and what the percentage would be (x out of 44).

As an example: m1 =

Tom
Harry
Cindy
Megan
Jack

`

m2 =

Tom       XXXXXXXXXXXX----XXXXXXXX
Stephanie XXXXXXXXXXXXXXXX----XXXX
Megan     XXXXXXXXXXXXXXXXXXXXXXXX
Ryan      XXXXXXXXXXXXXXXXXXXXXX-X
David     XXXXXX---XXXXXXXXXXXXXXX
Josh      XXXXXXXXXXXXXXXXXXXXXXXX

In the m2 matrix, each name is column 1, and the each subsequent X (which represents either an A, T, C, or G) are the subsequent columns (so some columns have an A, T, C, or G, or a "-"). I am looking to write a code that would see how many of the names from m1 and found in m2 (and conversely, how much data is missing from m2 as a percentage). In this case, the desired outputs would be:

2

Tom
Megan

60% 

Here are my specific datafile using dput() (please let me know if I am using dput() correctly): m1:

structure(list(V1 = c("Taxon1", "Taxon2", "Taxon3", "Taxon4", 
"Taxon5", "Taxon6", "Taxon7", "Taxon8")), class = "data.frame", row.names = c(NA, 
-8L))

m2:

structure(list(V1 = c("Taxon1", "Taxon3", "Taxon4", "Taxon6", 
"Taxon7", "Taxon9", "Taxon10", "Taxon11", "Taxon12", "Taxon13", 
"Taxon14", "Taxon15", "Taxon16", "Taxon17", "Taxon18", "Taxon19", 
"Taxon20", "Taxon21", "Taxon22", "Taxon23", "Taxon24", "Taxon25", 
"Taxon26", "Taxon27", "Taxon28", "Taxon29", "Taxon30"), V2 = c("A", 
"A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", 
"A", "A", "A", "A", "A", "A", "C", "C", "C", "C", "C", "C", "C"
), V3 = c("G", "G", "G", "G", "G", "C", "C", "G", "G", "G", "G", 
"G", "G", "G", "G", "G", "G", "G", "G", "G", "G", "G", "G", "G", 
"G", "G", "G"), V4 = c("C", "C", "C", "C", "C", "T", "G", "C", 
"C", "C", "C", "C", "C", "C", "C", "C", "C", "C", "C", "C", "C", 
"C", "C", "C", "C", "C", "C"), V5 = c("T", "T", "G", "T", "G", 
"G", "G", "T", "T", "T", "T", "T", "T", "T", "T", "T", "T", "T", 
"T", "T", "T", "T", "T", "T", "T", "T", "T"), V6 = c("G", "G", 
"C", "G", "C", "C", "C", "G", "G", "G", "G", "G", "G", "G", "G", 
"G", "G", "G", "G", "G", "G", "G", "G", "G", "G", "G", "G"), 
    V7 = c("C", "C", "A", "C", "A", "A", "A", "C", "C", "C", 
    "C", "C", "C", "C", "C", "C", "G", "G", "G", "G", "G", "G", 
    "G", "G", "G", "G", "G"), V8 = c("T", "T", "A", "T", "A", 
    "A", "A", "T", "T", "T", "T", "T", "T", "T", "T", "T", "T", 
    "T", "T", "T", "T", "T", "T", "T", "T", "T", "T"), V9 = c("A", 
    "A", "A", "A", "A", "T", "T", "A", "A", "A", "A", "A", "A", 
    "A", "A", "A", "A", "T", "T", "T", "T", "T", "T", "T", "T", 
    "T", "T")), class = "data.frame", row.names = c(NA, -27L))

Thank you!


Solution

  • You might want to have a look at the %in% operator in R. According to your question, you might want something like this:

    m1[,1] %in% m2[,1]
    #[1]  TRUE FALSE  TRUE  TRUE FALSE  TRUE  TRUE FALSE
    

    You can then pair it with functions such as mean or sum which will help you to find the percentage as required:

    sum(m1[,1] %in% m2[,1])
    #[1] 5
    mean(m1[,1] %in% m2[,1])
    #[1] 0.625
    

    EDIT: As required by the OP in the comments of this post, there are various methods for that, my personal favourite being the which function:

    m1[which(m1[,1] %in% m2[,1]),]
    #[1] "Taxon1" "Taxon3" "Taxon4" "Taxon6" "Taxon7"
    m1[which(!(m1[,1] %in% m2[,1])),]
    #[1] "Taxon2" "Taxon5" "Taxon8"
    

    Again, this is only one method, out of many (I can count 3 right now...), so I suggest you to explore the other options...