r, matrix, reshape2

Create a two-variable matrix whose values are the frequencies of a third variable


Using these data:

sample  Wu.gene bin other
N1  BA00001 Wolbachia   dontcare6
W30 BA00002 Entomo  dontcare4
N1  BA00002 Rhizobiales dontcare7
N15 BA00002 Rhizobiales dontcare6
W30 BA00004 Bacteriodetes   dontcare1
N15 BA00004 Bacteriodetes   dontcare2
W30 BA00005 Alistepes   dontcare1
N15 BA00005 Alistepes   dontcare1
N15 BA00006 Alistepes   dontcare1
W30 BA00006 Rumino  dontcare6
W30 BA00007 Wolbachia   dontcare6
W30 BA00015 Bacteriodetes   dontcare1
N1  BA00015 Rhizobiales2    dontcare6
N15 BA00015 Wolbachia   dontcare6
N1  BA00016 Entomo  dontcare3
W30 BA00016 Entomo  dontcare5
W30 BA00017 Alistepes   dontcare1
W30 BA00018 Rumino  dontcare6
N15 BA00019 Wolbachia   dontcare6
N15 BA00020 Rhizobiales dontcare6
N15 BA00021 Rhizobiales2    dontcare6
N15 BA00022 Entomo  dontcare6
N1  BA00025 Alistepes   dontcare1
W30 BA00025 Rhizobiales dontcare6
W30 BA00025 Rhizobiales dontcare6
N15 BA00025 Wolbachia   dontcare6
N1  BA00026 Rumino  dontcare6
N15 BA00026 Wolbachia   dontcare6
W30 BA00027 Rhizobiales2    dontcare6
N15 BA00031 Wolbachia   dontcare6
N15 BA00033 Wolbachia   dontcare6
N15 BA00033 Wolbachia   dontcare6
N15 BA00033 Wolbachia   dontcare6

I have been trying to create a matrix using the reshape2 library and its dcast function.

The idea is to make a "bin" ~ "Wu.gene" matrix from the fake data (https://www.mediafire.com/file/qv9tdnnvwac6xfe/fake_data/file), but to use the "sample" column as the matrix value. Let me explain:

If you look at the fake.data table, the Wu.gene "BA00033" occurs 3 times in the bin "Wolbachia", and all 3 occurrences are within the same "N15" sample. However, the Wu.gene "BA00016" occurs 2 times in the bin "Entomo", but in 2 different samples: "N1" and "W30".

I can easily construct a Wu.gene ~ bin matrix that shows the number of times a Wu.gene is in the same bin (regardless of whether it is in the same sample or not):

bin BA00016 BA00033
Entomo  2   0
Wolbachia   0   3

but I cannot specify that I instead want a matrix showing the number of different samples in which it occurs, which would look something like this:

bin BA00016 BA00033
Entomo  2   0
Wolbachia   0   1

I tried

library(reshape2)
fake <- read.table(fake_data, header = TRUE)
dcast(data=fake, formula=bin ~ Wu.gene, value.var = "sample")

but it keeps giving me the number of occurrences of each Wu.gene ~ bin pair, and I don't know how to specify that I want it to look into the "sample" column for the values.

Any help will be greatly appreciated!


Solution

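  • I think you can use fun.aggregate to pass the function that is applied to each cell, which in this case would be uniqueN, i.e. count the unique values. Without it, dcast falls back to length, which is why you get the number of rows instead.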

    library(data.table)
    dcast(setDT(fake), bin ~ Wu.gene, value.var = "sample", 
          fill = 0, fun.aggregate = uniqueN)
    

    Or using pivot_wider (n_distinct comes from dplyr, so it is namespaced here):

    tidyr::pivot_wider(fake, names_from = Wu.gene, values_from = sample, 
                       values_fn = dplyr::n_distinct, id_cols = bin, values_fill = 0)
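
To check the logic without installing any packages, here is a minimal base R sketch. It is not one of the two approaches above but a cross-check of them: it assumes only the five rows for BA00016 and BA00033 from the question (typed in by hand, not read from the file) and contrasts plain row counts with counts of distinct samples. The names row_counts, dedup and sample_counts are made up for this illustration.

```r
# Hand-typed subset of the question's data: only the two genes it discusses.
fake <- data.frame(
  sample  = c("N1", "W30", "N15", "N15", "N15"),
  Wu.gene = c("BA00016", "BA00016", "BA00033", "BA00033", "BA00033"),
  bin     = c("Entomo", "Entomo", "Wolbachia", "Wolbachia", "Wolbachia"),
  stringsAsFactors = FALSE
)

# Row counts per bin/gene pair: every occurrence is counted,
# even when it comes from the same sample.
row_counts <- table(bin = fake$bin, Wu.gene = fake$Wu.gene)

# Distinct samples per bin/gene pair: drop repeated
# (sample, Wu.gene, bin) rows first, then cross-tabulate.
dedup <- unique(fake[, c("sample", "Wu.gene", "bin")])
sample_counts <- table(bin = dedup$bin, Wu.gene = dedup$Wu.gene)

print(row_counts)
print(sample_counts)
```

On this subset, the Wolbachia/BA00033 cell should drop from 3 in row_counts to 1 in sample_counts, while Entomo/BA00016 stays at 2, which matches the desired matrix in the question.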