Using these data:
sample Wu.gene bin other
N1 BA00001 Wolbachia dontcare6
W30 BA00002 Entomo dontcare4
N1 BA00002 Rhizobiales dontcare7
N15 BA00002 Rhizobiales dontcare6
W30 BA00004 Bacteriodetes dontcare1
N15 BA00004 Bacteriodetes dontcare2
W30 BA00005 Alistepes dontcare1
N15 BA00005 Alistepes dontcare1
N15 BA00006 Alistepes dontcare1
W30 BA00006 Rumino dontcare6
W30 BA00007 Wolbachia dontcare6
W30 BA00015 Bacteriodetes dontcare1
N1 BA00015 Rhizobiales2 dontcare6
N15 BA00015 Wolbachia dontcare6
N1 BA00016 Entomo dontcare3
W30 BA00016 Entomo dontcare5
W30 BA00017 Alistepes dontcare1
W30 BA00018 Rumino dontcare6
N15 BA00019 Wolbachia dontcare6
N15 BA00020 Rhizobiales dontcare6
N15 BA00021 Rhizobiales2 dontcare6
N15 BA00022 Entomo dontcare6
N1 BA00025 Alistepes dontcare1
W30 BA00025 Rhizobiales dontcare6
W30 BA00025 Rhizobiales dontcare6
N15 BA00025 Wolbachia dontcare6
N1 BA00026 Rumino dontcare6
N15 BA00026 Wolbachia dontcare6
W30 BA00027 Rhizobiales2 dontcare6
N15 BA00031 Wolbachia dontcare6
N15 BA00033 Wolbachia dontcare6
N15 BA00033 Wolbachia dontcare6
N15 BA00033 Wolbachia dontcare6
I have been trying to create a matrix using the reshape2 library and its dcast function.
The idea is to make a "bin" ~ "Wu.gene" matrix (data: "https://www.mediafire.com/file/qv9tdnnvwac6xfe/fake_data/file"), but to use "sample" as the matrix value. Let me explain:
If you look at the fake.data table, the Wu.gene "BA00033" occurs 3 times in the bin "Wolbachia", and all 3 times are within the same "N15" sample. However, the Wu.gene "BA00016" occurs 2 times in the bin "Entomo", but in 2 different samples: "N1" and "W30".
I can easily construct a Wu.gene ~ bin matrix that shows the number of times a Wu.gene is in the same bin (regardless of whether it is in the same sample or not):
bin BA00016 BA00033
Entomo 2 0
Wolbachia 0 3
but I cannot work out how to get, instead, a matrix showing the number of distinct samples in which each Wu.gene occurs in a bin, which would look something like this:
bin BA00016 BA00033
Entomo 2 0
Wolbachia 0 1
I tried:
fake <- read.table(fake_data, header = TRUE)
dcast(data=fake, formula=bin ~ Wu.gene, value.var = "sample")
but it keeps giving me the number of occurrences of Wu.gene ~ bin, and I don't know how to specify that I want it to look into the "sample" column for the values.
Any help will be greatly appreciated!
I think you can use fun.aggregate to pass a function to apply, which in this case would be uniqueN, i.e. to count unique values.
library(data.table)
dcast(setDT(fake), bin ~ Wu.gene, value.var = "sample",
fill = 0, fun.aggregate = uniqueN)
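As a quick check, here is a minimal sketch on just the BA00016 and BA00033 rows from the question. The `mini` data frame is a hand-typed subset made up for illustration (it is not the full fake_data file), and it assumes the data.table package is installed.

```r
library(data.table)

# Hand-typed subset of the question's rows (hypothetical helper data):
# BA00016 appears in two samples, BA00033 three times in one sample.
mini <- data.frame(
  sample  = c("N1", "W30", "N15", "N15", "N15"),
  Wu.gene = c("BA00016", "BA00016", "BA00033", "BA00033", "BA00033"),
  bin     = c("Entomo", "Entomo", "Wolbachia", "Wolbachia", "Wolbachia")
)

# uniqueN counts the distinct samples in each bin x Wu.gene cell;
# empty cells are filled with 0.
res <- dcast(as.data.table(mini), bin ~ Wu.gene, value.var = "sample",
             fill = 0, fun.aggregate = uniqueN)
print(res)
```

On this subset, the Entomo/BA00016 cell should be 2 (two samples) and the Wolbachia/BA00033 cell should be 1 (one sample, repeated three times), which matches the matrix asked for in the question.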
Or using pivot_wider (n_distinct comes from dplyr):
tidyr::pivot_wider(fake, names_from = Wu.gene, values_from = sample,
                   values_fn = dplyr::n_distinct, id_cols = bin, values_fill = 0)
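If you would rather avoid extra packages, one possible base R sketch of the same idea is to drop repeated (sample, Wu.gene, bin) rows first and then cross-tabulate. This is a different technique from the aggregate-function approach, shown on a hand-typed subset of the question's rows (`mini` is a made-up name for illustration).

```r
# Hand-typed subset of the question's rows (hypothetical helper data)
mini <- data.frame(
  sample  = c("N1", "W30", "N15", "N15", "N15"),
  Wu.gene = c("BA00016", "BA00016", "BA00033", "BA00033", "BA00033"),
  bin     = c("Entomo", "Entomo", "Wolbachia", "Wolbachia", "Wolbachia")
)

# Keep one row per (sample, Wu.gene, bin) combination, so repeats
# within the same sample are counted only once.
dedup <- unique(mini[, c("sample", "Wu.gene", "bin")])

# Cross-tabulate bin (rows) against Wu.gene (columns).
tab <- table(bin = dedup$bin, Wu.gene = dedup$Wu.gene)
print(tab)
```

The result is a `table` object rather than a data frame; `as.data.frame.matrix(tab)` converts it if a data frame is needed.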