I am working on the output file from phyloseq object. I calculated the rel.abundance and extracted the columns I am interested in as follows: this is how I calculate the rel_abund
dat <- read_excel("selected_data.xlsx")%>% group_by(OTU)%>% mutate(rel_abund = Abundance/sum(Abundance))
dat
OTU Abundance SampleID Genotype rel_abund ASV2 4988 P35 genotype1 0.2801617614 ASV4 3894 1P-GH-R2 genotype2 0.9660133962 ASV7 3681 P53 genotype1 0.5047305636 ASV3 2149 P16 genotype4 0.3943842907
Then I need to calculate the summed relative abundance of each ASV in each genotype. Each genotype is represented by 1-5 samples, and I have 44 ASVs of 2,464 rows according to their occurrences in samples
To display each ASV along with its rel_abund across genotypes including sampleIDs, I tried this
dat %>%
count(OTU, SampleID, rel_abund, Genotype) %>%
pivot_wider(names_from = SampleID, values_from = n)
results in
OTU rel_abund Genotype 1P-R1 1P-R2 P1 .......
ASV1 0.0000000000 genotype11 1 1 NA NA NA NA NA
ASV1 0.0000000000 genotype2 NA NA 1 NA NA NA NA
Then,
dat %>%
group_by(OTU, Genotype) %>%
summarize(Summed_rel_abund = sum(rel_abund >= 0, na.rm = TRUE))
gives:
OTU Genotype Summed_rel_abund
ASV1 genotype1 1
ASV1 genotype3 3
ASV1 genotype2 1
ASV1 genotype5 3
ASV1 genotype6 2
ASV10 genotype7 5
ASV10 genotype8 5
I do not how the sum is integers and the summed values are decimal fractions. I doubt the last step and I need a correction, please! Thanks
Currently, sum(rel_abund >= 0)
is summing the TRUE
values of the >= 0
test, counting each as 1, therefore effectively just counting. To sum the values where the value >= 0 try sum(rel_abund[rel_abund >= 0], na.rm = TRUE)
:
dat %>%
group_by(OTU, Genotype) %>%
summarize(Summed_rel_abund = sum(rel_abund[rel_abund >= 0], na.rm = TRUE))
#> # A tibble: 4 x 3
#> # Groups: OTU [4]
#> OTU Genotype Summed_rel_abund
#> <chr> <chr> <dbl>
#> 1 ASV2 genotype1 0.280
#> 2 ASV3 genotype4 0.394
#> 3 ASV4 genotype2 0.966
#> 4 ASV7 genotype1 0.505
Created on 2022-03-18 by the reprex package (v2.0.1)