I'm currently working with a dataset in tibble format with 714 rows. Each row corresponds to one sequence that is specific to a given virus, and several sequences can come from the same virus.
For example, the data contain 21 B19 sequences.
I want to add a new column to my tibble that groups all virus strains with few occurrences (fewer than 50) into one group ("Others"), while every strain with a high count stays in its own group, so that CMV remains CMV. In other words, every time a low-count strain occurs, 'newID' should be "Others" (see fig 1). Until now, I used 'mutate(newID = case_when(Origin == "CMV" ~ "CMV"))' and grouped the strains manually based on their counts (see Data figure), but there should be an easier, less hard-coded option, right?
Data:
1 B19 21
2 BKPyV 8
3 CMV 161
4 Covid-19 68
5 EBV 204
6 FLU-A 22
7 HAdV-C 10
8 hCoV 84
9 HHV-1 27
10 HHV-2 3
11 HHV-6B 1
12 HIV-1 18
13 HMPV 3
14 HPV 37
15 JCPyV 4
16 NWV 12
17 unknown 9
18 VACV 9
19 VZV 13
I hope you can help!
You can use fct_lump() from the forcats package (tidyverse).
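Here I keep the 4 viruses with the highest counts (in your data these are exactly the ones with at least 50 sequences) and lump the rest into "Other":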
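library(dplyr)    # provides %>% and mutate()
library(forcats)

data %>%
  mutate(virus = as.factor(virus)) %>%
  # keep the 4 levels with the largest summed `count`; the rest become "Other"
  mutate(newID = fct_lump(virus, 4, w = count))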
Output is:
# A tibble: 19 × 4
      id virus    count newID
   <dbl> <fct>    <dbl> <fct>
 1     1 B19         21 Other
 2     2 BKPyV        8 Other
 3     3 CMV        161 CMV
 4     4 Covid-19    68 Covid-19
 5     5 EBV        204 EBV
 6     6 FLU-A       22 Other
 7     7 HAdV-C      10 Other
 8     8 hCoV        84 hCoV
 9     9 HHV-1       27 Other
10    10 HHV-2        3 Other
11    11 HHV-6B       1 Other
12    12 HIV-1       18 Other
13    13 HMPV         3 Other
14    14 HPV         37 Other
15    15 JCPyV        4 Other
16    16 NWV         12 Other
17    17 unknown      9 Other
18    18 VACV         9 Other
19    19 VZV         13 Other
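Since the question asks for a cutoff (fewer than 50) rather than a fixed number of groups, fct_lump_min() from forcats may be a closer fit. Below is a minimal sketch on a shortened, hypothetical version of the summary table (the name `counts` and the subset of rows are my own choice for illustration); `other_level = "Others"` is set so the label matches the one asked for.

```r
library(dplyr)
library(forcats)

# Hypothetical, shortened summary table: one row per virus with its count
counts <- tribble(
  ~virus,     ~count,
  "B19",          21,
  "CMV",         161,
  "Covid-19",     68,
  "EBV",         204,
  "hCoV",         84,
  "HPV",          37
)

result <- counts %>%
  mutate(
    # levels whose weighted count is below `min` are lumped together
    newID = fct_lump_min(as.factor(virus), min = 50, w = count,
                         other_level = "Others")
  )

print(result)
```

With this approach the threshold is stated directly, so the grouping should not need to be revisited if another virus later crosses 50 sequences.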
I used the following data:
library(dplyr)
data <- tribble(
  ~id, ~virus,     ~count,
    1, "B19",          21,
    2, "BKPyV",         8,
    3, "CMV",         161,
    4, "Covid-19",     68,
    5, "EBV",         204,
    6, "FLU-A",        22,
    7, "HAdV-C",       10,
    8, "hCoV",         84,
    9, "HHV-1",        27,
   10, "HHV-2",         3,
   11, "HHV-6B",        1,
   12, "HIV-1",        18,
   13, "HMPV",          3,
   14, "HPV",          37,
   15, "JCPyV",         4,
   16, "NWV",          12,
   17, "unknown",       9,
   18, "VACV",          9,
   19, "VZV",          13
)
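The original tibble has one row per sequence rather than one row per virus, so the counts are not a column yet. A rough sketch of how that case could be handled with plain dplyr follows. The `Origin` column name comes from the question; the `sequences` table, the `n_seq` and `n_origin` names, and the three example viruses are assumptions made only to get runnable data.

```r
library(dplyr)
library(tidyr)

# Hypothetical per-sequence data: expand three counts into one row per sequence
sequences <- tibble(
  Origin = c("CMV", "B19", "HHV-2"),
  n_seq  = c(161, 21, 3)
) %>%
  uncount(n_seq)   # repeats each row n_seq times and drops the n_seq column

labelled <- sequences %>%
  add_count(Origin, name = "n_origin") %>%   # occurrences of each virus
  mutate(newID = if_else(n_origin < 50, "Others", Origin))

count(labelled, newID)
```

On per-sequence data, fct_lump_min(Origin, min = 50, other_level = "Others") without the `w` argument should give the same grouping as a factor.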