I'm new to coding in R and I am having some trouble. I am trying to find the values in a column above a certain percentile (X%ile) per group, and then combining all rows below that percentile into a row others for every group.
My situation is very similar to the question here: How to use fct_lump() to get the top n levels by group and put the rest in 'other'?
Where I am grouping by two columns and trying to add rows in the second and third columns which add the name "Others" in the second column and sums all values below a percentile in the third column in the same row.
I am working with a large dataframe (df) where I have the following columns: Year (class = Integer, ie 2007, 2008, ...), SciName (class = character), and Flowers (class = numeric)
I am able to filter and only show rows with a value above a certain percentile using:
df_filter <- df %>%
filter(Flowers > quantile(Flowers, 0.7))
view(df_filter)
However, I have not been able to find a way to add the others row I need
Following the accepted answer from the similar question I linked above, I have tried:
df_Others <- df %>%
ungroup() %>%
group_by(Year) %>%
arrange(desc(Flowers)) %>%
mutate(a = row_number(-Flowers)) %>%
mutate(SciName = case_when(a < (quantile(df$Flowers, 0.7)) ~ "Others", TRUE ~ as.character(SciName))) %>%
mutate(a = case_when(a < (quantile(df$Flowers, 0.7)) ~ "Others", TRUE ~ as.character(a))) %>%
group_by(Year, SciName, a) %>%
summarize(Flowers = sum(Flowers)) %>%
arrange(Year, a) %>%
select(-a)
View(df_Others)
...but this does not work
Any suggestions on how to do this would be greatly appreciated!
EDIT:
Input:
Year SciName Flowers
2004 Liliac 2000
2004 Rose 3000
2004 Daisy 10
2004 Lily 5
2005 Liliac 20
2005 Rose 3
2005 Daisy 1000
2005 Lily 5000
... ... ...
Expected Output:
Year SciName Flowers
2004 Liliac 2000
2004 Rose 3000
2004 Others 15
2005 Daisy 1000
2005 Lily 5000
2005 Others 23
... ... ...
Without your input data and your expected output, it is difficult to determine why your current approach is not working. I recommend something like the following:
library(dplyr)
# access iris, a build in dataset
data(iris)
df = iris %>%
# all the groups that you want to work within should be listed here
group_by(Species) %>%
# this creates a new column containing the threshold for each group
# replace Sepal.Length with the numeric column you want the percentile of
mutate(threshold = quantile(Sepal.Length, 0.7)) %>%
# in case of duplicate values in Sepal.Length
# creating row-numbers to avoid merging them
mutate(rn = row_number()) %>%
# setup variable that contains the desired output groups
mutate(new_group = ifelse(Sepal.Length < threshold, "other", rn))
# pause here to inspect and confirm the new_group column gives the preferred groups
df = df %>%
group_by(Species, new_group) %>%
# summarise values as desired
summarise(num = n()) %>%
select(Species, Sepal.Length, num)
EDIT: In response to your input structure and expected output, I recommend the following:
output = df %>%
group_by(Year) %>%
mutate(threshold = quantile(Flowers, 0.7)) %>%
mutate(new_group = ifelse(Flowers < threshold, "Others", SciName)) %>%
group_by(Year, new_group) %>%
summarise(Flowers = sum(Flowers)) %>%
select(Year, SciName = new_group, Flowers)
As I am not using row numbers here, I am assuming SciName
identities are unique.