I have a data set of items downloaded from a website based on reports we generate. The idea is to remove reports that are no longer needed based on the number of downloads. The logic is basically count all the reports for the last year that have been downloaded, check if they are outside of two absolute deviations around the median for the current year, check if the report has been downloaded within the last 4 weeks and if so how many times
I have the code below which doesn't work, I was wondering if anyone can help It gives me the error: for the n_recent_downloads section
reports <- c("Report_A","Report_B","Report_C","Report_D","Report_A","Report_A","Report_A","Report_D","Report_D","Report_D")
Week_no <- c(36,36,33,32,20,18,36,30,29,27)
New.Downloads <- data.frame (Report1 = reports, DL.Week = Week_no)
test <- New.Downloads %>%
group_by(report1) %>%
summarise(n_downloads = n(),
n_recent_downloads = ifelse(sum((as.integer(DL.Week) >= (as.integer(max(DL.Week))) - 4),value,0)))
Providing a reproducible example would make life a lot easier. Nonetheless I have modified your code to do what I think you were trying to achieve.
I've split it into two so you can see what is going on. I moved the ifelse
statement to a mutate
call which gives:
library(dplyr)
New.Downloads <- data.frame(
Report1 = c("Report_A","Report_B","Report_C","Report_D","Report_A","Report_A","Report_A","Report_D","Report_D","Report_D"),
DL.Week = as.numeric(c(36,36,33,32,20,18,36,30,29,27))
)
test <- New.Downloads %>%
group_by(Report1) %>%
mutate(
median = median(DL.Week),
mad = 2 * mad(DL.Week),
check = ifelse(DL.Week > median + mad | DL.Week < median - mad, 0, DL.Week)
)
test
Source: local data frame [10 x 5]
Groups: Report1
Report1 DL.Week median mad check
1 Report_A 36 28.0 23.7216 36
2 Report_B 36 36.0 0.0000 36
3 Report_C 33 33.0 0.0000 33
4 Report_D 32 29.5 4.4478 32
5 Report_A 20 28.0 23.7216 20
6 Report_A 18 28.0 23.7216 18
7 Report_A 36 28.0 23.7216 36
8 Report_D 30 29.5 4.4478 30
9 Report_D 29 29.5 4.4478 29
10 Report_D 27 29.5 4.4478 27
Note that from your example none of the values are classed as extreme relative to the median + 2 * mad
criterion, so the check
values are identical to DL.week
.
You can then chain a summarise
onto the end of this to give you the sums.
test %>%
summarise(
n_recent_downloads = sum(check)
)
Source: local data frame [4 x 2]
Report1 n_recent_downloads
1 Report_A 110
2 Report_B 36
3 Report_C 33
4 Report_D 118