Example data:
library(data.table)
set.seed(1)
DT <- data.table(panelID = sample(10,10),
some_NA = sample(0:5, 6),
some_NA_factor = sample(0:5, 6),
Group = c(rep(1,20),rep(2,20),rep(3,20),rep(4,20),rep(5,20)),
Time = rep(seq(as.Date("2010-01-03"), length=20, by="1 month") - 1,5),
wt = 15*round(runif(100)/10,2),
Income = round(rnorm(10,-5,5),2),
Happiness = sample(10,10),
Sex = round(rnorm(10,0.75,0.3),2),
Age = sample(100,100),
Height= 150*round(rnorm(10,0.75,0.3),2))
# ERRORS
DT [1:5,11] <- 0
I have some errors in the Height
registration of my data. Since the data is panel data, I should be able to deduce the actual Height
from the other observations. To automise this process I thought of replacing a value by the median if the value is more than 50cm away from the median:
setDT(DT)[abs(median(Height, na.rm = TRUE) - Height) > 50 , newheight := median(Height, na.rm = TRUE), by=panelID]
Somehow the by
argument is however not working, leading to the following result:
> table(DT$newheight)
0 27 165
4 10 10
Could anyone tell me what I am doing wrong?
Your problem is in the order of operations. First, the filter (i.e. the bit before the comma) is being applied to the whole table, so the median calculated here is the median of the whole Height column. Then, the bit after the comma is being applied within each group, to the filtered data. So you're actually using the median of the outliers, rather than the median of the whole group (which I assume is what you intended).
One fix could be to do it in stages:
setDT(DT)
DT[, median.height:= median(Height, na.rm = TRUE), by='panelID']
DT[abs(Height - median.height) > 50, newheight:= median.height]
Or, it's possible you intended replacing the last line with this instead:
DT[, newheight:= fifelse(abs(Height - median.height) > 50, median.height, Height)]