I have a large dataset of which the following is a sample:
# A tibble: 450 x 546
matchcode idstd year country wt region income industry sector ownership exporter c201 c202 c203a c203b c203c c203d c2041 c2042 c205a c205b1 c205b2 c205b3 c205b4 c205b5 c205b6 c205b7 c205b8 c205b9 c205b10 c205c c205d c206a c206b c2071
<int+lbl> <dbl> <dbl> <int+l> <dbl> <dbl+> <dbl+> <dbl+lb> <dbl+> <dbl+lbl> <dbl+lb> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl+> <dbl+> <dbl+> <dbl+> <dbl+> <dbl+> <dbl+> <dbl+> <dbl+> <dbl+l> <dbl> <dbl> <dbl> <dbl> <dbl>
1 "BGD 200~ 2474 2002 Bangla~ 0.9 6 1 3 1 2 1 1994 2 100 0 0 NA 2 NA NA NA NA NA NA NA NA NA NA NA NA 1 2 NA NA 1
2 "BGD 200~ 2717 2002 Bangla~ 1.2 6 1 2 1 2 2 1986 4 100 0 0 NA 2 NA NA NA NA NA NA NA NA NA NA NA NA 1 1 2 2 2
3 "BGD 200~ 2410 2002 Bangla~ 0.8 6 1 3 1 2 1 1999 4 100 0 0 NA 2 NA NA NA NA NA NA NA NA NA NA NA NA 2 NA NA NA 1
4 "BRA 200~ 14917 2003 Brazil~ NA 4 2 8 1 2 2 1984 2 100 0 0 0 2 NA 50 1 NA NA NA NA NA NA NA NA 1 1 1 1 2 3
5 "BRA 200~ 14546 2003 Brazil~ NA 4 2 2 1 2 2 1976 2 100 0 0 0 2 NA 50 1 NA NA NA NA NA NA NA NA 1 1 1 1 2 4
6 "BRA 200~ 14709 2003 Brazil~ NA 4 2 3 1 2 2 1990 2 100 0 0 0 2 NA 100 NA 1 NA NA NA NA NA NA NA NA 1 1 1 2 4
7 "KHM 200~ 16475 2003 Cambod~ NA 2 1 20 2 2 2 1999 2 100 0 0 0 2 NA 100 NA NA NA 1 NA NA NA NA NA NA NA NA 1 2 1
8 "KHM 200~ 16298 2003 Cambod~ NA 2 1 4 3 2 2 1993 4 100 0 0 0 2 NA 100 1 NA NA NA NA NA NA NA NA NA 1 2 1 2 4
9 "KHM 200~ 16036 2003 Cambod~ 1.2 2 1 21 2 2 2 1997 2 100 0 0 0 2 NA 100 NA 1 NA NA NA NA NA NA NA NA 1 1 1 2 1
10 "CHN 200~ 17862 2002 China2~ 1.1 2 2 18 2 2 2 1993 3 49 0 51 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 2
which I summarised using the following data.table solution:
dfclevel= df[, lapply(.SD, weighted.mean, wt, na.rm = TRUE), .SDcols = cols, by=matchcode]
In this operation, 7 factor variables, which represent strings, are dropped, because there is obviously no weighted mean of a country name. However, since I take the mean by matchcode, which is a combination of year and country, the information in the dropped variables would still make sense (it should be identical within each matchcode). For this and other reasons, I would like to keep these columns.
So the big question is, how do I do this? Can I temporarily convert these factors to numeric values and convert them back after applying the weighted mean across the columns?
I solved my problem by subsetting the factor columns and sampling one row of those factors per matchcode. This gives the desired result because, as explained in the original post, the factor columns do not vary within a matchcode:
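library(data.table)
df <- as.data.frame(df)
is.fact <- sapply(df, is.factor)  # TRUE for every factor column; matchcode must be one of them
dffactors <- data.table(df[, is.fact, drop = FALSE])  # keep only the factor columns
df <- data.table(df)
# Draw one row per matchcode; any row will do because the factors are constant within a group
dffactors <- dffactors[, .SD[sample(.N, min(1, .N))], by = matchcode]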
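Sampling one row is only safe if the factor columns really are constant within each matchcode. A minimal sketch of how that could be checked, on made-up toy data (the values, `fac_cols`, `n_distinct` and `all_constant` are illustrative names, not part of the original code):

```r
library(data.table)

# Toy data; the column names mirror the post, the values are made up.
toy <- data.table(
  matchcode = factor(c("BGD 2002", "BGD 2002", "BRA 2003", "BRA 2003")),
  country   = factor(c("Bangladesh", "Bangladesh", "Brazil", "Brazil")),
  region    = factor(c("6", "6", "4", "4"))
)

# Hypothetical helper vector: the factor columns other than the grouping key.
fac_cols <- setdiff(names(toy)[sapply(toy, is.factor)], "matchcode")

# Number of distinct values of each factor column within each matchcode.
n_distinct <- toy[, lapply(.SD, uniqueN), by = matchcode, .SDcols = fac_cols]

# TRUE only when every factor column has exactly one value per matchcode.
all_constant <- all(unlist(n_distinct[, fac_cols, with = FALSE]) == 1L)
print(all_constant)
```

If `all_constant` comes out FALSE, the sampled row would depend on chance, so it is probably worth inspecting `n_distinct` before merging.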
I then took the weighted mean of the original df:
dfclevel= df[, lapply(.SD, weighted.mean, wt, na.rm = TRUE), .SDcols = cols, by=matchcode]
And merged the sample with dfclevel:
dfclevel <- merge(dfclevel , dffactors, by= "matchcode", all.x = TRUE, allow.cartesian=FALSE)
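A possible alternative to the sample-and-merge steps, not what I did above: because the factor columns are constant within matchcode, they can be added to `by`, and data.table then carries them through in one call. This is a sketch on made-up toy data; `toy`, `fac_cols`, `res` and the contents of `cols` are assumptions for illustration.

```r
library(data.table)

# Toy data; the column names mirror the post, the values are made up.
toy <- data.table(
  matchcode = factor(c("BGD 2002", "BGD 2002", "BRA 2003", "BRA 2003")),
  country   = factor(c("Bangladesh", "Bangladesh", "Brazil", "Brazil")),
  wt        = c(1, 3, 2, 2),
  c201      = c(1990, 1994, 1980, 1984),
  c202      = c(2, 4, 2, NA)
)

cols     <- c("c201", "c202")  # numeric columns to average (assumed)
fac_cols <- c("country")       # factor columns constant within matchcode (assumed)

# Grouping by matchcode plus the constant factor columns yields the same
# groups as grouping by matchcode alone, but keeps the factors in the result.
res <- toy[, lapply(.SD, weighted.mean, w = wt, na.rm = TRUE),
           by = c("matchcode", fac_cols), .SDcols = cols]
print(res)
```

This only gives one row per matchcode if the factor columns really do not vary within a group; otherwise a matchcode is split into several rows.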