I have an unbalanced panel data set as follows:
library(data.table)
unbalanced.panel = structure(list(firm = c("A", "A", "A", "A", "B", "B", "B", "C",
"C", "D", "D"), year = c(2010,
2011, 2012, 2013, 2010, 2011, 2012, 2011, 2012, 2012, 2013),
charac1 = c("x", "x", "x", "x", "y", "y", "z", "z", "g",
"h", "h"), var1 = c(11, 12, 13, 14, 15, 18, 15, 29, 31, 13,
2)), row.names = c(NA, -11L), class = c("tbl_df", "tbl",
"data.frame"))
I would like to get a random sample of, say, a fraction 0.5 of the firms from this dataset.
I previously used the following function:
group_sampler <- function(data, group_col, sample_fraction){
  data <- data.table(data)
  # this function samples sample_fraction <0,1> from each group in the data.table
  # inputs:
  #   data - data.table
  #   group_col - column(s) used to group by
  #   sample_fraction - a value between 0 and 1 indicating what % of each group should be sampled
  data[, .SD[sample(.N, ceiling(.N * sample_fraction))], by = eval(group_col)]
}
But this does not work for the (unbalanced) panel, because it samples rows within each firm rather than whole firms.
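What goes wrong can be checked by looking at which firms survive the sampling. The snippet below is a minimal sketch that applies the same row-within-group sampling to a hypothetical copy of the panel named `dt` (the `charac1` column is left out for brevity):

```r
library(data.table)

# Hypothetical copy of the panel above, built directly as a data.table
dt <- data.table(
  firm = c("A", "A", "A", "A", "B", "B", "B", "C", "C", "D", "D"),
  year = c(2010, 2011, 2012, 2013, 2010, 2011, 2012, 2011, 2012, 2012, 2013),
  var1 = c(11, 12, 13, 14, 15, 18, 15, 29, 31, 13, 2)
)

# Same sampling expression as in the last line of group_sampler()
rows_per_firm <- dt[, .SD[sample(.N, ceiling(.N * 0.5))], by = firm]

# Every firm is still present; only the number of rows per firm shrinks
firms_kept <- unique(rows_per_firm$firm)
rows_kept  <- rows_per_firm[, .N, by = firm]
```

Here `firms_kept` still contains all four firms, so the result is a thinned panel and not a sample of firms.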
I thought I should probably just randomly sample (all) the firms first:
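sample_fraction <- 0.5
# sample from the distinct firms, so firms with more years are not more likely to be drawn
firms <- unique(unbalanced.panel$firm)
N <- length(firms)
sampled_firms <- sample(firms, ceiling(N * sample_fraction))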
And then subset the panel with these sampled firms:
unbalanced.panel_A <- setDT(unbalanced.panel)[firm %in% sampled_firms,]
Then create a second data set with the rows that are not present in the first one:
unbalanced.panel_B <- setDT(unbalanced.panel)[!unbalanced.panel_A, on = names(unbalanced.panel)]
This works, but I would like to use the function and the grouping option.
How should I adapt the function to sample the panel?
EDIT:
I am wondering whether it is possible to simply sample unique rows (by firm) directly in the last line of the function.
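Try this, which draws the same number of rows from every group (the smallest ceiling(n * frac) across groups):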
group_sampler <- function(data, group_col, frac) {
  data <- as.data.table(data)
  counts <- data[, .(n = .N), by = eval(group_col)]
  sample_size <- min(ceiling(counts$n * frac))
  data[, .SD[sample(.N, size = sample_size),], by = eval(group_col)]
}
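To see what `sample_size` works out to for this panel, the arithmetic inside the function can be reproduced by hand. This is only a walk-through of that one step, using the firm column from the question; the names `firms`, `size_half` and `size_80` are made up for the illustration:

```r
library(data.table)

# Firm column copied from the question's data
firms <- data.table(firm = c("A", "A", "A", "A", "B", "B", "B", "C", "C", "D", "D"))

# Rows per firm: A = 4, B = 3, C = 2, D = 2
counts <- firms[, .(n = .N), by = firm]

# frac = 0.5: ceiling(c(2, 1.5, 1, 1)) is c(2, 2, 1, 1), and the minimum is 1
size_half <- min(ceiling(counts$n * 0.5))

# frac = 0.8: ceiling(c(3.2, 2.4, 1.6, 1.6)) is c(4, 3, 2, 2), and the minimum is 2
size_80 <- min(ceiling(counts$n * 0.8))
```

Taking the minimum means the smallest firm sets the sample size for every firm, which keeps `sample()` from asking for more rows than a group has.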
group_sampler(unbalanced.panel, "firm", 0.5)
# firm year charac1 var1
# <char> <num> <char> <num>
# 1: A 2012 x 13
# 2: B 2011 y 18
# 3: C 2012 g 31
# 4: D 2012 h 13
group_sampler(unbalanced.panel, "firm", 0.8)
# firm year charac1 var1
# <char> <num> <char> <num>
# 1: A 2010 x 11
# 2: A 2011 x 12
# 3: B 2012 z 15
# 4: B 2011 y 18
# 5: C 2011 z 29
# 6: C 2012 g 31
# 7: D 2013 h 2
# 8: D 2012 h 13
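Note that the function above keeps every firm and thins the rows within it. If the goal is instead to keep complete firms (all years of a random fraction of the firms), one possible adaptation is to sample the distinct group values first and then subset on them. The sketch below is an assumption about what is wanted, not part of the original function: the name `group_sampler_whole`, the returned list elements `sampled` and `rest`, and the copy of the data named `dt` are all hypothetical, and it handles a single grouping column only.

```r
library(data.table)

# Hypothetical variant: sample whole groups instead of rows within groups
group_sampler_whole <- function(data, group_col, frac) {
  data <- as.data.table(data)
  groups <- unique(data[[group_col]])
  # sample.int() on positions avoids sample()'s special case for a single number
  keep <- groups[sample.int(length(groups), ceiling(length(groups) * frac))]
  in_sample <- data[[group_col]] %in% keep
  # sampled: all rows of the drawn groups; rest: all remaining rows
  list(sampled = data[in_sample], rest = data[!in_sample])
}

# Hypothetical copy of the question's panel
dt <- data.table(
  firm = c("A", "A", "A", "A", "B", "B", "B", "C", "C", "D", "D"),
  year = c(2010, 2011, 2012, 2013, 2010, 2011, 2012, 2011, 2012, 2012, 2013),
  charac1 = c("x", "x", "x", "x", "y", "y", "z", "z", "g", "h", "h"),
  var1 = c(11, 12, 13, 14, 15, 18, 15, 29, 31, 13, 2)
)

set.seed(1)  # only so the sketch is reproducible
res <- group_sampler_whole(dt, "firm", 0.5)
```

With four firms and `frac = 0.5`, `res$sampled` holds every row of two randomly drawn firms and `res$rest` holds the other two, so the two pieces correspond to the `unbalanced.panel_A` and `unbalanced.panel_B` split described in the question.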