Tags: r, data.table, sample, panel-data

Sample unbalanced panel data


I have unbalanced panel data as follows:

library(data.table)
unbalanced.panel = structure(list(firm = c("A", "A", "A", "A", "B", "B", "B", "C", 
"C", "D", "D"), year = c(2010, 
2011, 2012, 2013, 2010, 2011, 2012, 2011, 2012, 2012, 2013), 
    charac1 = c("x", "x", "x", "x", "y", "y", "z", "z", "g", 
    "h", "h"), var1 = c(11, 12, 13, 14, 15, 18, 15, 29, 31, 13, 
    2)), row.names = c(NA, -11L), class = c("tbl_df", "tbl", 
"data.frame"))

I would like to get a random sample of, say, 50% of the firms (a fraction of 0.5) from this dataset.

I previously used the following function:

group_sampler <- function(data, group_col, sample_fraction){
  data <- data.table(data)
  # this function samples sample_fraction <0,1> from each group in the data.table
  # inputs:
  #   data - data.table
  #   group_col - column(s) used to group by
  #   sample_fraction - a value between 0 and 1 indicating what % of each group should be sampled
  data[,.SD[sample(.N, ceiling(.N*sample_fraction))],by = eval(group_col)]
}

But this does not work for the (unbalanced) panel, because it samples rows within each firm rather than whole firms.

I thought I should probably just randomly sample (all) the firms first:

sample_fraction <- 0.5
firms <- unique(unbalanced.panel$firm)  # sample from distinct firms, not from rows
sampled_firms <- sample(firms, ceiling(length(firms) * sample_fraction))

And then subset the panel with these sampled firms:

unbalanced.panel_A <- as.data.table(unbalanced.panel)[firm %in% sampled_firms, ]

Then create a second data set containing the rows that are not present in the first one:

unbalanced.panel_B <- setDT(unbalanced.panel)[!unbalanced.panel_A, on = names(unbalanced.panel)]
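
The same split can also be sketched without an anti-join, by reusing the sampled firm vector for both halves. This is only a minimal illustration: the `panel` object and the names `panel_A` and `panel_B` are stand-ins introduced here, not objects from the post.

```r
library(data.table)

# Small illustrative panel with the same firm/year layout as the data above
panel <- data.table(
  firm = c("A", "A", "A", "A", "B", "B", "B", "C", "C", "D", "D"),
  year = c(2010, 2011, 2012, 2013, 2010, 2011, 2012, 2011, 2012, 2012, 2013)
)

set.seed(42)  # only to make the draw reproducible
firms <- unique(panel$firm)
sampled_firms <- sample(firms, ceiling(length(firms) * 0.5))

# Rows of sampled firms go to A, all remaining rows go to B
panel_A <- panel[firm %in% sampled_firms]
panel_B <- panel[!firm %in% sampled_firms]
```

Because both subsets are driven by the same vector, every firm ends up wholly in one of the two data sets.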

This works, but I would like to use the function and the grouping option.

How should I adapt the function to sample the panel?

EDIT:

I am wondering whether it is possible to simply sample unique firms directly in the last line of the function.
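
A minimal sketch of that idea, assuming the goal is to draw whole firms and keep all of their rows. The function name `sample_whole_groups` and the `panel` object are hypothetical and introduced only for illustration; this is one possible approach, not a confirmed solution.

```r
library(data.table)

# Hypothetical helper, not from the original post: samples whole groups
# and returns every row belonging to the sampled groups.
sample_whole_groups <- function(data, group_col, sample_fraction) {
  data <- as.data.table(data)
  # One row per distinct group (e.g. one row per firm)
  groups <- unique(data[, group_col, with = FALSE])
  n_keep <- ceiling(nrow(groups) * sample_fraction)
  sampled <- groups[sample(nrow(groups), n_keep)]
  # Inner join keeps all rows of the sampled groups
  data[sampled, on = group_col, nomatch = NULL]
}

# Small illustrative panel with the same firm/year layout as the data above
panel <- data.table(
  firm = c("A", "A", "A", "A", "B", "B", "B", "C", "C", "D", "D"),
  year = c(2010, 2011, 2012, 2013, 2010, 2011, 2012, 2011, 2012, 2012, 2013)
)

set.seed(1)  # only to make the draw reproducible
half <- sample_whole_groups(panel, "firm", 0.5)
```

With four firms and a fraction of 0.5, `half` should contain two firms together with all of their yearly rows. Rows come back grouped in the order the firms were drawn, not in the original row order.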


Solution

  • Try this:

    group_sampler <- function(data, group_col, frac) {
      data <- as.data.table(data)
      counts <- data[, .(n = .N), by = eval(group_col) ]
      sample_size <- min(ceiling(counts$n * frac))
      data[, .SD[sample(.N, size = sample_size),], by = eval(group_col)]
    }
    
    group_sampler(unbalanced.panel, "firm", 0.5)
    #      firm  year charac1  var1
    #    <char> <num>  <char> <num>
    # 1:      A  2012       x    13
    # 2:      B  2011       y    18
    # 3:      C  2012       g    31
    # 4:      D  2012       h    13
    group_sampler(unbalanced.panel, "firm", 0.8)
    #      firm  year charac1  var1
    #    <char> <num>  <char> <num>
    # 1:      A  2010       x    11
    # 2:      A  2011       x    12
    # 3:      B  2012       z    15
    # 4:      B  2011       y    18
    # 5:      C  2011       z    29
    # 6:      C  2012       g    31
    # 7:      D  2013       h     2
    # 8:      D  2012       h    13
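
To see why the two calls above return one and two rows per firm respectively, the sample-size arithmetic can be traced by hand. The snippet below is only an illustration: it uses the per-firm row counts from the example data (A = 4, B = 3, C = 2, D = 2), and the variable names are made up for this sketch. Note that this function samples rows within every firm rather than whole firms.

```r
# Row counts per firm in the example panel
n_per_firm <- c(A = 4, B = 3, C = 2, D = 2)

# frac = 0.5: per-firm sizes are 2, 2, 1, 1, so the common size is 1
rows_half <- min(ceiling(n_per_firm * 0.5))

# frac = 0.8: per-firm sizes are 4, 3, 2, 2, so the common size is 2
rows_most <- min(ceiling(n_per_firm * 0.8))
```

Taking the minimum across firms means the smallest firm determines how many rows every firm contributes.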