Search code examples
rpanel-data

R Winsorizing individual variables in different panel data years


I have a problem that I can not fix myself, having tried multiple ways myself.

Goal: I want to winsorize individual variables for different event years.

That is, I want to winsorize all observations for variable var 1 (and var 2, var 3 etc.) for each event year (-5,-4,-3...+4,+5)

Imagine the following data structure (pdata.frame, object of the plm package, behaves similarly to a normal data frame as far as I am aware):

#firm-year observations with different numeric variables, sequences only for illustrative purposes, true variables are empirical observations
firms <- c(rep("firm a",11), rep("firm b", 11))
years <- seq(-5,5,1)
var1 <- seq(0,2.1,0.1)
var2 <- seq(0,21,1)
var3 <- seq(0,1,1/21)

#create panel data
df <- data.frame(cbind(firms,years,var1,var2,var3))
library(plm)
p.df <- pdata.frame(df, index=c("firms","years"))

# NAs in my variables exist 
p.df[12,3] = NA
p.df[10,5] = NA

p.df

My data frame then looks like this:

           firms years var1 var2               var3
firm a--1 firm a    -1  0.4    4   0.19047619047619
firm a--2 firm a    -2  0.3    3  0.142857142857143
firm a--3 firm a    -3  0.2    2 0.0952380952380952
firm a--4 firm a    -4  0.1    1 0.0476190476190476
firm a--5 firm a    -5    0    0                  0
firm a-0  firm a     0  0.5    5  0.238095238095238
firm a-1  firm a     1  0.6    6  0.285714285714286
firm a-2  firm a     2  0.7    7  0.333333333333333
firm a-3  firm a     3  0.8    8  0.380952380952381
firm a-4  firm a     4  0.9    9               <NA>
firm a-5  firm a     5    1   10  0.476190476190476
firm b--1 firm b    -1 <NA>   15  0.714285714285714
firm b--2 firm b    -2  1.4   14  0.666666666666667
firm b--3 firm b    -3  1.3   13  0.619047619047619
firm b--4 firm b    -4  1.2   12  0.571428571428571
firm b--5 firm b    -5  1.1   11  0.523809523809524
firm b-0  firm b     0  1.6   16  0.761904761904762
firm b-1  firm b     1  1.7   17   0.80952380952381
firm b-2  firm b     2  1.8   18  0.857142857142857
firm b-3  firm b     3  1.9   19  0.904761904761905
firm b-4  firm b     4    2   20  0.952380952380952
firm b-5  firm b     5  2.1   21                  1

My obviously has more observations and variables.

The closest solution I found was by Masato. robustHD::winsorize, which Masato used, is not compatible with data that has NAs though. DescTools:Winsorize in turn does not accept data frames and will winsorize my data over all the variable columns which does not make sense.

I guess the solution lies somewhere in writing another for loop and using DescTools::Winsorize on the individual columns after having extracted them for single event years (-5 to +5). Another way would be to make the robustHD::winsorize function somehow handle NAs and apply it to the whole frame of var1 to var3.

I highly appreciate your help on this and would love to here your suggestions.

Best regards, cork


Solution

  • Try this:

    library(tidyverse)
    
    p.df%>%
      as.tibble()%>%
      group_by(years)%>%
      mutate(across(c(var1,var2,var3),as.double))%>%
      mutate(across(c(var1,var2,var3),~DescTools::Winsorize(.x,na.rm=T)))
    ```