Search code examples
rpaneldata-cleaning

How do I trim data over 99% in unbalanced panel data in R?


I have a unbalanced panel data. the simplified example as follows:

Firm Year  V1  V2 V3
AAA  2001  0   8   2    
AAA  2002  10  4   2    
BBB  1999  0   8   8    
BBB  2000  7   0   6    
BBB  2001  10  100 1    
BBB  2002  1   5   8    
CCC  2001  2   8   0    
CCC  2002 -98  3  10    

What I'd like to do is to trim outliers for each year across all frims for each Variable. The way of trim is to assign NA to those values > 99% and < 1% of the data. For example, for year 2000, for Var_1, assign NA to all the values that > 99% or < 1%. Idealy, after trimming the data will look like this:

Firm Year  V1  V2 V3
AAA  2001  0   8   2    
AAA  2002  10  4   2    
BBB  1999  0   8   8    
BBB  2000  7   0   6    
BBB  2001  10  NA  1    
BBB  2002  1   5   8    
CCC  2001  2   8   0    
CCC  2002  NA  3   10   

The real data is massive with lot of unbalanced data structure.


Solution

  • You can try the following to mark the outliers for each variable for each year at once.

    data.frame(lapply(df[3:5],                                         # variable indices 3:5
               function(y) ave(y, df$Year, FUN=function(x) 
                                 ifelse(x < quantile(x, .01) | x > quantile(x, .99), NA, x)))) 
    
    #   V1 V2 V3
    # 1 NA  8 NA
    # 2 NA  4 NA
    # 3  0  8  8
    # 4  7  0  6
    # 5 NA NA  1
    # 6  1 NA  8
    # 7  2  8 NA
    # 8 NA NA NA