Search code examples
rsummary

Summary of DF as matrix


I have been searching for quite a while now but couldn't really find a simple way. I have a df consisting of numerical values only and I want to Create a summary matrix out of my df.

DF
V1   V2   V3   V4   V5  ...
x1   y1   z1   1    c1  
x2   NA   z2   0    c2
x3   y3   z3   1    NA
...

V4 is originally a TRUE/FALSE variable tranformed into numerical variable which should usually work. I want to obtain the following:

    N   Mean  SD  Min  1st  Median  3rd  Max
V1
V2 
V3
V4
V5
...

with the according values for N, Mean, SD, Min, 1st, Median, 3rd, Max. I have tried the simple as.data.frame(summary(DF)) I have tried stargazer which doesn't work for some reason ( I am guessing because I have binal variables)

stargazer(DF, type= "html", nobs = TRUE, type="html", mean.sd = TRUE, median = TRUE, iqr = TRUE,
+           digits=2, align=T)

and I read something about qwraps2_summary_table. But they all seem to give a different "design" of table than what I am looking for.

I know I can also run a loop like:

for(i in (1:length(DF)){
sum$N<-(????)
sum$Mean<-mean(DF[i])
....}

But this is not the nicest solution. Any tipps? Thank you!

Here a part of my dataset

structure(list(Year = c(2011, 2012, 2013, 2014, 2015, 2016, 2017, 
2018, 2018, 2011), Occurences = c(9L, 9L, 9L, 9L, 9L, 9L, 9L, 
9L, 2L, 9L), Balance = c(-1.14, 1.05, -1.06, 1.01, 1.01, 1.01, 
-1.09, -1, -1.04, -1.03), Withdrawal = c(43200, 41080, 43400, 
43183, 42600, 42100, 45900, 46000, 3892008, 48374), Verification_SA = c(NA, 
NA, NA, NA, 1, 1, NA, 1, 1, NA), Classification_num = c(NA, NA, 
NA, NA, 3, 2, NA, 4, 4, NA), Interaction_Verification_Classification = c(NA, 
NA, NA, NA, 3, 2, NA, 4, 4, NA), KnowledgeSources = c(1, 1, 1, 
0, 1, 1, 1, 1, 1, 0), KnowledgeDischarge = c(0, 0, 0, 0, 0, 1, 
1, 1, 1, 0), Scarcity_watershed = c(NA_real_, NA_real_, NA_real_, 
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_
), Scarcity_country = c(NA, NA, NA, NA, NA, NA, NA, NA, 3.35, 
NA), Knowledge_Watershed = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0), Knowledge_Facilities = c(0, 
0, 0, 0, 0, 0, 0, 0, 1, 1), Importance_num = c(NA, NA, NA, 3, 
3, 3, 3, 3, 5, NA), DetrimentalImpacts_num = c(0, 0, 1, 0, 0, 
0, 0, 0, 0, 0), Responsibility_num = c(1, 1, 1, 2, 2, 2, 2, 3, 
3, 1)), row.names = c(NA, -10L), class = c("tbl_df", "tbl", "data.frame"
))

Solution

  • If someone finds this question later, building off @camille's suggestion to check this question. Here is one approach that is simple, although it does not provide the number of NAs.

    library(psych)
    my_summary <- do.call(rbind,lapply(DF,psych::describe,quant=c(0.25,0.75)))
    my_summary
    #                                        vars  n      mean         sd   median  trimmed     mad      min        max      range  skew kurtosis        se    Q0.25    Q0.75
    #Year                                       1 10   2014.50       2.72  2014.50  2014.50    3.71  2011.00    2018.00       7.00  0.00    -1.74      0.86  2012.25  2016.75
    #Occurences                                 1 10      8.30       2.21     9.00     9.00    0.00     2.00       9.00       7.00 -2.28     3.57      0.70     9.00     9.00
    #Balance                                    1 10     -0.23       1.07    -1.02    -0.27    0.15    -1.14       1.05       2.19  0.35    -2.05      0.34    -1.06     1.01
    #Withdrawal                                 1 10 428784.50 1216854.64 43300.00 44344.62 2535.25 41080.00 3892008.00 3850928.00  2.28     3.57 384803.22 42745.75 45975.00
    #Verification_SA                            1  4      1.00       0.00     1.00     1.00    0.00     1.00       1.00       0.00   NaN      NaN      0.00     1.00     1.00
    #Classification_num                         1  4      3.25       0.96     3.50     3.25    0.74     2.00       4.00       2.00 -0.32    -2.08      0.48     2.75     4.00
    ...
    

    Here is my original solution, using data.table.

    library(data.table)
    my_summary <- rbindlist(lapply(DF, function(x){
      as.data.frame(t(c(
        summary(x),
        SD = sd(x,na.rm=TRUE),
        N = sum(!is.na(x)))))
      })
      , fill = TRUE, use.names = TRUE,idcol="Variable")
    my_summary
    #                                   Variable     Min.   1st Qu.    Median          Mean  3rd Qu.       Max.           SD  N NA's
    # 1:                                    Year  2011.00  2012.250  2014.500   2014.500000  2016.75    2018.00 2.718251e+00 10   NA
    # 2:                              Occurences     2.00     9.000     9.000      8.300000     9.00       9.00 2.213594e+00 10   NA
    # 3:                                 Balance    -1.14    -1.055    -1.015     -0.228000     1.01       1.05 1.074800e+00 10   NA
    # 4:                              Withdrawal 41080.00 42745.750 43300.000 428784.500000 45975.00 3892008.00 1.216855e+06 10   NA
    # 5:                         Verification_SA     1.00     1.000     1.000      1.000000     1.00       1.00 0.000000e+00  4    6
    # 6:                      Classification_num     2.00     2.750     3.500      3.250000     4.00       4.00 9.574271e-01  4    6