I have been searching for quite a while now but couldn't really find a simple way. I have a df consisting of numerical values only and I want to Create a summary matrix out of my df.
DF
V1 V2 V3 V4 V5 ...
x1 y1 z1 1 c1
x2 NA z2 0 c2
x3 y3 z3 1 NA
...
V4 is originally a TRUE/FALSE variable tranformed into numerical variable which should usually work. I want to obtain the following:
N Mean SD Min 1st Median 3rd Max
V1
V2
V3
V4
V5
...
with the according values for N, Mean, SD, Min, 1st, Median, 3rd, Max.
I have tried the simple
as.data.frame(summary(DF))
I have tried stargazer which doesn't work for some reason ( I am guessing because I have binal variables)
stargazer(DF, type= "html", nobs = TRUE, type="html", mean.sd = TRUE, median = TRUE, iqr = TRUE,
+ digits=2, align=T)
and I read something about qwraps2_summary_table. But they all seem to give a different "design" of table than what I am looking for.
I know I can also run a loop like:
for(i in (1:length(DF)){
sum$N<-(????)
sum$Mean<-mean(DF[i])
....}
But this is not the nicest solution. Any tipps? Thank you!
Here a part of my dataset
structure(list(Year = c(2011, 2012, 2013, 2014, 2015, 2016, 2017,
2018, 2018, 2011), Occurences = c(9L, 9L, 9L, 9L, 9L, 9L, 9L,
9L, 2L, 9L), Balance = c(-1.14, 1.05, -1.06, 1.01, 1.01, 1.01,
-1.09, -1, -1.04, -1.03), Withdrawal = c(43200, 41080, 43400,
43183, 42600, 42100, 45900, 46000, 3892008, 48374), Verification_SA = c(NA,
NA, NA, NA, 1, 1, NA, 1, 1, NA), Classification_num = c(NA, NA,
NA, NA, 3, 2, NA, 4, 4, NA), Interaction_Verification_Classification = c(NA,
NA, NA, NA, 3, 2, NA, 4, 4, NA), KnowledgeSources = c(1, 1, 1,
0, 1, 1, 1, 1, 1, 0), KnowledgeDischarge = c(0, 0, 0, 0, 0, 1,
1, 1, 1, 0), Scarcity_watershed = c(NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_
), Scarcity_country = c(NA, NA, NA, NA, NA, NA, NA, NA, 3.35,
NA), Knowledge_Watershed = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0), Knowledge_Facilities = c(0,
0, 0, 0, 0, 0, 0, 0, 1, 1), Importance_num = c(NA, NA, NA, 3,
3, 3, 3, 3, 5, NA), DetrimentalImpacts_num = c(0, 0, 1, 0, 0,
0, 0, 0, 0, 0), Responsibility_num = c(1, 1, 1, 2, 2, 2, 2, 3,
3, 1)), row.names = c(NA, -10L), class = c("tbl_df", "tbl", "data.frame"
))
If someone finds this question later, building off @camille's suggestion to check this question. Here is one approach that is simple, although it does not provide the number of NA
s.
library(psych)
my_summary <- do.call(rbind,lapply(DF,psych::describe,quant=c(0.25,0.75)))
my_summary
# vars n mean sd median trimmed mad min max range skew kurtosis se Q0.25 Q0.75
#Year 1 10 2014.50 2.72 2014.50 2014.50 3.71 2011.00 2018.00 7.00 0.00 -1.74 0.86 2012.25 2016.75
#Occurences 1 10 8.30 2.21 9.00 9.00 0.00 2.00 9.00 7.00 -2.28 3.57 0.70 9.00 9.00
#Balance 1 10 -0.23 1.07 -1.02 -0.27 0.15 -1.14 1.05 2.19 0.35 -2.05 0.34 -1.06 1.01
#Withdrawal 1 10 428784.50 1216854.64 43300.00 44344.62 2535.25 41080.00 3892008.00 3850928.00 2.28 3.57 384803.22 42745.75 45975.00
#Verification_SA 1 4 1.00 0.00 1.00 1.00 0.00 1.00 1.00 0.00 NaN NaN 0.00 1.00 1.00
#Classification_num 1 4 3.25 0.96 3.50 3.25 0.74 2.00 4.00 2.00 -0.32 -2.08 0.48 2.75 4.00
...
Here is my original solution, using data.table
.
library(data.table)
my_summary <- rbindlist(lapply(DF, function(x){
as.data.frame(t(c(
summary(x),
SD = sd(x,na.rm=TRUE),
N = sum(!is.na(x)))))
})
, fill = TRUE, use.names = TRUE,idcol="Variable")
my_summary
# Variable Min. 1st Qu. Median Mean 3rd Qu. Max. SD N NA's
# 1: Year 2011.00 2012.250 2014.500 2014.500000 2016.75 2018.00 2.718251e+00 10 NA
# 2: Occurences 2.00 9.000 9.000 8.300000 9.00 9.00 2.213594e+00 10 NA
# 3: Balance -1.14 -1.055 -1.015 -0.228000 1.01 1.05 1.074800e+00 10 NA
# 4: Withdrawal 41080.00 42745.750 43300.000 428784.500000 45975.00 3892008.00 1.216855e+06 10 NA
# 5: Verification_SA 1.00 1.000 1.000 1.000000 1.00 1.00 0.000000e+00 4 6
# 6: Classification_num 2.00 2.750 3.500 3.250000 4.00 4.00 9.574271e-01 4 6