Tags: r, statistics, summary, describe

R convert summary result (statistics with all dataframe columns) into dataframe


[I'm new to R...] I have this dataframe:

df1 <- data.frame(c(2,1,2), c(1,2,3,4,5,6), seq(141,170)) #create data.frame
names(df1) <- c('gender', 'age', 'height') #column names

I want the df1's summary in a dataframe object that looks like this:

         count     mean    std      min      25%      50%      75%      max
age    30.0000   3.5000 1.7370   1.0000   2.0000   3.5000   5.0000   6.0000
gender 30.0000   1.6667 0.4795   1.0000   1.0000   2.0000   2.0000   2.0000
height 30.0000 155.5000 8.8034 141.0000 148.2500 155.5000 162.7500 170.0000

I've generated this in Python with df1.describe().T. How can I do this in R?

It would be a bonus if the summary dataframe also contained "dtype", "null" (number of NULL values), "unique" (number of unique values) and "range", to give a comprehensive set of summary statistics:

         count     mean    std      min      25%      50%      75%      max  null  unique  range  dtype
age    30.0000   3.5000 1.7370   1.0000   2.0000   3.5000   5.0000   6.0000     0       6      5  int64
gender 30.0000   1.6667 0.4795   1.0000   1.0000   2.0000   2.0000   2.0000     0       2      1  int64
height 30.0000 155.5000 8.8034 141.0000 148.2500 155.5000 162.7500 170.0000     0      30     29  int64

The Python code that produces the above result is:

df1.describe().T.join(pd.DataFrame(df1.isnull().sum(), columns=['null']))\
    .join(pd.DataFrame.from_dict({i:df1[i].nunique() for i in df1.columns}, orient='index')\
    .rename(columns={0:'unique'}))\
    .join(pd.DataFrame.from_dict({i:(df1[i].max() - df1[i].min()) for i in df1.columns}, orient='index')\
    .rename(columns={0:'range'}))\
    .join(pd.DataFrame(df1.dtypes, columns=['dtype']))

Thank you!


Solution

  • I commonly use a little function (adapted from a script found on the net) to do this kind of transformation:

    sumstats <- function(x) {
      # Per-column helpers. Non-numeric columns get the placeholder "N*N"
      # where a check is present; range.k and the quantiles assume numeric columns.
      null.k <- function(x) sum(is.na(x))                  # number of NAs
      unique.k <- function(x) {if (sum(is.na(x)) > 0) length(unique(x)) - 1
        else length(unique(x))}                            # distinct values, NA not counted
      range.k <- function(x) max(x, na.rm=TRUE) - min(x, na.rm=TRUE)
      mean.k <- function(x) {if (is.numeric(x)) round(mean(x, na.rm=TRUE), digits=2)
        else "N*N"}
      sd.k <- function(x) {if (is.numeric(x)) round(sd(x, na.rm=TRUE), digits=2)
        else "N*N"}
      var.k <- function(x) {if (is.numeric(x)) round(var(x, na.rm=TRUE), digits=2)
        else "N*N"}
      min.k <- function(x) {if (is.numeric(x)) round(min(x, na.rm=TRUE), digits=2)
        else "N*N"}
      q05 <- function(x) quantile(x, probs=.05, na.rm=TRUE)
      q10 <- function(x) quantile(x, probs=.1, na.rm=TRUE)
      q25 <- function(x) quantile(x, probs=.25, na.rm=TRUE)
      q50 <- function(x) quantile(x, probs=.5, na.rm=TRUE)
      q75 <- function(x) quantile(x, probs=.75, na.rm=TRUE)
      q90 <- function(x) quantile(x, probs=.9, na.rm=TRUE)
      q95 <- function(x) quantile(x, probs=.95, na.rm=TRUE)
      max.k <- function(x) {if (is.numeric(x)) round(max(x, na.rm=TRUE), digits=2)
        else "N*N"}

      # One row per column of x, one column per statistic.
      sumtable <- cbind(as.matrix(colSums(!is.na(x))), sapply(x, null.k), sapply(x, unique.k),
                        sapply(x, range.k), sapply(x, mean.k), sapply(x, sd.k), sapply(x, var.k),
                        sapply(x, min.k), sapply(x, q05), sapply(x, q10), sapply(x, q25), sapply(x, q50),
                        sapply(x, q75), sapply(x, q90), sapply(x, q95), sapply(x, max.k))

      sumtable <- as.data.frame(sumtable)
      names(sumtable) <- c('count', 'null', 'unique', 'range', 'mean', 'std', 'var', 'min',
                           '5%', '10%', '25%', '50%', '75%', '90%', '95%', 'max')
      return(sumtable)
    }
    sumstats(df1)
            count   null    unique  range   mean    std     var     min     5%      10%     25%     50%     75%     90%     95%     max
    gender  30.00   0.00    2.00    1.00    1.67    0.48    0.23    1.00    1.00    1.00    1.00    2.00    2.00    2.00    2.00    2.00
    age     30.00   0.00    6.00    5.00    3.50    1.74    3.02    1.00    1.00    1.00    2.00    3.50    5.00    6.00    6.00    6.00
    height  30.00   0.00    30.00   29.00   155.50  8.80    77.50   141.00  142.45  143.90  148.25  155.50  162.75  167.10  168.55  170.00
    

    You can easily adapt it to add or drop descriptive columns (other quantiles, a dtype column, etc.). It returns a data.frame. You may also want to expose the NA handling as an argument rather than hard-coding na.rm=TRUE.

    Hope it helps.
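
As a rough illustration of the adaptation suggested above, here is a minimal sketch that reproduces the column layout asked for in the question (count, mean, std, min, quartiles, max, null, unique, range, dtype) and takes the NA handling as an argument. The function name `describe_df` and its `na.rm` parameter are my own hypothetical names, not part of the original answer, and the sketch simply skips non-numeric columns.

```r
# Hypothetical helper: a describe()-style summary, one row per numeric column.
describe_df <- function(df, na.rm = TRUE) {
  # Keep numeric columns only (an assumption of this sketch).
  num <- df[vapply(df, is.numeric, logical(1))]

  stats <- sapply(num, function(col) {
    # quantile() raises an error on NAs when na.rm = FALSE, so return NA instead.
    q <- if (!na.rm && anyNA(col)) rep(NA_real_, 3)
         else quantile(col, probs = c(0.25, 0.5, 0.75), na.rm = na.rm, names = FALSE)
    c(count  = sum(!is.na(col)),
      mean   = mean(col, na.rm = na.rm),
      std    = sd(col, na.rm = na.rm),
      min    = min(col, na.rm = na.rm),
      `25%`  = q[1], `50%` = q[2], `75%` = q[3],
      max    = max(col, na.rm = na.rm),
      null   = sum(is.na(col)),
      unique = length(unique(col[!is.na(col)])),
      range  = max(col, na.rm = na.rm) - min(col, na.rm = na.rm))
  })

  # sapply() gives statistics in rows; transpose so each column of df is a row.
  out <- as.data.frame(t(stats))
  out$dtype <- unname(vapply(num, function(col) class(col)[1], character(1)))
  out
}

# Usage with the data frame from the question.
df1 <- data.frame(c(2,1,2), c(1,2,3,4,5,6), seq(141,170))
names(df1) <- c('gender', 'age', 'height')
summary_df <- describe_df(df1)
print(summary_df)
```

The dtype column reports R classes (such as "numeric" or "integer") rather than pandas dtypes like int64, and the values are left unrounded so they can be formatted later with round() or format().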