
Summary and descriptive table for mixed data in R


I want to write a function that calculates a set of predetermined summary statistics and that I can apply to any dataset. I'll start with an example here, but the function needs to handle datasets with a variety of data types: character, factor, numeric, dates, columns containing null values, etc.

I can do this easily enough if the data is all numeric, but handling the IF scenarios with apply, sapply, etc. is where I run into trouble with the syntax. When it's all numeric I'm fine, since I can just do new_df = data.frame(min = sapply(mydf, min), ...) and so on. I just can't get the syntax right when it's more complicated, as in my example below.
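
For reference, the all-numeric case described above can be sketched as follows. The data frame mydf and its columns a and b are made up purely for illustration; only the sapply pattern comes from the question.

```r
# Hypothetical all-numeric data frame, used only for illustration
mydf <- data.frame(a = c(1, 2, 3), b = c(10, 20, 30))

# sapply() visits each column in turn, so no margin argument is needed
new_df <- data.frame(min  = sapply(mydf, min),
                     max  = sapply(mydf, max),
                     mean = sapply(mydf, mean))
print(new_df)
```

Each sapply call returns one value per column, so new_df has one row per original column and one column per statistic.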

In the example below I have a data frame of 3 columns:

  • all numerical
  • numerical with nulls (NA)
  • categorical column of data coded as a factor

I want to calculate the:

  • type (character, factor, date, numeric, etc.)
  • mean, when the data type is numeric, excluding nulls
  • number of null values in each column

I think this is simple enough that I can run with it from here.

Copy and paste this code and assign it to a variable to create the data frame:

  structure(list(allnumeric = c(10, 20, 30, 40),
      char_or_factor = structure(c(2L, 3L, 3L, 1L),
          .Label = c("bird", "cat", "dog"), class = "factor"),
      num_with_null = c(10, 100, NA, NA)),
      .Names = c("allnumeric", "char_or_factor", "num_with_null"),
      row.names = c(NA, -4L), class = "data.frame")

Expected solution data frame (copy and assign to a variable):

  structure(list(
      allnumeric = structure(c(3L, 2L, 1L),
          .Label = c("0", "25", "numeric"), class = "factor"),
      char_or_factor = structure(c(2L, NA, 1L),
          .Label = c("0", "character"), class = "factor"),
      num_with_null = structure(c(3L, 2L, 1L),
          .Label = c("2", "55", "numeric"), class = "factor")),
      .Names = c("allnumeric", "char_or_factor", "num_with_null"),
      row.names = c("type", "mean", "num_nulls"), class = "data.frame")

Solution

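  • We can use sapply to loop over the columns, get each column's class, mean, and number of NA elements, concatenate them with c(), and convert the result to a data.frame. Because c() coerces its arguments to a common type, every cell in the result is a character string. Note that mean() on the factor column returns NA with a warning.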

    as.data.frame(sapply(df1, function(x) c(class(x), mean(x, na.rm=TRUE), 
                                  sum(is.na(x)))), stringsAsFactors=FALSE)
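
Since the question asks for a reusable function, the same idea can be wrapped up as in the sketch below. The name summarise_columns is hypothetical and not part of base R. The sketch assumes that only the first class of each column matters (date-time columns, for example, carry more than one class) and that non-numeric columns should get NA for the mean instead of triggering a warning. It reports the factor column as "factor", whereas the expected output in the question labels it "character".

```r
# Example data from the question, rebuilt with data.frame()
df1 <- data.frame(
  allnumeric     = c(10, 20, 30, 40),
  char_or_factor = factor(c("cat", "dog", "dog", "bird")),
  num_with_null  = c(10, 100, NA, NA)
)

# summarise_columns is a hypothetical helper name, not a base R function
summarise_columns <- function(df) {
  one_column <- function(x) {
    c(
      # Keep only the first class so every column yields exactly one value
      type = class(x)[1],
      # Compute the mean only for numeric columns; others get NA
      mean = if (is.numeric(x)) mean(x, na.rm = TRUE) else NA,
      # Count missing values in this column
      num_nulls = sum(is.na(x))
    )
  }
  # vapply checks that each column returns three character values
  out <- vapply(df, one_column, character(3))
  as.data.frame(out, stringsAsFactors = FALSE)
}

result <- summarise_columns(df1)
print(result)
```

Naming the elements inside c() gives the result the row names type, mean, and num_nulls, so individual cells can be looked up with, for example, result["mean", "num_with_null"].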