
R split-apply-combine problems


I'm new to R and have a large data.frame (906 rows). I want to split the data.frame by rows on the first column (so entries associated with the same name stay together) and then apply multiple descriptive statistics (mean, standard deviation, standard error/variance, 25% and 75% confidence intervals, min, max, and median) to the rest of the columns. The number of rows associated with each species is not the same, so the splits are uneven/unbalanced. There are lots of NAs scattered in the "par" columns (every row has at least 1 entry for the columns), but I just want to ignore/skip over the NAs, not delete/omit the row. Here's a picture of my initial data.frame (the column names are not the actual column names I'm using).

I want my final output to show: a column for the name, a column for the descriptive stat, and a column of the results of the descriptive statistic (one column for each par). I've included a picture of what I want the table output to look like, if it's possible (the values in the par columns aren't actually the calculated stats; I just put random stuff in to fill the frame). Everything I've tried so far hasn't worked. Again, I'm fairly new to R and not really sure what I'm doing. Please help.


Solution

  • Often you can find suitable data for your reproducible example by looking at what comes with R (data() will show a list of data sets and brief descriptions). For example, the iris data set is similar to yours except that the species name is the last column:

    data(iris)
    iris <- iris[, c(5, 1:4)]
    iris.splt <- split(iris[, 2:5], iris[, 1])
    

    Now we have loaded the data, moved the last column to the first position, and split the dataset by species into 3 data frames that are stored in a single list called iris.splt. The species name is the name of each part of the list and only the data are stored in the data frame for that list part. Now you need to write a function that computes the statistics you need. Here is an example based on the picture you provided, but you will probably need to change it:

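    stats <- function(x) {
        # Quantiles give min (0%), 25%, median (50%), 75% and max (100%)
        quant <- as.matrix(quantile(x, na.rm = TRUE))
        m <- mean(x, na.rm = TRUE)
        s <- sd(x, na.rm = TRUE)
        v <- var(x, na.rm = TRUE)
        # Named arguments become the row labels of the result
        rbind(quant, mean = m, sd = s, var = v)
    }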
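
    The question also asks for the standard error and wants NAs skipped rather than rows dropped. A minimal sketch of one way to extend the function, assuming the standard error of the mean is what is wanted (the name stats.se and the test vector x are hypothetical, not from the question):

    ```r
    # Hypothetical variant of stats() that adds the standard error of the mean.
    stats.se <- function(x) {
        n <- sum(!is.na(x))                      # count of non-missing values
        quant <- as.matrix(quantile(x, na.rm = TRUE))
        m <- mean(x, na.rm = TRUE)
        s <- sd(x, na.rm = TRUE)
        v <- var(x, na.rm = TRUE)
        se <- s / sqrt(n)                        # SE uses only the non-NA count
        rbind(quant, mean = m, sd = s, var = v, se = se)
    }

    # Made-up vector with one NA: the NA is skipped, nothing is deleted.
    x <- c(2, 4, NA, 6, 8)
    res <- stats.se(x)
    res
    ```

    Because every call uses na.rm=TRUE, a row with an NA in one column still contributes its values to the other columns.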
    

    This computes the statistics for a single column. To run the function on each column of each part of the list, we nest two lapply calls, and then use lapply a third time to combine the columns of each part back into a data frame:

    iris.stats <- lapply(iris.splt, function(x) lapply(x, stats))
    iris.dfs <- lapply(iris.stats, data.frame)
    iris.dfs
    # $setosa
    #      Sepal.Length Sepal.Width Petal.Length Petal.Width
    # 0%         4.3000      2.3000      1.00000     0.10000
    # 25%        4.8000      3.2000      1.40000     0.20000
    # 50%        5.0000      3.4000      1.50000     0.20000
    # 75%        5.2000      3.6750      1.57500     0.30000
    # 100%       5.8000      4.4000      1.90000     0.60000
    # mean       5.0060      3.4280      1.46200     0.24600
    # sd         0.3525      0.3791      0.17366     0.10539
    # var        0.1242      0.1437      0.03016     0.01111
    # 
    # $versicolor
    #      Sepal.Length Sepal.Width Petal.Length Petal.Width
    # 0%         4.9000     2.00000       3.0000     1.00000
    # 25%        5.6000     2.52500       4.0000     1.20000
    # 50%        5.9000     2.80000       4.3500     1.30000
    # 75%        6.3000     3.00000       4.6000     1.50000
    # 100%       7.0000     3.40000       5.1000     1.80000
    # mean       5.9360     2.77000       4.2600     1.32600
    # sd         0.5162     0.31380       0.4699     0.19775
    # var        0.2664     0.09847       0.2208     0.03911
    # 
    # $virginica
    #      Sepal.Length Sepal.Width Petal.Length Petal.Width
    # 0%         4.9000      2.2000       4.5000     1.40000
    # 25%        6.2250      2.8000       5.1000     1.80000
    # 50%        6.5000      3.0000       5.5500     2.00000
    # 75%        6.9000      3.1750       5.8750     2.30000
    # 100%       7.9000      3.8000       6.9000     2.50000
    # mean       6.5880      2.9740       5.5520     2.02600
    # sd         0.6359      0.3225       0.5519     0.27465
    # var        0.4043      0.1040       0.3046     0.07543
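
    Individual values can be pulled out of this list by species name, row label, and column name. A short self-contained sketch that rebuilds the list as above (the names mean.sl and medians are only for illustration):

    ```r
    data(iris)
    iris.splt <- split(iris[, 1:4], iris$Species)

    stats <- function(x) {
        quant <- as.matrix(quantile(x, na.rm = TRUE))
        rbind(quant,
              mean = mean(x, na.rm = TRUE),
              sd = sd(x, na.rm = TRUE),
              var = var(x, na.rm = TRUE))
    }

    iris.stats <- lapply(iris.splt, function(x) lapply(x, stats))
    iris.dfs <- lapply(iris.stats, data.frame)

    # One statistic for one species and one column
    mean.sl <- iris.dfs$setosa["mean", "Sepal.Length"]
    mean.sl
    # 5.006

    # The same statistic across all species: median petal width
    medians <- sapply(iris.dfs, function(d) d["50%", "Petal.Width"])
    medians
    ```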
    

    You will have to decide how you want to use this list or if you want to combine it back into a single data frame, but this should get you started.
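
    If you do want a single data frame with a column for the name and a column for the statistic, as described in the question, one possible approach is to add those two columns to each part and stack the parts with rbind. This is a sketch, not the only way to do it; the column names Species and stat and the object name combined are assumptions:

    ```r
    data(iris)
    iris.splt <- split(iris[, 1:4], iris$Species)

    stats <- function(x) {
        quant <- as.matrix(quantile(x, na.rm = TRUE))
        rbind(quant,
              mean = mean(x, na.rm = TRUE),
              sd = sd(x, na.rm = TRUE),
              var = var(x, na.rm = TRUE))
    }

    iris.stats <- lapply(iris.splt, function(x) lapply(x, stats))
    iris.dfs <- lapply(iris.stats, data.frame)

    # For each species, turn the row labels into a "stat" column,
    # add the species name, then stack the pieces.
    combined <- do.call(rbind, lapply(names(iris.dfs), function(sp) {
        d <- iris.dfs[[sp]]
        data.frame(Species = sp, stat = rownames(d), d, row.names = NULL)
    }))

    head(combined)
    ```

    Rows for one statistic can then be selected with, for example, combined[combined$stat == "mean", ].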