Search code examples
rdplyrknitrdata-cleaningdcast

summarizing a dataframe and adding column with mean ± SD


I have a dataframe of many variables (soil properties) for 11 legumes in 2 different locations. first few columns of the data is shown below.

 SPECIES   LOCATION   pH   NO3  NH4    P Organic_C      K   Cu    Mn   Zn   BD X.Sand X.Silt X.Clay
1    C. comosa    Gauteng 5.40  8.24 1.35 1.10      0.95  94.40 3.36 84.40 4.72 1.45   68.0     12      9
2    C. comosa    Gauteng 5.25  8.36 1.37 1.20      0.99  94.87 3.39 84.87 4.77 1.36   76.0     16     13
3    C. comosa    Gauteng 5.55  8.19 1.32 1.11      0.94  94.01 3.35 84.01 4.68 1.54   78.0     14     14
4    C. comosa Mpumalanga 5.84  4.05 3.46 3.04      1.55 130.40 0.28 25.43 2.00 1.66   73.6      9     10
5    C. comosa Mpumalanga 5.49  4.45 3.48 3.09      1.53 131.36 0.27 25.35 2.12 1.45   76.5     11     16
6    C. comosa Mpumalanga 6.19  4.43 3.44 3.04      1.58 129.95 0.29 25.45 2.14 1.87   74.9     13     16
7   C. distans    Gauteng 5.48  8.88 1.96 3.33      0.99 130.24 0.99 40.01 3.94 1.55   70.0      8     11
8   C. distans    Gauteng 5.29  8.54 1.99 3.28      0.99 130.28 0.95 40.25 3.89 1.48   79.0     12     15
9   C. distans    Gauteng 5.67  8.63 1.93 3.39      1.02 130.30 0.98 40.12 3.97 1.62   79.0     10     16
10  C. distans Mpumalanga 5.61  6.02 2.65 4.45      2.58 163.25 1.79 53.11 6.11 1.68   72.0      8     10
11  C. distans Mpumalanga 5.43  6.58 2.55 4.49      2.59 163.55 1.78 52.89 6.04 1.63   78.0     15     14
12  C. distans Mpumalanga 5.79  6.24 2.59 4.41      2.59 163.27 1.75 53.03 6.19 1.73   75.0     16     12
13 E. cordatum    Gauteng 4.38 16.29 5.76 4.77      3.25 175.38 1.11 35.87 8.54 1.53   33.0      9     40
14 E. cordatum    Gauteng 4.05 16.15 5.63 4.73      3.29 175.90 1.23 34.34 8.61 1.42   45.0     13     50
15 E. cordatum    Gauteng 4.71 15.89 5.99 4.80      3.25 174.54 1.19 36.44 8.58 1.64   42.0     14     54

for each location, i want to summarize the data such that the soil properties are in the first column while the legumes are the other columns and each value reported as the mean ± SD. something like the table below.

enter image description here

i am thinking dcast but i am not sure how to get the values as mean ± SD


Solution

  • In Base R this gives you the exact output, But you lose the ability to (easily) extract the numbers afterwards. There are better ways of storing the data if you want to do more with it afterwards.

    as.data.frame(t(aggregate(. ~ LOCATION, df[,-1], function(x) paste(round(mean(x),2),"±", round(sd(x),2)))))
    
                          V1             V2
    LOCATION         Gauteng     Mpumalanga
    pH           5.09 ± 0.57    5.72 ± 0.28
    NO3         11.02 ± 3.83     5.29 ± 1.1
    NH4          3.03 ± 2.09    3.03 ± 0.47
    P            3.08 ± 1.58    3.75 ± 0.76
    Organic_C    1.74 ± 1.14    2.07 ± 0.57
    K         133.32 ± 35.08 146.96 ± 17.96
    Cu           1.84 ± 1.15    1.03 ± 0.82
    Mn         53.37 ± 23.39  39.21 ± 15.12
    Zn           5.74 ± 2.15     4.1 ± 2.21
    BD           1.51 ± 0.09    1.67 ± 0.14
    X.Sand     63.33 ± 18.18      75 ± 2.11
    X.Silt          12 ± 2.6      12 ± 3.22
    X.Clay     24.67 ± 17.99      13 ± 2.76
    

    Edit:

    If you want the results for all species/locations you can use

    as.data.frame(t(aggregate(. ~ SPECIES + LOCATION, df, function(x) paste(round(mean(x),2),"±", round(sd(x),2)))))
    
                        V1            V2            V3            V4            V5
    SPECIES      C. comosa    C. distans   E. cordatum     C. comosa    C. distans
    LOCATION       Gauteng       Gauteng       Gauteng    Mpumalanga    Mpumalanga
    pH          5.4 ± 0.15   5.48 ± 0.19   4.38 ± 0.33   5.84 ± 0.35   5.61 ± 0.18
    NO3        8.26 ± 0.09   8.68 ± 0.18   16.11 ± 0.2   4.31 ± 0.23   6.28 ± 0.28
    NH4        1.35 ± 0.03   1.96 ± 0.03   5.79 ± 0.18   3.46 ± 0.02    2.6 ± 0.05
    P          1.14 ± 0.06   3.33 ± 0.06   4.77 ± 0.04   3.06 ± 0.03   4.45 ± 0.04
    Organic_C  0.96 ± 0.03      1 ± 0.02   3.26 ± 0.02   1.55 ± 0.03   2.59 ± 0.01
    K         94.43 ± 0.43 130.27 ± 0.03 175.27 ± 0.69 130.57 ± 0.72 163.36 ± 0.17
    Cu         3.37 ± 0.02   0.97 ± 0.02   1.18 ± 0.06   0.28 ± 0.01   1.77 ± 0.02
    Mn        84.43 ± 0.43  40.13 ± 0.12  35.55 ± 1.09  25.41 ± 0.05  53.01 ± 0.11
    Zn         4.72 ± 0.05   3.93 ± 0.04   8.58 ± 0.04   2.09 ± 0.08   6.11 ± 0.08
    BD         1.45 ± 0.09   1.55 ± 0.07   1.53 ± 0.11   1.66 ± 0.21   1.68 ± 0.05
    X.Sand       74 ± 5.29      76 ± 5.2     40 ± 6.24     75 ± 1.45        75 ± 3
    X.Silt          14 ± 2        10 ± 2     12 ± 2.65        11 ± 2     13 ± 4.36
    X.Clay       12 ± 2.65     14 ± 2.65     48 ± 7.21     14 ± 3.46        12 ± 2
    

    And if you really only want one location you can use.

    as.data.frame(t(aggregate(. ~ SPECIES,df[df$LOCATION == "Gauteng",-2], function(x) paste(round(mean(x),2),"±", round(sd(x),2)))))
    
                        V1            V2            V3
    SPECIES      C. comosa    C. distans   E. cordatum
    pH          5.4 ± 0.15   5.48 ± 0.19   4.38 ± 0.33
    NO3        8.26 ± 0.09   8.68 ± 0.18   16.11 ± 0.2
    NH4        1.35 ± 0.03   1.96 ± 0.03   5.79 ± 0.18
    P          1.14 ± 0.06   3.33 ± 0.06   4.77 ± 0.04
    Organic_C  0.96 ± 0.03      1 ± 0.02   3.26 ± 0.02
    K         94.43 ± 0.43 130.27 ± 0.03 175.27 ± 0.69
    Cu         3.37 ± 0.02   0.97 ± 0.02   1.18 ± 0.06
    Mn        84.43 ± 0.43  40.13 ± 0.12  35.55 ± 1.09
    Zn         4.72 ± 0.05   3.93 ± 0.04   8.58 ± 0.04
    BD         1.45 ± 0.09   1.55 ± 0.07   1.53 ± 0.11
    X.Sand       74 ± 5.29      76 ± 5.2     40 ± 6.24
    X.Silt          14 ± 2        10 ± 2     12 ± 2.65
    X.Clay       12 ± 2.65     14 ± 2.65     48 ± 7.21