Search code examples
rdplyrmeanmissing-data

How to include factor levels when they are missing in some replicas in R?


I have a dataframe df in which I have data about percentages (df$Percentage) of different habitats(df$Habitat) for specific tracks (df$Replica) in different sites (df$Site). As an example:

df <- data.frame(Site = c("A","A","A","A","A","B","B","B","B","B","B"),
                 Replica = c(1,1,1,2,2,1,1,1,2,2,2),
                 Habitat = c("X","Y","Z","X","Y","X","Y","M","X","M","Z"),
                 Porcentage = c(46,38,16,40,60,20,60,20,35,55,10))
df

   Site Replica Habitat Porcentage
1     A       1       X         46
2     A       1       Y         38
3     A       1       Z         16
4     A       2       X         40
5     A       2       Y         60
6     B       1       X         20
7     B       1       Y         60
8     B       1       M         20
9     B       2       X         35
10    B       2       M         55
11    B       2       Z         10

Here, for example, the percentage of habitat X at site A is 46, while there is no habitat M (which is present in site B). The sum of all values for a specific Replica within a site is 100. For example, in site A, for Replica == 2, habitat X and Y sum 100, which means there is no other habitat present in this replica/track.

I want to calculate both the mean percentage (Mean_Percentage) of each habitat per Site and the standard error (SE) of the mean. The mean is calculated using Replica, since for each Site I have repeated measures (= Replica).

df %>% group_by(Site,Habitat) %>% summarise(Mean_Porcentage = mean(Porcentage),SE = sd(Porcentage)/sqrt(length(Porcentage)))

# A tibble: 7 × 4
# Groups:   Site [2]
  Site  Habitat Mean_Porcentage    SE
  <chr> <chr>             <dbl> <dbl>
1 A     X                  43     3  
2 A     Y                  49    11  
3 A     Z                  16    NA  
4 B     M                  37.5  17.5
5 B     X                  27.5   7.5
6 B     Y                  60    NA  
7 B     Z                  10    NA 

The problem is that Mean_Porcentage and thus, SE are not calculated properly. For example, habitat Z is present in Replica == 1 (=16%) but not in Replica == 2 (=0%) in site A. Thus, the mean percentage (Mean_Porcentage) of Z in site A should be 8 (=[16%+0%]/2) and SE should be 8. Also, habitat M is not present in site A but it is in site B, so I want habitat M to appear for site A with a Mean_Percentage of 0.

My desired result should be a dataframe as this:

df2
  Site Habitat Mean_Porcentage     SE
1    A       M             0.0      0
2    A       X            43.0      3
3    A       Y            49.0     11
4    A       Z             8.0      8
5    B       M            37.5   17.5
6    B       X            27.5    7.5
7    B       Y            30.0     30
8    B       Z             5.0      5

Does anyone know how to do it?

Thanks in advance!


Solution

  • You can try xtabs and proportions.

    proportions(xtabs(Porcentage ~ Site + Habitat, df), 1) * 100
    #    Habitat
    #Site    M    X    Y    Z
    #   A  0.0 43.0 49.0  8.0
    #   B 37.5 27.5 30.0  5.0
    
    as.data.frame(proportions(xtabs(Porcentage ~ Site + Habitat, df), 1) * 100)
    #  Site Habitat Freq
    #1    A       M  0.0
    #2    B       M 37.5
    #3    A       X 43.0
    #4    B       X 27.5
    #5    A       Y 49.0
    #6    B       Y 30.0
    #7    A       Z  8.0
    #8    B       Z  5.0
    

    Adding the se as described in the question.

    x <- as.data.frame(proportions(xtabs(Porcentage ~ Site + Habitat, df), 1) * 100)
    x <- merge(x, aggregate(cbind(SE = Porcentage) ~ Site + Habitat, df, \(x) sd(x)/sqrt(length(x))), all.x=TRUE)
    i <- is.na(x$SE)
    x$SE[i] <- x$Freq[i]
    x
    #  Site Habitat Freq   SE
    #1    A       M  0.0  0.0
    #2    A       X 43.0  3.0
    #3    A       Y 49.0 11.0
    #4    A       Z  8.0  8.0
    #5    B       M 37.5 17.5
    #6    B       X 27.5  7.5
    #7    B       Y 30.0 30.0
    #8    B       Z  5.0  5.0