Tags: r, dplyr, tidyverse

Error: Why is dplyr not reading a factor level in my data frame when using the functions group_by() and summarise() in R


Issue:

I have a large data frame (388 x 729) and I am trying to calculate the mean of a numeric column called 'Daffodil_Bulbs' for each month (over 14 years). The 'Month' column is a factor.

I created a vector so that the months are output in the right order, but when I run my R code using the dplyr package, it does not recognise the month 'July' and replaces it with 'NA' (see the R code output below).

I've checked my data frame and there are no NAs or missing values.

Does anyone know how to fix this issue?

R-code:

#Create a vector so the months are in the right order 
month_levels = c('January', 'February', 'March', 'April', 'May', 'June', 'July',
                 'August', 'September', 'October', 'November', 'December')

#Use dplyr to group the data by month and find the average number of bulbs per month
library(dplyr)
Df_Average_Month <- MyDf %>%
  dplyr::mutate(Month = ordered(Month, levels = month_levels)) %>%
  group_by(Month) %>%
  summarise(Average_Daffodils = mean(Daffodil_Bulbs, na.rm = TRUE))

Output from the vector for month

> month_levels = c('January', 'February', 'March', 'April', 'May', 'June', 'July',
+                  'August', 'September', 'October', 'November', 'December')

Dataframe structure

$ Month                              : Factor w/ 18 levels "April","April ",..: 9 8 8 8 8 8 8 8 8 1 ...
$ Daffodil Bulbs                     : num  0 3 0 3 2 1 0 0 0 0 ...

R-code Output

# A tibble: 12 × 2
   Month     Average_Daffodils
   <ord>                  <dbl>
 1 January                11.4 
 2 February               11.3 
 3 March                  12.4 
 4 April                   8.67
 5 May                    12.6 
 6 June                   12.5 
 7 August                  9.67
 8 September              12.7 
 9 October                 9.92
10 November                9.19
11 December               10.8 
12 NA                     16.3 

Solution

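  • dplyr is not skipping anything: the NA is created by ordered() before the grouping happens. The structure output shows that Month has 18 levels, including both "April" and "April " (with a trailing space), so some month names are stored in more than one form. Any value that does not match an entry in month_levels exactly is converted to NA by ordered(), and all such values end up in the NA group. 'July' is most likely stored only in a non-matching form (for example with a trailing space), which is why it disappears from the result. Inspect levels(MyDf$Month) to confirm, then clean the values, for example with trimws(as.character(Month)), before calling ordered(). droplevels() will not help here, because the problem is mismatched labels rather than unused levels.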
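
The mismatch between the 18 levels of Month and the 12 entries of month_levels can be checked and repaired along the following lines. This is only a sketch: the small MyDf below is made-up stand-in data (not the asker's data frame), and it assumes the stray levels differ from the real month names only by leading or trailing whitespace. If levels(MyDf$Month) shows other variants (abbreviations, different capitalisation), those would need their own recoding step.

```r
library(dplyr)

# Hypothetical stand-in for MyDf: 'July' only ever appears with a trailing space,
# and 'April' appears both with and without one
MyDf <- data.frame(
  Month = factor(c("June", "July ", "July ", "August", "April", "April ")),
  Daffodil_Bulbs = c(2, 4, 6, 1, 3, 5)
)

# month.name is a base R constant holding "January" ... "December"
month_levels <- month.name

# Diagnose: which existing levels do not match month_levels exactly?
unmatched_levels <- setdiff(levels(MyDf$Month), month_levels)
print(unmatched_levels)

# Fix (assumes whitespace is the only problem): trim the labels
# before building the ordered factor, then group and summarise
Df_Average_Month <- MyDf %>%
  mutate(Month = ordered(trimws(as.character(Month)), levels = month_levels)) %>%
  group_by(Month) %>%
  summarise(Average_Daffodils = mean(Daffodil_Bulbs, na.rm = TRUE))

print(Df_Average_Month)
```

Running the setdiff() line against the real data frame first is the safer order of operations, since it shows exactly which labels would otherwise be turned into NA.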