What causes regularize.values() warnings when specifying x-axis labelling rules?


I am plotting several splines for daily data across several years.

I think the spline might be part of the problem due to what it does in the background to create the plotted values.

Everything works (almost) fine with the plotting code that ChatGPT and I came up with.

ggplot(filtered_data, aes(x = DayOfYear, y = Avg_Q, color = as.numeric(Year), group = Year)) +
  geom_line(data = filtered_data %>%
              group_by(Year) %>%
              summarise(x1 = list(spline(Month, Avg_Q, n = 50, method = "natural")[["x"]]),
                        y1 = list(spline(Month, Avg_Q, n = 50, method = "natural")[["y"]])) %>%
              tidyr::unnest(cols = c(x1, y1)),
            aes(x = x1, y = y1), size = 1.1) +
  scale_color_gradientn(colors = c("#1f78b4", "#33a02c", "#fdbf6f", "#ff7f00", "#e31a1c"), 
                        values = seq(0, 1, by = 0.2),  # Adjust the values for distribution
                        guide = "colorbar") +
  labs(subtitle = "Monthly Average of Flow Data", 
       y = "Flow", 
       title = "8-years moving mash flows") +
  theme_minimal()

[Image: plot output]

(I say almost because at least the legend shouldn't look like that, since the color scale is based on year values ranging from 1928 to 2019.)

However, when I try to add a specification for putting month labels at specified locations in my x-axis:

scale_x_continuous(breaks = c(15, 45, 75, 105, 135, 165, 195, 225, 255, 285, 315, 345),
                     labels = c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"))

I get several warnings like this (186 of them, which is twice the number of years being plotted; I think that might be relevant):

<warning/rlang_warning>
Warning in summarise():
ℹ In argument: x1 = list(spline(Month, Avg_Q, n = 50, method = "natural")[["x"]]).
ℹ In group 1: Year = 1927.
Caused by warning in regularize.values():
! collapsing to unique 'x' values

Backtrace: ▆

  1. ├─ggplot2::geom_line(...)
  2. │ └─ggplot2::layer(...)
  3. │ └─ggplot2::fortify(data)
  4. ├─... %>% tidyr::unnest(cols = c(x1, y1))
  5. ├─tidyr::unnest(., cols = c(x1, y1))
  6. ├─dplyr::summarise(...)
  7. └─dplyr:::summarise.grouped_df(...)

I am not proficient enough in R to understand what's going on, and ChatGPT doesn't spot the issue either. Any help understanding or solving this would be greatly appreciated!

By the way, the filtered_data data frame is a bit heavy and I'm really unsure how to share it here. I will gladly follow any instructions on this!


Solution

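  • If you really want to use splines on monthly data, you first need to summarize by year and month. What you are getting is a warning rather than an error: spline() expects a single y value for each unique x value, but you have approximately 30 Avg_Q values for every month (one for each day). The warning tells you that spline() has collapsed those tied x values into one point per month; with the default ties = mean it does this by averaging the y values, so a curve is still drawn, but the averaging happens implicitly. You get two warnings per year because spline() is called twice in each group, once for x1 and once for y1.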
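
    The warning can be reproduced without the OP's data. The following is a minimal sketch using hypothetical toy vectors (x, y and the result names are made up for illustration); it relies on the documented default ties = mean of spline():

    ```r
    # Hypothetical toy data: three "daily" values for each of four months
    x <- rep(1:4, each = 3)
    y <- c(1, 2, 3,  4, 5, 6,  7, 8, 9,  10, 11, 12)

    # Duplicated x values: spline() warns "collapsing to unique 'x' values".
    # The handler prints the warning text and then suppresses it.
    res_dup <- withCallingHandlers(
      spline(x, y, n = 10, method = "natural"),
      warning = function(w) {
        message("caught: ", conditionMessage(w))
        invokeRestart("muffleWarning")
      }
    )

    # Averaging per month first leaves one y per x, so no warning is raised
    monthly <- tapply(y, x, mean)
    res_avg <- spline(as.numeric(names(monthly)), as.numeric(monthly),
                      n = 10, method = "natural")

    # With the default ties = mean, both routes describe the same curve
    all.equal(res_dup$y, res_avg$y)
    ```

    Summarising explicitly before calling spline() therefore should not change the curve; it only makes the averaging step visible and silences the warning.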

    In any case, I think you're probably making this more complex than you need to. There are a couple of things that you need to fix:

    1. It's obvious from your color scale that your years are not properly represented. This seems to be because you have converted Year with as.numeric, but Year started out as a factor, so the numbers go from 1 to 91 instead of 1930 to 2020. You probably need as.numeric(as.character(Year)).
    2. You are calculating the splines based on Month, which is a number 1:12, but DayOfYear, which you are assigning to the x axis, is a number 1:366. You are working on numeric months in your geom_line layer, not days of the year. This seems to be causing you some confusion when you come to label your x axis.
    3. What you are doing with splines on monthly data in a geom_line layer might be better done using daily data in a geom_smooth layer using gam with smoothing splines. The code would be less complex and the curves more temporally accurate than using splines to join monthly averages.
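
    Point 1 can be checked in isolation. A small sketch with a hypothetical three-level Year factor (not the OP's data):

    ```r
    # Hypothetical Year column stored as a factor
    Year <- factor(c("1928", "1929", "2019"))

    # as.numeric() on a factor returns the internal level codes
    codes <- as.numeric(Year)                  # 1 2 3

    # Going through as.character() recovers the labels as numbers
    years <- as.numeric(as.character(Year))    # 1928 1929 2019
    ```

    If is.factor(filtered_data$Year) returns TRUE, this is the likely reason the legend does not show calendar years.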

    Using geom_smooth, we would get something like this:

    library(tidyverse)
    
    # Offset from 31 Dec so that DayOfYear = 1 falls on 1 Jan (the year is only a placeholder)
    ggplot(filtered_data, aes(x = as.Date('2022-12-31') + DayOfYear, y = Avg_Q, 
                              color = as.numeric(as.character(Year)), 
                              group = Year)) +
      geom_smooth(method = 'gam', se = FALSE) +
      scale_color_gradientn('Year',
                            colors = c("#1f78b4", "#33a02c", "#fdbf6f", 
                                       "#ff7f00", "#e31a1c"),
                            values = seq(0, 1, by = 0.2),  
                            guide = "colorbar") +
      scale_x_date(date_breaks = 'month', date_labels = '%b') +
      labs(subtitle = "Monthly Average of Flow Data", 
           x = 'Month',
           y = "Flow", 
           title = "8-years moving mash flows") +
      theme_minimal()
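
    If you would rather keep a numeric DayOfYear axis than convert to dates, the month breaks can be computed instead of typed by hand. This is only a sketch: it assumes a non-leap year, and the names days_in_month, month_start and month_mid are made up for illustration.

    ```r
    # Days per month in a non-leap year (assumption; leap years shift by one after Feb)
    days_in_month <- c(31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31)

    # Day of year on which each month starts: 1, 32, 60, ...
    month_start <- cumsum(c(1, head(days_in_month, -1)))

    # Approximate middle of each month, a natural place to centre the label
    month_mid <- month_start + (days_in_month - 1) / 2

    # These could then be passed to the scale, for example:
    # scale_x_continuous(breaks = month_mid, labels = month.abb)
    ```

    This only makes sense when the layer's x values are days of the year; with the spline computed on Month (1 to 12), breaks in the range 1 to 366 will not line up with the curves.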
    


    If you really want to use monthly splines without the warnings, you could do:

    ggplot(filtered_data, aes(x = Month, y = Avg_Q, 
                              color = as.numeric(as.character(Year)), 
                              group = Year)) +
      geom_line(data = filtered_data %>%
                  group_by(Year, Month) %>%
                  summarise(Avg_Q = mean(Avg_Q, na.rm = TRUE)) %>%
                  ungroup() %>%
                  group_by(Year) %>%
                  summarise(x1 = list(spline(Month, Avg_Q, n = 50, 
                                             method = "natural")[["x"]]),
                            y1 = list(spline(Month, Avg_Q, n = 50, 
                                             method = "natural")[["y"]])) %>%
                  tidyr::unnest(cols = c(x1, y1)),
                aes(x = x1, y = y1), size = 1.1) +
      scale_color_gradientn('Year',
                            colors = c("#1f78b4", "#33a02c", "#fdbf6f",
                                       "#ff7f00", "#e31a1c"), 
                            values = seq(0, 1, by = 0.2), 
                            guide = "colorbar") +
      scale_x_continuous(breaks = 1:12, labels = month.abb) +
      labs(subtitle = "Monthly Average of Flow Data", 
           y = "Flow", 
           title = "8-years moving mash flows") +
      theme_minimal() 
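
    The spline data can also be prepared once, outside the plot, so that spline() is called a single time per year rather than separately for x and y. A hedged sketch, assuming dplyr 1.1.0 or later (for reframe()); the toy data frame stands in for filtered_data and is entirely made up:

    ```r
    library(dplyr)

    # Hypothetical stand-in for filtered_data: two years of daily values
    set.seed(1)
    dates <- seq(as.Date("2001-01-01"), by = "day", length.out = 365)
    toy <- data.frame(
      Year  = factor(rep(c(2001, 2002), each = 365)),
      Month = rep(as.integer(format(dates, "%m")), times = 2),
      Avg_Q = rnorm(730, mean = 10)
    )

    splined <- toy %>%
      # One mean per Year and Month, so each x has exactly one y
      group_by(Year, Month) %>%
      summarise(Avg_Q = mean(Avg_Q, na.rm = TRUE), .groups = "drop") %>%
      # One spline() call per Year; its x and y become columns
      group_by(Year) %>%
      reframe(as.data.frame(spline(Month, Avg_Q, n = 50, method = "natural")))
    ```

    The result has columns Year, x and y, so it could be passed to the layer as geom_line(data = splined, aes(x = x, y = y)).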
    



    Data used

    There was no reproducible data in the question. The data set used here was daily temperature data from the Lerwick weather station from 1973 onwards, which is publicly available. The columns were calculated and renamed to match the OP's data set.