
Is there an elegant way to handle changing number of rows within tidyverse?


In the tidyverse, some verbs restrict how many rows an operation may return. Most prominently, mutate() expects every new column to have the same number of rows as the original data set (or length 1). For example, if we want the density values of a variable x, we could try:

library(magrittr)
df %>%
 dplyr::mutate(dx= density(x)$x,
               dy= density(x)$y)

This fails with an error along the lines of "Caused by error: ! dx must be size 100 or 1, not 512.".
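
The size mismatch in that message can be checked directly in base R. This is only a small sketch using the data from this question; the names d, n_rows and n_dens are made up for illustration. density() evaluates the estimate on a grid of n = 512 points by default, regardless of how many observations go in:

```r
# Same data as in the question
set.seed(1)
x <- sample(1:999, 100); y <- .5*x + rnorm(100)
df <- data.frame(x, y)

d <- density(df$x)      # default grid: n = 512 evaluation points

n_rows <- nrow(df)      # number of rows mutate() expects
n_dens <- length(d$x)   # number of values density() returns
print(c(rows = n_rows, density_points = n_dens))
```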

But in many situations the number of rows changes during data processing. Is there an elegant way to handle this in tidyverse code?

All I can come up with so far is wrapping the step where the number of rows changes in {}. See the following example, where I interpolate y on a regular grid of x values (which also changes the number of rows):

library(magrittr)
df %>%
# Some data processing where row number stays the same
  dplyr::mutate(x2= x*x,
                id= 1:dplyr::n()) %>%
# Row number changes! So I use code inside {}
  {time_interpolate_for <- seq(min(.$x), max(.$x), 1)
  data.frame(x= time_interpolate_for,
             y= approx(.$x, .$y, xout= time_interpolate_for)$y)

  } %>%
# Going on with the new data and processing it so that row number remains the same
  dplyr::mutate(xy_diff= x - y)

Is there a better way to do this?

Data used:

# Generate data
set.seed(1)
x <- sample(1:999, 100); y <- .5*x + rnorm(100)
df <- data.frame(x, y)

Solution

  • You can use summarise() or reframe() (now the recommended function) for such a task. But see the notes below:

    set.seed(1)
    x <- sample(1:999, 100); y <- .5*x + rnorm(100)
    df <- data.frame(x, y)
    
    library(magrittr)
    df %>%
      # Some data processing where row number stays the same
      dplyr::mutate(x2= x*x, id= 1:dplyr::n()) %>%
      dplyr::reframe(
        x.0 = seq(min(x), max(x), 1),
        y.0 = approx(x, y, xout= x.0)$y) %>%
      # Going on with the new data and processing it so that row number remains the same
      dplyr::mutate(xy_diff= x.0 - y.0) 
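
    The density example from the question can probably be handled the same way. The following is a minimal sketch, assuming dplyr >= 1.1.0; density_df is a hypothetical helper (not part of dplyr) that runs density() once and returns both vectors as a data frame, which reframe() unpacks into columns when the expression is unnamed. The dy_scaled column is only there to show that mutate() works again afterwards:

    ```r
    library(magrittr)

    set.seed(1)
    x <- sample(1:999, 100); y <- .5*x + rnorm(100)
    df <- data.frame(x, y)

    # Hypothetical helper: compute the density once and return
    # its grid (dx) and estimate (dy) as a two-column data frame
    density_df <- function(v) {
      d <- density(v)
      data.frame(dx = d$x, dy = d$y)
    }

    dens <- df %>%
      # Row number changes here: 100 rows in, one row per density grid point out
      dplyr::reframe(density_df(x)) %>%
      # Row number stays the same again, so mutate() is fine
      dplyr::mutate(dy_scaled = dy / max(dy))
    ```

    Computing the density once inside the helper also avoids calling density(x) twice, as the mutate() attempt in the question did.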
    
    

    NOTE

    1. summarise() also works, but since dplyr 1.1.0 it emits a deprecation warning when it returns more (or fewer) than one row per group, so keep that in mind:
    ...
      dplyr::summarise(
        x.0 = seq(min(x), max(x), 1),
        y.0 = approx(x, y, xout= x.0)$y) %>%
    ...
    

    Warning: Returning more (or less) than 1 row per summarise() group was deprecated in dplyr 1.1.0. ℹ Please use reframe() instead. ℹ When switching from summarise() to reframe(), remember that reframe() always returns an ungrouped data frame and adjust accordingly.

    2. I renamed the variables x and y to x.0 and y.0 inside summarise()/reframe() because the arguments are evaluated sequentially: once x is redefined, later expressions in the same call see the new x instead of the x from the previous step.
    3. From R 4.1 onward you can use the native pipe |> instead of the magrittr pipe %>%.
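
    The remark in the warning that reframe() always returns an ungrouped data frame matters once the interpolation is done per group. A minimal sketch, assuming dplyr >= 1.1.0; the grouping column g is hypothetical and added only for this illustration:

    ```r
    library(magrittr)

    set.seed(1)
    x <- sample(1:999, 100); y <- .5*x + rnorm(100)
    # g is a made-up grouping column, not part of the original data
    df <- data.frame(x, y, g = rep(c("a", "b"), each = 50))

    interp <- df %>%
      dplyr::group_by(g) %>%
      # One interpolation grid per group; the result is no longer grouped
      dplyr::reframe(
        x.0 = seq(min(x), max(x), 1),
        y.0 = approx(x, y, xout = x.0)$y)

    # Regroup explicitly before any further per-group step
    interp_id <- interp %>%
      dplyr::group_by(g) %>%
      dplyr::mutate(id = dplyr::row_number()) %>%
      dplyr::ungroup()
    ```

    With summarise() the result would normally keep part of the grouping, so code ported to reframe() may need an explicit group_by() like the one above.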