Search code examples
rgroup-bydplyrscalenormalize

scale() not compatible with grouped tbl_df


I was trying to normalise values in a grouped tbl_df and I was surprised to see it did not work. It does, however function as expected with dataframes and regular (i.e. ungrouped) tbl_dfs

Example:

> df <- data.frame(year = c(1999, 2002, 2005, 2008),
                   LA = c(3931.120, 4273.710, 4601.415, 4101.321), 
                   NY = c(346.82000, 134.30882, 130.43038, 88.27546))
> df
  year       LA        NY
1 1999 3931.120 346.82000
2 2002 4273.710 134.30882
3 2005 4601.415 130.43038
4 2008 4101.321  88.27546

These all work fine:

> df %>% mutate_each_(funs(scale), vars = c('LA', 'NY'))
      year         LA         NY
1 1999 -1.0334913  1.4757715
2 2002  0.1635942 -0.3490598
3 2005  1.3086682 -0.3823639
4 2008 -0.4387712 -0.7443478

> df_tbl <- tbl_df(df)
> df_tbl %>% mutate_each_(funs(scale), vars = c('LA', 'NY'))
Source: local data frame [4 x 3]

   year         LA         NY
  (dbl)      (dbl)      (dbl)
1  1999 -1.0334913  1.4757715
2  2002  0.1635942 -0.3490598
3  2005  1.3086682 -0.3823639
4  2008 -0.4387712 -0.7443478

But once groupped, the function fails:

> df.grouped <- df %>% group_by(year)
> df.grouped %>% mutate_each_(funs(scale), vars = c('LA', 'NY'))
Source: local data frame [4 x 3]
Groups: year [4]

   year    LA    NY
  (dbl) (dbl) (dbl)
1  1999    NA    NA
2  2002    NA    NA
3  2005   NaN   NaN
4  2008    NA    NA

df.grouped %>% mutate_each(funs(scale)) # Gives the same result

I've done some research and it is clear that tbl_df

never simplifies (drops), so always returns data.frame.

But it doesn't explain why an ungrouped tbl_df is fine but a grouped isn't, especially since the ?mutate_each mentions that

... vars: Variables to include/exclude in mutate/summarise. You can use same specifications as in select. If missing, defaults to all non-grouping variables.

QUESTION

  • Is the only way around the problem to add ungroup to the pipe like below?

    df.grouped %>% ungroup %>% mutate_each_(funs(scale), vars = c('LA', 'NY'))
    # OR
    df.grouped %>% ungroup %>% mutate_each(funs(scale))
    

Solution

  • As Alistaire has pointed out above, the reason why this didn't work was this:

    You can't scale a single value usefully, which is what you're trying to do when it's grouped.

    Thank you, Alistaire!