Search code examples
rkey-valuetidyrspread

tidyr spread sorting inconsistencies


Ive noticed tidyr(0.4.0) sorts value columns when using spread, where as tidyr(0.3.1) returns the value columns in the order they were prior to a gather.

Reproducible example 1:

library(dplyr)
library(tidyr)
# tidyr 0.3.1

dat<-data.frame(name=rep(c("A","B"),5),sam.id=rep(c(1,2),5),
      frac=sample(c(0.05,0.1,0.2),10,replace=TRUE),
      Aspecies=rnorm(10),Bspecies=rnorm(10),Zspecies=rnorm(10))

I aggregate the species values by both sam.id and frac(the proportion of the sample measured) i.e a multiple gather.

dt.agg.0.3.1 <- gather(dat,key,value,-name,-sam.id) %>% 
                group_by(name,key) %>% 
                summarise(Total=sum(value)) %>% spread(key,Total) %>%
                mutate(all=rowSums(.[,3:5]))

The last part of the pipe computes a simple total of all species using mutate. So that:

head(dt.agg.0.3.1)

Source: local data frame [2 x 6]

name  frac  Aspecies    Bspecies  Zspecies       all
(fctr) (dbl)     (dbl)       (dbl)     (dbl)     (dbl)
1      A  0.85 -2.675137 -0.03287276  1.016791 -1.858010
2      B  0.40  4.194904  1.50561762 -2.738543  6.100522

Reproducible example 2:

library(tidyr)
# 0.4.0

dt.agg.0.4.0 <- gather(dat,key,value,-name,-sam.id) %>%
                group_by(name,key) %>% 
                summarise(Total=sum(value)) %>% spread(key,Total)

head(dt.agg.0.4.0)

Source: local data frame [2 x 5]
Groups: name [2]

name  Aspecies    Bspecies  frac  Zspecies
(fctr)     (dbl)       (dbl) (dbl)     (dbl)
1      A -2.675137 -0.03287276  0.85  1.016791
2      B  4.194904  1.50561762  0.40 -2.738543

One can see how the ordering of the value columns has been changed (alphabetically), which makes additional data-pipeline steps using mutate for example troublesome.

dt.agg.0.4.0.mutated <- gather(dat,key,value,-name,-sam.id) %>%
                        group_by(name,key) %>% summarise(Total=sum(value)) %>%
                        spread(key,Total) %>% mutate(all=rowSums(.[,2:5]))

Throws the error;

Error: incompatible size (2), expecting 1 (the group size) or 1

Is there a way for tidyr(0.4.0) to spread back out in the order of the gather?

Or will have to gather (and summarise) twice -- once for each key-value pair?


Solution

  • We can use ungroup after the spread and grep the columns that needs to used for rowSums (tidyr_0.4.0).

    gather(dat, key, value, -name, sam.id) %>% 
               group_by(name, key) %>% 
               summarise(Total=sum(value)) %>%
               spread(key, Total) %>%
               ungroup() %>%
               mutate(all= rowSums(.[grep('species', names(.))]))
    #     name Aspecies   Bspecies  frac sam.id Zspecies      all
    #   (fctr)    (dbl)      (dbl) (dbl)  (dbl)    (dbl)    (dbl)
    #1      A 5.795958 -0.4769954   0.4      5 3.965114 9.284077
    #2      B 2.475395 -1.4858969   0.5     10 1.045175 2.034674
    

    If we need to get the order of the columns as that occurs in the 'key' column, then we may need to convert the 'key' to factor class with levels specified. In this case, we can use the position index in rowSums.

    gather(dat,key,value,-name,-sam.id) %>% 
          mutate(key= factor(key, levels=unique(key))) %>% 
          group_by(name, key) %>%
          summarise(Total = sum(value)) %>% 
          spread(key, Total) %>% 
          ungroup() %>%
          mutate(all = rowSums(.[3:5]))
    #Source: local data frame [2 x 6]
    
    #    name  frac Aspecies   Bspecies Zspecies      all
    #  (fctr) (dbl)    (dbl)      (dbl)    (dbl)    (dbl)
    #1      A   0.4 5.795958 -0.4769954 3.965114 9.284077
    #2      B   0.5 2.475395 -1.4858969 1.045175 2.034674
    

    If we look at the str after the spread step i.e.

    res <- gather(dat, key, value, -name, sam.id) %>% 
                       group_by(name, key) %>% 
                       summarise(Total=sum(value)) %>% 
                       spread(key, Total)
    
    str(res)
    #Classes ‘grouped_df’, ‘tbl_df’, ‘tbl’ and 'data.frame': 2 obs. of  6 variables:
    #...
    

    One of the class is 'grouped_df', which somehow creates the problem.

    str(res %>% ungroup)  
    #Classes ‘tbl_df’, ‘tbl’ and 'data.frame':       2 obs. of  6 variables:
    

    NOTE: The values are different from the OP's post as no set.seed was specified.