Ive noticed tidyr(0.4.0)
sorts value columns when using spread
, where as tidyr(0.3.1)
returns the value columns in the order they were prior to a gather
.
Reproducible example 1:
library(dplyr)
library(tidyr)
# tidyr 0.3.1
dat<-data.frame(name=rep(c("A","B"),5),sam.id=rep(c(1,2),5),
frac=sample(c(0.05,0.1,0.2),10,replace=TRUE),
Aspecies=rnorm(10),Bspecies=rnorm(10),Zspecies=rnorm(10))
I aggregate the species values by both sam.id
and frac
(the proportion of the sample measured) i.e a multiple gather
.
dt.agg.0.3.1 <- gather(dat,key,value,-name,-sam.id) %>%
group_by(name,key) %>%
summarise(Total=sum(value)) %>% spread(key,Total) %>%
mutate(all=rowSums(.[,3:5]))
The last part of the pipe computes a simple total of all species using mutate
. So that:
head(dt.agg.0.3.1)
Source: local data frame [2 x 6]
name frac Aspecies Bspecies Zspecies all
(fctr) (dbl) (dbl) (dbl) (dbl) (dbl)
1 A 0.85 -2.675137 -0.03287276 1.016791 -1.858010
2 B 0.40 4.194904 1.50561762 -2.738543 6.100522
Reproducible example 2:
library(tidyr)
# 0.4.0
dt.agg.0.4.0 <- gather(dat,key,value,-name,-sam.id) %>%
group_by(name,key) %>%
summarise(Total=sum(value)) %>% spread(key,Total)
head(dt.agg.0.4.0)
Source: local data frame [2 x 5]
Groups: name [2]
name Aspecies Bspecies frac Zspecies
(fctr) (dbl) (dbl) (dbl) (dbl)
1 A -2.675137 -0.03287276 0.85 1.016791
2 B 4.194904 1.50561762 0.40 -2.738543
One can see how the ordering of the value columns has been changed (alphabetically), which makes additional data-pipeline steps using mutate
for example troublesome.
dt.agg.0.4.0.mutated <- gather(dat,key,value,-name,-sam.id) %>%
group_by(name,key) %>% summarise(Total=sum(value)) %>%
spread(key,Total) %>% mutate(all=rowSums(.[,2:5]))
Throws the error;
Error: incompatible size (2), expecting 1 (the group size) or 1
Is there a way for tidyr(0.4.0)
to spread
back out in the order of the gather
?
Or will have to gather
(and summarise
) twice -- once for each key-value pair?
We can use ungroup
after the spread
and grep
the columns that needs to used for rowSums
(tidyr_0.4.0
).
gather(dat, key, value, -name, sam.id) %>%
group_by(name, key) %>%
summarise(Total=sum(value)) %>%
spread(key, Total) %>%
ungroup() %>%
mutate(all= rowSums(.[grep('species', names(.))]))
# name Aspecies Bspecies frac sam.id Zspecies all
# (fctr) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl)
#1 A 5.795958 -0.4769954 0.4 5 3.965114 9.284077
#2 B 2.475395 -1.4858969 0.5 10 1.045175 2.034674
If we need to get the order of the columns as that occurs in the 'key' column, then we may need to convert the 'key' to factor
class with levels
specified. In this case, we can use the position index in rowSums
.
gather(dat,key,value,-name,-sam.id) %>%
mutate(key= factor(key, levels=unique(key))) %>%
group_by(name, key) %>%
summarise(Total = sum(value)) %>%
spread(key, Total) %>%
ungroup() %>%
mutate(all = rowSums(.[3:5]))
#Source: local data frame [2 x 6]
# name frac Aspecies Bspecies Zspecies all
# (fctr) (dbl) (dbl) (dbl) (dbl) (dbl)
#1 A 0.4 5.795958 -0.4769954 3.965114 9.284077
#2 B 0.5 2.475395 -1.4858969 1.045175 2.034674
If we look at the str
after the spread
step i.e.
res <- gather(dat, key, value, -name, sam.id) %>%
group_by(name, key) %>%
summarise(Total=sum(value)) %>%
spread(key, Total)
str(res)
#Classes ‘grouped_df’, ‘tbl_df’, ‘tbl’ and 'data.frame': 2 obs. of 6 variables:
#...
One of the class
is 'grouped_df', which somehow creates the problem.
str(res %>% ungroup)
#Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 2 obs. of 6 variables:
NOTE: The values are different from the OP's post as no set.seed
was specified.