Search code examples
rtidyversespread

Spread a data.frame with repetitive column


I have a large data.frame that I am trying to spread. A toy example looks like this.

data = data.frame(date = rep(c("2019", "2020"), 2), ticker = c("SPY", "SPY", "MSFT", "MSFT"), value = c(1, 2, 3, 4))

head(data)

 date ticker value
1 2019    SPY     1
2 2020    SPY     2
3 2019   MSFT     3
4 2020   MSFT     4

I would like to spread it so the data.frame looks like this.

spread(data, key = ticker, value = value)
  date MSFT SPY
1 2019    3   1
2 2020    4   2

However, when I do this on my actual data.frame, I get an error.

Error: Each row of output must be identified by a unique combination of keys.
Keys are shared for 18204 rows:
* 30341, 166871
* 30342, 166872
* 30343, 166873
* 30344, 166874
* 30345, 166875
* 30346, 166876
* 30347, 166877
* 30348, 166878
* 30349, 166879
* 30350, 166880
* 30351, 166881
* 30352, 166882

Below is a head and tail of my data.frame

head(df)
ref.date   ticker weeklyReturn
  <date>     <chr>         <dbl>
1 2008-02-01 SPY         NA     
2 2008-02-04 SPY         NA     
3 2008-02-05 SPY         NA     
4 2008-02-06 SPY         NA     
5 2008-02-07 SPY         NA     
6 2008-02-08 SPY         -0.0478

tail(df)
ref.date   ticker weeklyReturn
  <date>     <chr>         <dbl>
1 2020-02-12 MDYV        0.00293
2 2020-02-13 MDYV        0.00917
3 2020-02-14 MDYV        0.0179 
4 2020-02-18 MDYV        0.0107 
5 2020-02-19 MDYV        0.00422
6 2020-02-20 MDYV        0.00347

Solution

  • You can use dplyr and tidyr packages. To get rid of that error, you would have to firstly sum the values for each group.

    data %>%
      group_by(date, ticker) %>%
      summarise(value = sum(value)) %>%
      pivot_wider(names_from = ticker, values_from = value)
    
    # date  MSFT  SPY
    # <fct> <dbl> <dbl>
    #  1 2019  3     1
    #  2 2020  4     2