Search code examples
rtidyversedata-manipulationdata-transform

How to create new data rows for each unique value in a column (potentially grouped by other variables) in R


I have data that has the following structure:

data <- data.frame(
  uniqueid = c(1, 1, 2, 2, 3, 3),
  year = c(2010, 2011, 2010, 2011, 2010, 2011),
  agency = c("SZ", "SZ", "SZ", NA, "SZ", "HE"),
  switch = c(0, 0, 0, NA, 0, 1)
)

As you can see, the data is organized by uniqueids appearing in a given year. Keep in mind that for the agency column there can be 13 different unique strings appearing across different uniqueids. I would like the data to look like the following:

data <- data.frame(
  uniqueid = c(1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3),
  year = c(2010, 2010, 2011, 2011, 2010, 2010, 2011, 2011, 2011, 2010, 2010, 2011, 2011),
  agency = c("SZ", "HE", "SZ", "HE", "SZ", "HE", "SZ", NA, "HE", "SZ", "HE", "SZ", "HE"),
  switch = c(0, 0, 0, 0, 0, 0, NA, NA, NA, 0, 0, 0, 1)
)

In this transformation, the rows take on the different unique values of the agency variable, and the switch variable largely maps what it had previously looked like. I'm not really sure how to accomplish this in R, though I would prefer the solution to be within tidyverse. Thank you!

I've been trying something like the following but I don't seem to be getting exactly what I want:

data1 <- data %>%
      pivot_wider(names_from = agency, values_from = lead, names_prefix = "agency_", values_fill = "0") %>%
      gather(key = agency, value = lead, starts_with("agency_")) %>%
      arrange(uniqueid, year, agency)

Solution

  • Here's yet another way:

    Steps:

    1. complete the data (add a row for every variation of id, year, and just the "SZ" and "HE" agencies)
    2. If none of the agency values in each uniqueid and year combination are NA, then change the new row switch values in that group to be 0. Otherwise, keep them NA
    data |>
      complete(uniqueid, year, agency = c("SZ", "HE")) |>
      mutate(switch = ifelse(!any(is.na(agency)) & is.na(switch), 0, switch), .by = c(uniqueid, year))
    

    Output:

    # A tibble: 13 × 4
       uniqueid  year agency switch
          <dbl> <dbl> <chr>   <dbl>
     1        1  2010 HE          0
     2        1  2010 SZ          0
     3        1  2011 HE          0
     4        1  2011 SZ          0
     5        2  2010 HE          0
     6        2  2010 SZ          0
     7        2  2011 HE         NA
     8        2  2011 SZ         NA
     9        3  2010 HE          0
    10        3  2010 SZ          0
    11        3  2011 HE          1
    12        3  2011 SZ          0
    13        2  2011 NA         NA