Search code examples
rdataframedata-manipulation

How can I create a new wide data frame with rows based on all combos of values in two columns?


I have the following dataframe (dput is provided at the bottom of the question):

>df_input
# A tibble: 5 x 4
  category range        samples events
  <chr>    <chr>          <dbl>  <dbl>
1 GroupA   Apr2002         4951  97796
2 GroupA   May2002         9332 195726
3 GroupB   Apr2001         4781  80767
4 GroupB   Oct2001         5677  92890
5 GroupB   OctToNov2001   10296 166037

I would like to create a new dataframe with rows that are a combination of each unique combination of both the category and range columns. For example, category = GroupA and range = Apr2002 would have 3 rows in the output dataframe for each of the three category = Group B rows.

The range column in the input dataframe will always have unique values only.

I would also like to rename the combined output columns for events, samples and range to include the Group names (i.e. range_GroupA, range_GroupB, samples_GroupA, events_GroupA, samples_GroupB, events_GroupB)

I'm struggling with how to create my combined rows from the category column. I'm also struggling to find the right search terms here to find similar questions/answers. The closest I've managed to find so far is Create new rows in data frame based on multiple values of column, but the combo in that question is a bit different that what I'm attempting.

The desired output dataframe is:

> df_output
# A tibble: 6 x 6
  range_GroupA range_GroupB samples_GroupA events_GroupA samples_GroupB events_GroupB
  <chr>        <chr>                 <dbl>         <dbl>          <dbl>         <dbl>
1 Apr2002      Apr2001                4951         97796           4781         80767
2 Apr2002      Oct2001                4951         97796           5677         92890
3 Apr2002      OctToNov2001           4951         97796          10296        166037
4 May2002      Apr2001                9332        195726           4781         80767
5 May2022      Oct2001                9332        195726           5677         92890
6 May2022      OctToNov2001           9332        195726          10296        166037

df_input dataframe:

df_input <- structure(list(category = c("GroupA", "GroupA", "GroupB", "GroupB", 
                                        "GroupB"), range = c("Apr2002", "May2002", "Apr2001", "Oct2001", 
                                                             "OctToNov2001"), samples = c(4951, 9332, 4781, 5677, 10296), 
                           events = c(97796, 195726, 80767, 92890, 166037)), row.names = c(NA, 
                                                                                           -5L), class = c("tbl_df", "tbl", "data.frame"))

df_output dataframe

df_output <- structure(list(range_GroupA = c("Apr2002", "Apr2002", "Apr2002", 
                                             "May2002", "May2022", "May2022"), range_GroupB = c("Apr2001", 
                                                                                                "Oct2001", "OctToNov2001", "Apr2001", "Oct2001", "OctToNov2001"
                                             ), samples_GroupA = c(4951, 4951, 4951, 9332, 9332, 9332), events_GroupA = c(97796, 
                                                                                                                          97796, 97796, 195726, 195726, 195726), samples_GroupB = c(4781, 
                                                                                                                                                                                    5677, 10296, 4781, 5677, 10296), events_GroupB = c(80767, 92890, 
                                                                                                                                                                                                                                       166037, 80767, 92890, 166037)), row.names = c(NA, -6L), class = c("tbl_df", 
                                                                                                                                                                                                                                                                                                         "tbl", "data.frame"))

Solution

  • I think we can get your result with a filtered cartesian join:

    library(dplyr)
    left_join(
      df_input %>% mutate(dummy = 1),
      df_input %>% mutate(dummy = 1), by = "dummy") %>%
      filter(category.x < category.y)
    

    You'll recognize all the numbers you're looking for, but with different header names. We can rename them manually, but that's no fun. See below for renamed version.

    # A tibble: 6 × 9
      category.x range.x samples.x events.x dummy category.y range.y      samples.y events.y
      <chr>      <chr>       <dbl>    <dbl> <dbl> <chr>      <chr>            <dbl>    <dbl>
    1 GroupA     Apr2002      4951    97796     1 GroupB     Apr2001           4781    80767
    2 GroupA     Apr2002      4951    97796     1 GroupB     Oct2001           5677    92890
    3 GroupA     Apr2002      4951    97796     1 GroupB     OctToNov2001     10296   166037
    4 GroupA     May2002      9332   195726     1 GroupB     Apr2001           4781    80767
    5 GroupA     May2002      9332   195726     1 GroupB     Oct2001           5677    92890
    6 GroupA     May2002      9332   195726     1 GroupB     OctToNov2001     10296   166037
    

    EDIT: This seems to do it with the renaming:

    left_join(
      df_input %>% rename_with(~paste0(.,"_GroupA")) %>% mutate(dummy = 1),
      df_input %>% rename_with(~paste0(.,"_GroupB")) %>% mutate(dummy = 1), 
      by = "dummy") %>%
      filter(category_GroupA < category_GroupB) %>%
      select(-category_GroupA, -dummy, -category_GroupB) %>%
      relocate(range_GroupB, .after = 1)
    
    
    # A tibble: 6 × 6
      range_GroupA range_GroupB samples_GroupA events_GroupA samples_GroupB events_GroupB
      <chr>        <chr>                 <dbl>         <dbl>          <dbl>         <dbl>
    1 Apr2002      Apr2001                4951         97796           4781         80767
    2 Apr2002      Oct2001                4951         97796           5677         92890
    3 Apr2002      OctToNov2001           4951         97796          10296        166037
    4 May2002      Apr2001                9332        195726           4781         80767
    5 May2002      Oct2001                9332        195726           5677         92890
    6 May2002      OctToNov2001           9332        195726          10296        166037