Search code examples
rtranspose

Pivot_wider with Known Duplicates that Need to Stay


I have data with participants that have multiple rows assigned to their id and I am trying to get them all on one line. This is because the contact type column has their phones and emails on each line. There are multiple phones and emails without a well organized labels system. Some include people with multiples of "Email", but the emails are distinct. So you can see how when I try to pivot, pivot_wider causes an error bc two "Email" columns isn't correct. I either would love thoughts on a work around for this like code to automatically tell the function to relabel it or, more likely, r code to clean up the 2,000 or so rows that have this issue. I imagine it requires code that says if there is a second or third reoccurrence of a label for rows that have the same id code columns to add a 2 or 3 to the end of the label. Am I thinking of it correctly? Is there an easier way?

This is how it is set up. The code I have tried finds more duplicates than the 2,000 or so with the contact type issue like there being multiple Peters, but different Ids getting caught up in it. Even if you can't provide code. Am I thinking about it the right way? Is this the only way to solve this issue?

dat1 <- data.frame(Participant = c("Peter", "Peter", "Peter", "Peter", "Peter", "Peter",
                                   "Carol", "Carol", "Carol", "Carol", "Carol", "Carol"),
                   GroupID = c(1, 1, 2, 2, 3, 3, 3, 3, 4, 5, 5, 5),
                   Phone.Type = c("Email", "Email", "Email", "Phone", "Office", "Home","Home", "Home", "Other", "Website", "Fax", "Office Fax"), GroupTheme = c("[email protected]", "[email protected]", "[email protected]", 1234567890, 0987654321, 587290164389, 36237862873, 34384738923, 43432894274, "website.com", 32489789432, 23467832648937))

Solution

  • The tl;dr of this answer is that if your real data is anything like the sample data, it's most likely better to keep it in the longer format.

    To illustrate, here is what I would do to make the data wide:

    > dat1 |> pivot_wider(names_from = Phone.Type, values_from = GroupTheme)
    # A tibble: 6 × 10
      Participant GroupID Email     Phone     Office    Home   Other  Website Fax   
      <chr>         <dbl> <list>    <list>    <list>    <list> <list> <list>  <list>
    1 Peter             1 <chr [2]> <NULL>    <NULL>    <NULL> <NULL> <NULL>  <NULL>
    2 Peter             2 <chr [1]> <chr [1]> <NULL>    <NULL> <NULL> <NULL>  <NULL>
    3 Peter             3 <NULL>    <NULL>    <chr [1]> <chr>  <NULL> <NULL>  <NULL>
    4 Carol             3 <NULL>    <NULL>    <NULL>    <chr>  <NULL> <NULL>  <NULL>
    5 Carol             4 <NULL>    <NULL>    <NULL>    <NULL> <chr>  <NULL>  <NULL>
    6 Carol             5 <NULL>    <NULL>    <NULL>    <NULL> <NULL> <chr>   <chr> 
    # ℹ 1 more variable: `Office Fax` <list>
    Warning message:
    Values from `GroupTheme` are not uniquely identified; output will contain list-cols.
    

    Note a couple of things: first, the warning at the end. It's because each row is a person, and Peter in Group 1 has two emails listed, so we get list columns. Second, the large number of NULL values- it's because the sample data is very sparse.

    Doing things this way isn't actually the worst way of organising the data- it could be good if the real data has more info for each person- the actual worst way would be to create separate columns for each potential thing - say, for the sake of argument, if one person in your dataset listed 9 different phone numbers, every row of your dataset would now have a Phone1, Phone2, Phone3 all the way through to Phone9 column, with most of the values being NULL, as in the example. As you can imagine, this takes up a lot of space, is slow, and is hard to use- if you, say, want to check a certain phone number, to find the person associated with it, it now requires checking ten different columns (which is slow and ugly and difficult to code).

    So to summarise, here are your options:

    1. Keep the data as is (if the data is sparse, then probably the best option)
    2. Make it wider like above (and deal with list columns) - probably best if your data is dense and you are okay with list columns (doing x %in% y instead of x == y)
    3. Make extra columns for extra values (really bad idea, avoid)
    4. (If possible) remove the extra values (e.g. if you information on when the data was collected, then removing every email except the last one collected (it sounds like this is actually a data problem, as opposed to a coding problem - the dataset itself sounds like it's a bit of a mess, so there aren't any easy choices)