Search code examples
rpivottidyr

data wrangling: how can i filter rows with multiple values? or other ways to handle nested values


when i use pivot_wider() on my data, there are some numeric values that are duplicates, forcing conversion to a list with nested values.

here's an example:

dfx <- data.frame("Arb_PersonID" = c(1,1,2,2,2,2), 
                  "LabComponentName" = c("2ch", "4ch", "2ch", "4ch", "2ch", "4ch"), 
                  "LabResult" = c(56, 55, 50, 54, 56, 60))

class(dfx1$`2ch`)

this shows:

> tibble(dfx)
# A tibble: 6 × 3
  Arb_PersonID LabComponentName LabResult
         <dbl> <chr>                <dbl>
1            1 2ch                     56
2            1 4ch                     55
3            2 2ch                     50
4            2 4ch                     54
5            2 2ch                     56
6            2 4ch                     60

> class(dfx$LabResult)
[1] "numeric"

Then pivot_wider:

dfx1 <- dfx %>%
  pivot_wider(names_from = LabComponentName, values_from = LabResult)



  Arb_PersonID `2ch`     `4ch`    
             1  56         55
             2  c(50,56)  c(54,60)

> class(dfx1$`2ch`)
[1] "list"

I'd like to use the average of the 2 values (ie 53 instead of c(50,56)) and keep the class numeric.

Thanks in advance for any help. I'm using a very large dataset of 14,000 rows and these duplicates are about 1/500. Manually reviewing and correcting is quite painful, but not impossible.


Solution

  • There's a handy values_fn option within the pivot_wider call which will allow you do simply do the mean whenever there are multiple values:

    library(tidyverse)
    
    dfx <- data.frame("Arb_PersonID" = c(1,1,2,2,2,2), 
                      "LabComponentName" = c("2ch", "4ch", "2ch", "4ch", "2ch", "4ch"), 
                      "LabResult" = c(56, 55, 50, 54, 56, 60))
    
    dfx %>%
      pivot_wider(names_from = LabComponentName, values_from = LabResult,
                  values_fn = mean)
    #> # A tibble: 2 × 3
    #>   Arb_PersonID `2ch` `4ch`
    #>          <dbl> <dbl> <dbl>
    #> 1            1    56    55
    #> 2            2    53    57