Tags: r, dataframe, dplyr, duplicates, consolidation

How to collapse and aggregate rows with character / long string data


I am working on a large dataset with lots of text data that needs to be consolidated. The cases / observations are supposed to be unique, but some of them have duplicates. The catch is that the duplicate cases sometimes provide complementary, new information. Therefore, I would like to collapse / merge the cases based on certain criteria.

Here is a very small sample dataset that illustrates the idea. Please note that in reality varText is often more than 1000 characters long.

varID represents the targeted unique observation

varCat represents a categorical variable; sometimes it contains NA, sometimes it complements an observation (in the actual case, I have about 10 of these)

    varID <- c('a', 'b', 'c', 'd', 'e', 'a', 'b', 'c', 'd', 'c', 'd', 'e', 'a', 'z')
    varText <- c('This is a long text', 'This is also a long text', 
                 'This is short', 'This is another unique long text', 
                 'Blabla1', 'Blabla2', 'Blabla3', 'Blabla4', 'Blabla5', 'Blabla6', 'Blabla7', 
                 'Blabla8', 'This is also a long blabla', 'This case is perfectly fine')
    varCat <- c('CatA', 'CatB', NA, 'CatC', 'CatA', NA, NA, 'CatC', 'CatA', 'CatB', NA, 'CatC', NA, 'CatF')

    df <- data.frame(varID, varText, varCat, stringsAsFactors = FALSE)

Sample df:

       varID                          varText varCat
    1      a              This is a long text   CatA
    2      b         This is also a long text   CatB
    3      c                    This is short   <NA>
    4      d This is another unique long text   CatC
    5      e                          Blabla1   CatA
    6      a                          Blabla2   <NA>
    7      b                          Blabla3   <NA>
    8      c                          Blabla4   CatC
    9      d                          Blabla5   CatA
    10     c                          Blabla6   CatB
    11     d                          Blabla7   <NA>
    12     e                          Blabla8   CatC
    13     a       This is also a long blabla   <NA>
    14     z      This case is perfectly fine   CatF

First I identify all cases that have duplicates:

    df <- df %>% add_count(varID, name = 'dupe_varID')
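
As a quick check, the counts that add_count() attaches can be inspected per varID; this is just a sketch for reference, using dplyr's distinct(), and the counts follow directly from the sample data (z is the only ID without duplicates):

    library(dplyr)

    # One row per varID with its duplicate count
    df %>% distinct(varID, dupe_varID)
    #   varID dupe_varID
    # 1     a          3
    # 2     b          2
    # 3     c          3
    # 4     d          3
    # 5     e          2
    # 6     z          1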

Then I also want to compare the text based on its length:

    df$text_length <- stringr::str_length(df$varText)
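
Since the solution below works with nchar() directly, it is worth noting that, for the plain non-missing strings in this sample, base R's nchar() returns the same lengths as stringr::str_length(); a minimal in-pipe equivalent (just a sketch) would be:

    library(dplyr)

    # Same result as str_length() for the plain strings in this sample
    df <- df %>% mutate(text_length = nchar(varText))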

Finally, I create a new data frame that only contains the duplicated cases. I think I can use group_by from dplyr, but I don't know how to proceed from here.

    # filter all duplicated cases into a new df and sort -- then what?
    df2 <- df %>% filter(dupe_varID > 1) %>% group_by(varID) %>% arrange(desc(text_length), varCat)
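
One detail to keep in mind: arrange() largely ignores group_by(), so the call above sorts the whole data frame rather than within each varID. Adding .by_group = TRUE sorts inside the groups; a sketch of the adjusted step:

    library(dplyr)

    # Sort within each varID group: longest text first, then by varCat
    df2 <- df %>% 
       filter(dupe_varID > 1) %>% 
       group_by(varID) %>% 
       arrange(desc(text_length), varCat, .by_group = TRUE)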

I would like to have the following result:

  • The longest varText should be kept
  • NA values are replaced with non-NA values
  • Duplicates are dropped
  • If there is a conflict in varCat, the case with the longest text provides the varCat

Expected result:
 1 a     This is also a long blabla       CatA
 2 b     This is also a long text         CatB
 3 c     This is short                    CatC
 4 d     This is another unique long text CatC
 5 e     Blabla1                          CatA
14 z     This case is perfectly fine      CatF

Solution

  • An option is to group by 'varID', then fill the NA elements with the adjacent non-NA element and slice the row with the max number of characters (nchar) in 'varText'

    library(dplyr)
    library(tidyr)
    df %>% 
       group_by(varID) %>%
       fill(varCat, .direction = 'downup') %>% 
       slice(which.max(nchar(varText)))
    # A tibble: 6 x 3
    # Groups:   varID [6]
    #  varID varText                          varCat
    #  <chr> <chr>                            <chr> 
    #1 a     This is also a long blabla       CatA  
    #2 b     This is also a long text         CatB  
    #3 c     This is short                    CatC  
    #4 d     This is another unique long text CatC  
    #5 e     Blabla1                          CatA  
    #6 z     This case is perfectly fine      CatF
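
On newer dplyr versions (>= 1.0.0, an assumption here) the same step can also be written with slice_max(); with_ties = FALSE keeps a single row per group even where text lengths tie (in group 'e', 'Blabla1' and 'Blabla8' are both 7 characters). Like the solution above, this runs on the full df, so the singleton case 'z' passes through unchanged and the dupe_varID / text_length helper columns are not needed:

    library(dplyr)
    library(tidyr)

    df %>% 
       group_by(varID) %>%
       fill(varCat, .direction = 'downup') %>% 
       slice_max(nchar(varText), n = 1, with_ties = FALSE) %>%
       ungroup()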