I am working on a large dataset with lots of text data that needs to be consolidated. There are supposedly unique cases / observations, but they do have duplicates. The catch is, that sometimes the duplicates cases provide complimentary and new information. Therefore, I would like to collapse / merge the cases based on criteria.
I have here a very small sample dataset, that illustrates the idea. Please note that in reality the varText
is often more than 1000 characters long.
varID
represents the targeted unique observation
varCat
represents one categorial data, sometimes it contains NA, sometimes it complements an observation (in the actual case, I have about 10 of these)
varID <- c('a', 'b', 'c', 'd', 'e', 'a', 'b', 'c', 'd', 'c', 'd', 'e', 'a', 'z')
varText <- c('This is a long text', 'This is also a long text',
'This is short', 'This is another unique long text',
'Blabla1', 'Blabla2', 'Blabla3', 'Blabla4', 'Blabla5', 'Blabla6', 'Blabla7',
'Blabla8', 'This is also a long blabla', 'This case is perfectly fine')
varCat <- c('CatA', 'CatB', NA, 'CatC', 'CatA', NA, NA, 'CatC', 'CatA', 'CatB', NA, 'CatC', NA, 'CatF')
df <- data.frame(varID, varText, varCat, stringsAsFactors = FALSE)
Sample df:
varID varText varCat
1 a This is a long text CatA
2 b This is also a long text CatB
3 c This is short <NA>
4 d This is another unique long text CatC
5 e Blabla1 CatA
6 a Blabla2 <NA>
7 b Blabla3 <NA>
8 c Blabla4 CatC
9 d Blabla5 CatA
10 c Blabla6 CatB
11 d Blabla7 <NA>
12 e Blabla8 CatC
13 a This is also a long blabla <NA>
14 z This case is perfectly fine CatF
First I identify all cases that have duplicates:
df <- df %>% add_count(varID, name = 'dupe_varID')
Then I also want to compare the text based on its length:
df$text_length <- stringr::str_length(df$varText)
Finally, I create a new dataframe which only has the duplicated cases. And I think I can use group_by
from dplyr
. But I don't know how to proceed from here.
# filter all duplicated cases into new df sort ???
df2 <- df %>% filter(dupe_varID > 1) %>% group_by(varID) %>% arrange(desc(text_length), varCat)
I would like to have the following result:
varText
should be keptvarCat
the case with the longest text provides varCat
1 a This is also a long blabla CatA
2 b This is also a long text CatB
3 c This is short CatC
4 d This is another unique long text CatC
5 e Blabla1 CatA
14 z This case is perfectly fine CatF
An option is to group by 'varID', then fill
the NA
elements with the adjacent non-NA element and slice
the row with the max
number of characters (nchar
) in 'varText'
library(dplyr)
library(tidyr)
df %>%
group_by(varID) %>%
fill(varCat, .direction = 'downup') %>%
slice(which.max(nchar(varText)))
# A tibble: 6 x 3
# Groups: varID [6]
# varID varText varCat
# <chr> <chr> <chr>
#1 a This is also a long blabla CatA
#2 b This is also a long text CatB
#3 c This is short CatC
#4 d This is another unique long text CatC
#5 e Blabla1 CatA
#6 z This case is perfectly fine CatF