Search code examples
rdataframedplyrduplicatesedge-list

How to extract unique and double records by row while keeping original row order using R?


Original (see Update below)

I'm a newbie to R and currently working with collaboration data in the form of an edge list with 32 columns and around 200.000 rows in the following representation:

1  A    A    A    B    C    A
2  A    B    B    B    C    A
3  A    B    C    C    C    C
4  B    A    B    A    B    C

A, B, C represent the countries of the in the publication participating researchers' institutions. In the real data set "A" is e.g. a country name such as "England" or "China".

I want to keep unique records (A) and double records (A A) but remove triplets (A A A) and more occurence of the same record in a row. Collaborations are assigned to publications by ID so the order should stay the same to allow for later analysis. However, order within rows does not matter.

So, it should look like this eventually.

1  A   A    B    C    
2  A   B    B    C    A
3  A   B    C    C        
4  B   A    B    A    C

So far, I've tried a few things based on 1, 2, 3 and triplicatedfrom the tuple package.

df <- data.frame(CTR1 = c("A", "A", "A", "B"), CTR2 = c("A", "B", "B", "A"), CTR3 = c("A", "B", "B", "A"), CTR4 = c("B", "B", "C", "A"), CTR5 = c("C", "C", "C", "B"), CTR6 = c("A", "A", "C", "C"), ID = c(1,2,3,4))

# remember the ID

n <-df$ID

# transpose df (data frame)

dt <- as.data.frame(t(df[, -1]))
colnames(dt) <- n

library(tuple)

dt[!(triplicated(dt) | triplicated(dt, fromLast= TRUE)), ]

# Create new transposed variable

df2 <- as.data.frame(t(df))

However, I delete the complete rows using dt[!(triplicated(dt) | triplicated(dt, fromLast= TRUE)), ] rather than just deleting the specific records that present a surplus by row what leads me to the following 4x4 table ...

   CTR1 CTR3 CTR4 CTR5                   CTR1 CTR2 CTR3 CTR4 CTR5 CTR6
1  A    A    B    C                   1  A    A    B    C        
2  A    B    B    C    rather than    2  A    B    B    C    A   
3  A    B    C    C                   3  A    B    C    C        
4  B    A    A    B                   4  B    A    B    A    C    

I've also taken a look at dplyr and the responses here and here but couldn't figure out a fitting approach so far.

Working Solution for original problem description

library(tidyr)
library(dplyr)

countrydf %>% 
  unite(concat,CTR1:CTR6, sep = "") %>% 
  mutate(concat = gsub("([a-zA-Z1-9])\\1{2,}", "\\1\\1-", concat)) %>% 
  separate(concat, paste0("CTR", 1:6), sep = "(?<=.)", remove = TRUE)

Edit1: Adjusted description for clarification: In the real data set "A" is e.g. a country name such as "England" or "China".

Edit2: Adding a more accurated reproducible example.

Update

Added a more accurated reproducible example, incorporated correct answer for original problem description (see below):

countrydf <- data.frame(ID = c(1,2,3,4), 
CTR1 = c("England", "England", "England", "China"),
CTR2 = c("England", "China", "China", "England"),
CTR3 = c("England", "China", "China", "England"),
CTR4 = c("China", "China", "USA", "England"),
CTR5 = c("USA", "USA", "USA", "China"),
CTR6 = c("England", "England", "USA", "USA"))

Aspired outcome

     CTR1    CTR2    CTR3   CTR4  CTR5  CTR6      ID
1    England England        China USA              1
2    England China   China        USA   England    2
3    England China   China  USA   USA              3
4    China   England England      China USA        4

Solution

  • We can use tidyr functions (unite and separate) and with the help of achieve the desired output.

    This should get very close to what you need.

    library(tidyr)
    library(dplyr)
    
    df1 %>% 
      unite(concat,CTR1:CTR6, sep = "") %>% 
      mutate(concat = gsub("([a-zA-Z1-9])\\1{2,}", "\\1\\1-", concat)) %>% 
      separate(concat, paste0("CTR", 1:6), sep = "(?<=.)", remove = TRUE)
    
    #>   ID CTR1 CTR2 CTR3 CTR4 CTR5 CTR6
    #> 1  1    A    A    -    B    C    A
    #> 2  2    A    B    B    -    C    A
    #> 3  3    A    B    B    C    C    -
    #> 4  4    B    A    A    -    B    C
    

    Update:

    Thanks to @IceCreamToucan for the function:

    library(tidyr)
    library(dplyr)
    
    ICT_fn <- function(x){
    xsplit <- strsplit(x, '')[[1]]
    xsplit[data.table::rowid(xsplit) >= 3] <- '-'
    paste(xsplit, collapse = '')}
    
    df1 %>% 
      unite(concat,CTR1:CTR6, sep = "") %>% 
      rowwise() %>% 
      mutate(concat = ICT_fn(concat)) %>% 
      separate(concat, paste0("CTR", 1:6), sep = "(?<=.)", remove = TRUE)
    
    #> # A tibble: 4 x 7
    #>      ID CTR1  CTR2  CTR3  CTR4  CTR5  CTR6 
    #>   <dbl> <chr> <chr> <chr> <chr> <chr> <chr>
    #> 1     1 A     A     -     B     C     -    
    #> 2     2 A     B     B     -     C     A    
    #> 3     3 A     B     B     C     C     -    
    #> 4     4 B     A     A     -     B     C
    

    This updated solution, with minimal tweaks, works for name of countries as well; look below:

    df2 <- data.frame(ID = c(1,2,3,4),
                             CTR1 = c("England", "England", "England", "France"), 
                             CTR2 = c("England", "France", "France", "England"), 
                             CTR3 = c("England", "France", "France", "England"), 
                             CTR4 = c("France", "France", "Germany", "England"),
                             CTR5 = c("Germany", "Germany", "Germany", "France"), 
                             CTR6 = c("England", "England", "Germany", "Germany"))
    
    library(tidyr)
    library(dplyr)
    
    ICT_fn <- function(x){ #Credits to IceCreamToucan
      xsplit <- strsplit(x, ',')[[1]]
      xsplit[data.table::rowid(xsplit) >= 3] <- '-'
      paste(xsplit, collapse = ',')}
    
    df2 %>% 
      unite(concat,CTR1:CTR6, sep = ",") %>% 
      rowwise() %>% 
      mutate(concat = ICT_fn(concat)) %>% 
      separate(concat, paste0("CTR", 1:6), sep = ",", remove = TRUE)
    
    #> # A tibble: 4 x 7
    #>      ID CTR1    CTR2    CTR3    CTR4    CTR5    CTR6   
    #>   <dbl> <chr>   <chr>   <chr>   <chr>   <chr>   <chr>  
    #> 1     1 England England -       France  Germany -      
    #> 2     2 England France  France  -       Germany England
    #> 3     3 England France  France  Germany Germany -      
    #> 4     4 France  England England -       France  Germany