
Remove all instances of duplicates in bibliographic dataset in R


I have two bibliographic datasets, A and B (.bib files, WoS export, full record and cited references). Both contain relevant and irrelevant results. The first dataset, A, has been cleaned, so I have the relevant results A(r) and the irrelevant results A(i) as two separate datasets (.bib files). The second dataset, B, contains all of A.

[Figure: visualisation of my two datasets]

Goal: I am looking for a way to remove the irrelevant results A(i), which I have already identified in my first dataset, from my second dataset B.

Approach: If I merged B and A(i), I could find the irrelevant results A(i) in B with a remove-duplicates function, since every A(i) record would then occur twice. However, this would remove only the duplicate copies of A(i), not all instances of A(i).
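
The merge idea can be made to drop every copy, not only the repeats, by flagging duplicates from both directions. What follows is a minimal base R sketch, not a tested recipe for real WoS exports: the data frames `B` and `Ai` are toy stand-ins, and it assumes the records are already in a data frame with a `title` column and that titles match exactly.

```r
# Toy stand-ins for the real data (hypothetical; only a title column is assumed)
B  <- data.frame(title = letters[1:7], misc = "X", stringsAsFactors = FALSE)
Ai <- data.frame(title = c("d", "c", "f"), misc = "X", stringsAsFactors = FALSE)

# Stack B and A(i), so every A(i) title now occurs twice
merged <- rbind(B, Ai)

# duplicated() alone flags only the later copies; scanning from the end as
# well flags the first copies too, so every instance of a repeated title is marked
is_dup <- duplicated(merged$title) | duplicated(merged$title, fromLast = TRUE)

# Keep only the titles that occurred exactly once
B_clean <- merged[!is_dup, ]
print(B_clean$title)
```

Note that this would also drop any title that already occurs more than once within B itself, so it only fits if B has been deduplicated first.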

Functions to remove duplicates:

Package revtools:

    matches <- find_duplicates(data, match_variable = "title")
    data_unique <- extract_unique_references(data, matches)

Package bibliometrix:

    duplicatedMatching(M, Field = "TI", tol = 0.95)

• Q1: Is there a way to remove all instances of duplicates (the duplicates and the originals) identified through a find/remove-duplicates function?

• Q2: Is there a better way to remove A(i) from B, i.e. to remove all instances of duplicates in a dataset?

• Q3: More generally, can I search my dataset for a larger set of specific bibliographic records (a list of papers) and remove them from that dataset?

Thank you so much for your help!


Solution

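  • You can use match to find identical titles in the two data sets. Note that match returns only the first position of each title and NA for titles that are missing from B, so this assumes every title to remove occurs exactly once in B.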

    #remove Ai from B
    B[-match(unique(Ai$title), B$title),]
    #  title misc
    #1     a    X
    #2     b    X
    #5     e    X
    #7     g    X
    
    #remove Ai and Ar from B
    B[-match(unique(c(Ai$title, Ar$title)), B$title),]
    #  title misc
    #7     g    X
    

    Data:

    Ar <- data.frame(title=c("a", "b", "e"), misc="X", stringsAsFactors = FALSE)
    Ai <- data.frame(title=c("d", "c", "f"), misc="X", stringsAsFactors = FALSE)
    B <- data.frame(title=letters[1:7], misc="X", stringsAsFactors = FALSE)
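
If B may contain the same title more than once, or some titles in A(i) may not appear in B at all, a logical filter with `%in%` is a more forgiving variant of the same idea. This is a minimal sketch under assumptions: `norm_title` is a hypothetical helper (not from any package) that lower-cases titles and tidies whitespace before comparing, and the toy data below is made up for illustration. Real WoS titles may need stricter or fuzzier matching than this.

```r
# Hypothetical helper: normalise titles before comparing
# (lower-case, collapse runs of whitespace, trim both ends)
norm_title <- function(x) trimws(gsub("\\s+", " ", tolower(x)))

# Toy data (assumed): B has a repeated title, Ai has a title missing from B
B  <- data.frame(title = c("A Study", "b", "c", "c", "d"),
                 misc = "X", stringsAsFactors = FALSE)
Ai <- data.frame(title = c("a  study ", "c", "z"),
                 misc = "X", stringsAsFactors = FALSE)

# TRUE for rows of B whose normalised title is not in A(i)
keep <- !(norm_title(B$title) %in% norm_title(Ai$title))

# Drops every matching row of B; titles absent from B are simply ignored
B_clean <- B[keep, , drop = FALSE]
print(B_clean$title)
```

Unlike negative indexing with match, the logical index never contains NA and removes all matching rows rather than only the first one.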