Find and remove duplicates in a BibTeX library (BibDesk) using AppleScript


I have more than a thousand duplicates in my BibTeX library. The duplicates do not share Citation Keys, but they do have identical titles. I have tried both BibDesk and JabRef to remove them, but neither manages to find them all; not even half of them.

I found one promising AppleScript here: http://se-server.ethz.ch/staff/af/bibdesk/

But since I am a total beginner with AppleScript, I couldn't adapt it to my needs.

Here is the AppleScript:

on run {}
    CleanupDuplicates()
end run


-- IMPORTANT NOTE: The following routine is an identical copy as contained in files 'Cleanup Duplicates.scpt' and 'Fix PDF and URL Links.scpt'. Make sure the two copies are always kept identical.
on CleanupDuplicates()
    set theBibDeskDocu to document 1 of application "BibDesk"
    tell document 1 of application "BibDesk"
        -- get all publications sorted by cite key ensuring that in any set of publications with the same cite key the youngest comes first and the oldest, typically the only one of the set that is still member of any static groups, comes last. To retain static group memberships we have to ensure that such "membership info" is copied from the last to the first publication of any set of publications with the same cite key (see vars 'aPub', 'prevPub', 'youngestPub').
        set thePubs to (sort (get publications) by "Cite Key" subsort by "Date-Added" without ascending)
        set theDupes to {}
        set prevCiteKey to missing value
        set prevPub to missing value
        set youngestPub to missing value
        repeat with aPub in thePubs
            set aCiteKey to cite key of aPub
            ignoring case
                if aCiteKey is prevCiteKey then
                    set end of theDupes to aPub
                    -- we fix the static group membership redundantly in cases where aPub is also merely an obsolete duplicate, since we have possibly not yet advanced to the end of the set with the same cite key. But this is unavoidable with this algorithm looping simply through all publications. The end result will be that youngestPub (first in set of publications with same cite key) will be member of all static groups of the publications in the set (unification). The latter should be no big issue, since typically in multiple sets of publications it is only the last publication that matters. If this should be an issue, then we would need to first delete all static group membership info in 'youngestPub' in case we encounter a 3rd, or 4th etc. same cite key in 'aPub', and copy only those of 'aPub'. However, for the sake of efficiency I wish not to support this behavior.
                    my fixGroupMembership(theBibDeskDocu, aCiteKey, aPub, youngestPub)
                else
                    -- remember in 'youngestPub' a possible candidate for a new set of publications with the same cite key
                    set youngestPub to aPub
                end if
            end ignoring
            set prevCiteKey to aCiteKey
            set prevPub to aPub
        end repeat
        repeat with aPub in theDupes
            delete aPub
        end repeat
    end tell
end CleanupDuplicates


on fixGroupMembership(theBibDeskDocu, theCiteKey, oldPub, newPub)
    tell application "BibDesk"
        tell theBibDeskDocu
            set thePubsGroups to (get static groups whose publications contains oldPub)
            if (count of thePubsGroups) is greater than 0 then
                repeat with aGroup in thePubsGroups
                    add newPub to aGroup
                end repeat
            end if
        end tell
    end tell
end fixGroupMembership

So, what I want is to find the duplicates by title and, within each set of duplicates, delete the oldest ones (that is, oldest by modification date).

Can you guys help me modify this script please?


Solution

  • Use this script:

    on run {}
        CleanupDuplicates()
    end run
    
    on CleanupDuplicates()
        script o
            property thePubs : {}
        end script
        tell document 1 of application "BibDesk"
            -- get all publications sorted by Title (same titles are sorted by Date-Modified, descending)
            set o's thePubs to (sort (get publications) by "Title" subsort by "Date-Modified" without ascending)
            set tc to count o's thePubs
            set i to 1
    
            repeat while i < tc
                set theTitle to title of item i of o's thePubs
                repeat with j from (i + 1) to tc -- check the next title
                    considering case --  match the case, *** remove this if you want to ignore the case
                        if (title of item j of o's thePubs) is not theTitle then exit repeat ---  not the same title, so exit this loop ---
                    end considering
    
                    delete item j of o's thePubs --- the title is the same, so remove this publication (a duplicate, oldest modification date) ---
                end repeat
                set i to j
            end repeat
        end tell
    end CleanupDuplicates
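
To see which entries the scan above would delete without touching a BibDesk document, here is a minimal dry-run sketch in plain AppleScript. It assumes the titles are already sorted so that equal titles sit next to each other, as the `sort` command in the script arranges. The handler name `duplicateIndexes` and the sample titles are hypothetical and not part of BibDesk.

```applescript
-- Hypothetical helper (not a BibDesk command): given titles that are already
-- sorted so equal titles are adjacent, return the indexes the scan would delete.
on duplicateIndexes(sortedTitles)
    set dupes to {}
    set tc to count sortedTitles
    set i to 1
    repeat while i < tc
        set theTitle to item i of sortedTitles
        set nextStart to tc + 1 -- assume the run of equal titles reaches the end of the list
        repeat with j from (i + 1) to tc
            considering case -- drop this block to compare titles without regard to case
                set sameTitle to ((item j of sortedTitles) is theTitle)
            end considering
            if not sameTitle then
                set nextStart to j -- a new title starts here
                exit repeat
            end if
            set end of dupes to j -- same title as item i, so it counts as a duplicate
        end repeat
        set i to nextStart
    end repeat
    return dupes
end duplicateIndexes

-- Sample data (made up): the first item of each run is kept, the rest are flagged.
duplicateIndexes({"A Study", "A Study", "A Study", "Another Paper", "Zeta", "Zeta"})
-- result: {2, 3, 6}
```

Because the real script sorts equal titles by Date-Modified in descending order, the first item of each run is the most recently modified one, so the flagged indexes correspond to the older copies.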
    

    Update

    Caveat: some publications have no modification date.

    To sort publications by modification date properly, you need to define the Date-Modified field on publications that have not been modified.

    An AppleScript can't change the date property of a publication in BibDesk because these dates are read-only.

    Here's a solution:

    1. Close the document in BibDesk.
    2. Open the ".bib" file in the "TextWrangler" application.
    3. Run this script:


    -- This script adds a modification date to publications that have no "Date-Modified" field, using the value of their "Date-Added" field.
    -- Open the ".bib" file in "TextWrangler", then run this script.
    tell application "TextWrangler"
        tell text document 1
            select line 1 -- to start the search at the beginning of the document
    
            repeat -- until not found
                -- search "Date-Added" + (a blank line or the end of the document)
                set r to find "(?s)^\\tDate-Added = {.+?(^$|\\z)" searching in it options {search mode:grep, wrap around:false} with selecting match
                if found of r then
                    if "Date-Modified = {" is not in (found text of r) then -- the Date-Modified field is not in this publication
                        set x to startLine of found object of r
                        set t to text 12 thru -1 of (get contents of line x) -- get the value of the Date-Added field --> " = {2016.09.10 03:34}," as example
                        add suffix (line x) suffix "\\n\\tDate-Modified" & t -- append (a line break + a tab + "Date-Modified" + the value of the Date-Added) to this line
                    end if
                else
                    exit repeat -- not found, or end of the document
                end if
            end repeat
        end tell
    end tell
    
    4. From TextWrangler, Save or "Save as..." and close the document.
    5. Open the ".bib" file in BibDesk.
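
The string slicing in the TextWrangler script can be checked on its own. The sketch below uses plain AppleScript with no application involved, and assumes a "Date-Added" line that starts with a single tab, which is the layout the script's search pattern expects. The sample date is the one from the script's comment; the variable names `addedLine` and `newLine` are only for illustration.

```applescript
-- A "Date-Added" line as it would appear in the ".bib" file (leading tab assumed)
set addedLine to tab & "Date-Added = {2016.09.10 03:34},"

-- Character 1 is the tab and characters 2 thru 11 are "Date-Added",
-- so text 12 thru -1 is everything after the field name
set t to text 12 thru -1 of addedLine
-- t is " = {2016.09.10 03:34},"

-- The line that gets appended for the missing field
set newLine to tab & "Date-Modified" & t
-- newLine is tab & "Date-Modified = {2016.09.10 03:34},"
```

If the field lines in your file are indented with something other than a single tab, the offset 12 would need adjusting.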