r, regex, text-mining

Extracting all the different reference formats from a PDF document in R with regex (multiple options/capture groups?)


I am trying to clean some PDF documents for text analysis by finding all the references in the text and removing them. My problem is that there are so many ways to cite. My documents are split up into single lines. I have a working regex, but it only captures the standard format:

a) Author (year), something .

  "Author, firstname, someone, else (1996), something: Analysis, Paris.\r"

In addition to option a, I want to capture:

b) Author (year(character)), something .

  "Author, firstname, someone, else (1996a), something: Analysis, Paris.\r"

c) Author (forthcoming), something .

  "Author, firstname, someone, else (forthcoming), something: Analysis, Paris.\r"

d) Author/s (eds.) (year), ....

  "Author, firstname, someone, else (eds.) (1996), something: Analysis, Paris.\r"

e) Author (n.d.), ....

  "Author, firstname, someone, else (n.d.), something: Analysis, Paris.\r"

I have found all of those in my documents. There may be formats I have not found yet, so if you have examples, or a pattern that grabs those as well, I'm grateful for every bit of help.

The working code is the following:

   [ ]*[A-Z].*\([0-9]{4}\),[[:space:]][“A-Z]

My latest try is this:

   [ ]*[A-Z].*(\([a-z]{3,4}\.?\))?(\([0-9]{4}[a-z]?\))?(\(forthcoming\))?,[[:space:]][“A-Z]

I tried to make as many pieces optional as I could, but now it grabs too much.

I expect a list of all the references the regex finds, if possible covering all the options. At the moment it grabs either too little (first pattern) or too much (second pattern).


Solution

  • My latest try is this:

       [ ]*[A-Z].*(\([a-z]{3,4}\.?\))?(\([0-9]{4}[a-z]?\))?(\(forthcoming\))?,[[:space:]][“A-Z]
    

    I tried to make as many pieces optional as I could, but now it grabs too much.

    You built the three optional pieces almost perfectly, but because all of them are optional, the expression also matches when none of them is present. It is better to use the alternation operator |, which requires one of the subexpressions to match, i.e. instead of X?Y?Z? write (X|Y|Z). This gives:

      [ ]*[A-Z].*(\([.a-z]{3,4}\.?\)|\([0-9]{4}[a-z]?\)|\(forthcoming\)),[[:space:]][“A-Z]
    

    (Note that I changed the first [a-z] to [.a-z] in order to also cover the (n.d.) case.)
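
To see the difference between the two patterns, here is a minimal R sketch. The sample lines are made up for illustration (they are not from the asker's documents), and the variable names are hypothetical. The title after the comma is capitalised because the pattern ends in a class that only accepts an uppercase letter or an opening curly quote. That quote is written as \u201c so the script stays ASCII. Note that in an R string every regex backslash has to be doubled.

```r
# Hypothetical sample lines, one per citation format, plus one ordinary sentence.
lines <- c(
  "Smith, J. (1996), Something: Analysis, Paris.",
  "Smith, J. (1996a), Something: Analysis, Paris.",
  "Smith, J. (forthcoming), Something: Analysis, Paris.",
  "Smith, J. and Jones, K. (eds.) (1996), Something: Analysis, Paris.",
  "Smith, J. (n.d.), Something: Analysis, Paris.",
  "In the end, Paris was chosen as the venue."   # not a reference
)

# All-optional version from the question: X?Y?Z?
optional_pattern <- paste0(
  "[ ]*[A-Z].*",
  "(\\([a-z]{3,4}\\.?\\))?(\\([0-9]{4}[a-z]?\\))?(\\(forthcoming\\))?",
  ",[[:space:]][\u201cA-Z]"
)

# Alternation version from the answer: (X|Y|Z)
alternation_pattern <- paste0(
  "[ ]*[A-Z].*",
  "(\\([.a-z]{3,4}\\.?\\)|\\([0-9]{4}[a-z]?\\)|\\(forthcoming\\))",
  ",[[:space:]][\u201cA-Z]"
)

# grepl() returns one logical value per line.
optional_hits    <- grepl(optional_pattern, lines)
alternation_hits <- grepl(alternation_pattern, lines)

# Split the document into reference lines and the remaining text.
references <- lines[alternation_hits]
cleaned    <- lines[!alternation_hits]
```

With these sample lines, the all-optional pattern flags every line, including the ordinary sentence, because ", P" alone is enough to satisfy it. The alternation pattern flags only the five reference lines, so `references` holds the citations and `cleaned` holds the remaining text.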