r, regex, text-mining

Extracting links from PDFs in R with a regex


I am trying to clean links out of a list of PDFs. I want to include this in my cleaning function and therefore use regexes. And yes, I have spent more time than I would like to admit googling and browsing through questions here. My PDFs are split into lines, so the text is not one consecutive string. I have a piece of code that gives me only one link as a result (even though there should be many). All other options I tried matched a lot of text that I want to keep in my dataset.

I have tried multiple options outside my function, but they would not run on my texts, only on the examples.

I want to catch everything from the www up to the first whitespace, including whatever comes after the .org or .html (e.g. /questions/ask/somethingelse).

I tried simulating some things:

w <- "www.smthing.org/knowledge/school/principal.\r"
z <- "www.oecd.de\r"
x <- "www.bla.pdfwerr\r .irgendwas" # should not catch that, too many characters after the . 
m <-  "           www.cognitioninstitute.org/index.php/Publications/ 
 bla test smth 
  .gtw, www.stmthing-else.html.\r"
n <- "decoy"


l <- list(w,z,x,m,n)

regmatches(l, regexpr("w{3}\\.[a-z]*\\.[a-z]{2,4}.*?[[:space:]]", l))

My current working version also catches only the first occurrence in that particular line, instead of stopping at the space (line m in my example) and then picking up the next link as well.


Solution

  • You may use

    regmatches(l, gregexpr("w{3}\\.\\S*\\b", l))
    

    The gregexpr function will let you extract all occurrences of the pattern; see the examples after the pattern breakdown below.

    Note that most users prefer spelling out www instead of using w{3}.

    Pattern details

    • w{3} - three w chars
    • \\. - a dot
    • \\S* - zero or more non-whitespace chars
    • \\b - word boundary.
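
For reference, here is the suggested call applied to the sample list from the question. The matches are sketched as comments: an approximation of what the pattern should return rather than verbatim console output (note that the trailing \\b drops end punctuation such as the final . or /).

regmatches(l, gregexpr("w{3}\\.\\S*\\b", l))
# roughly:
# [[1]] "www.smthing.org/knowledge/school/principal"
# [[2]] "www.oecd.de"
# [[3]] "www.bla.pdfwerr"
# [[4]] "www.cognitioninstitute.org/index.php/Publications"
#       "www.stmthing-else.html"
# [[5]] character(0)

# Equivalent pattern with www spelled out instead of w{3}
regmatches(l, gregexpr("www\\.\\S*\\b", l))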
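
Since the goal is to strip the links inside a cleaning function, the same pattern can also be reused with gsub. This is a minimal sketch under the assumption that the PDF text is held as a character vector of lines; the name clean_pdf_text and the trimws() tidy-up are illustrative, not part of the accepted answer.

clean_pdf_text <- function(lines) {
  # drop every link starting at www. up to the next whitespace
  no_links <- gsub("w{3}\\.\\S*\\b", "", lines)
  # remove leftover leading/trailing whitespace
  trimws(no_links)
}

# e.g. on the sample data, flattened to a character vector
clean_pdf_text(unlist(l))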