
R function to parse URLs returns "subscript out of bounds" error in strsplit


I'm using R to extract domain names from a column of HTML page URLs. I created a function "domain" to do so. It seems to work fine until it hits entries that came in as "mailto: person@example.com". These are obviously email links. I still want to keep them in my dataset, but the error I get is: "Error in strsplit(gsub("http://|https://|www\\.", "", x), "/")[[c(1, 1)]] : subscript out of bounds"

How can I modify this code to get around the "mailto" pages?

This is my function

domain <- function(x) strsplit(gsub("http://|https://|www\\.","", x),"/")[[c(1,1)]]

This is my command

mainpagelevel3$url <- sapply(mainpagelevel3$url, domain)

I ran this code on a set of urls that did not include a "mailto:" page and it worked just fine, so I think this must be where it's getting stuck. I don't mind if it resulted in "person@example.com" or stays as is.


Solution

  • We could add an if condition that checks for strings which start with "mailto:" and contain "@" (this can be made stricter if needed) and returns them unchanged. The function might look like

    domain <- function(x) {
      if (grepl("^mailto:.*@.*", x)) {
        x
      } else {
        strsplit(gsub("http://|https://|www\\.", "", x), "/")[[c(1, 1)]]
      }
    }
    

    and then use sapply as usual

    mainpagelevel3$url <- sapply(mainpagelevel3$url, domain)
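
If the error persists after adding the mailto check, one other case may be worth ruling out: strsplit() returns character(0) for an empty string, so [[c(1, 1)]] fails with the same "subscript out of bounds" message on empty entries, or on entries that become empty once the scheme is stripped. The following is only a sketch of a more defensive variant. The name `domain_safe`, the sample `urls` vector, and the choice to return NA for empty entries are assumptions for illustration, not part of the original post.

```r
# Hypothetical defensive variant of the answer's function.
domain_safe <- function(x) {
  # Keep mailto: links unchanged, as in the answer above.
  if (grepl("^mailto:.*@.*", x)) return(x)

  # Strip the scheme and "www." as in the original, then split on "/".
  parts <- strsplit(gsub("http://|https://|www\\.", "", x), "/")[[1]]

  # strsplit() gives character(0) for "", so guard before indexing.
  # Returning NA here is an assumption; another placeholder would also work.
  if (length(parts) == 0) NA_character_ else parts[1]
}

# Made-up sample input covering the three cases.
urls <- c("http://www.example.com/page",
          "mailto: person@example.com",
          "")

result <- unname(sapply(urls, domain_safe))
print(result)
```

It would be applied in the same way as before, for example `mainpagelevel3$url <- unname(sapply(mainpagelevel3$url, domain_safe))`.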