I'm using R to extract domain names from a column of page URLs. I wrote a function, domain, to do so. It seems to work fine until it hits entries that came in as "mailto: person@example.com", which are obviously email links. I still want to keep these in my dataset, but the error I get is: "Error in strsplit(gsub("http://|https://|www\.", "", x), "/")[[c(1, 1)]] : subscript out of bounds"
How can I modify this code to get around the "mailto" pages?
This is my function:
domain <- function(x) strsplit(gsub("http://|https://|www\\.","", x),"/")[[c(1,1)]]
This is my command:
mainpagelevel3$url <- sapply(mainpagelevel3$url, domain)
I ran this code on a set of URLs that did not include a "mailto:" entry and it worked just fine, so I think this must be where it's getting stuck. I don't mind whether the result is "person@example.com" or the entry stays as is.
We could write an if condition that checks for strings which start with "mailto:" and contain "@" (this can be made stricter if needed). The function might then look like:
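domain <- function(x) {
  if (grepl("^mailto:.*@.*", x)) {
    x  # mailto links are returned unchanged
  } else {
    # strip the scheme and "www.", then keep everything before the first "/"
    strsplit(gsub("http://|https://|www\\.", "", x), "/")[[c(1, 1)]]
  }
}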
and then use sapply as usual:
mainpagelevel3$url <- sapply(mainpagelevel3$url, domain)
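As a quick check, here is a minimal sketch that runs the function on a small made-up vector. The `urls` values are hypothetical stand-ins for `mainpagelevel3$url`, and `domain_strip` is a name invented for this example: it is an assumed variant that drops the "mailto:" prefix, in case the bare address is preferred over keeping the entry as is.

```r
# Hypothetical sample data standing in for mainpagelevel3$url
urls <- c("http://www.example.com/page/1",
          "https://example.org/about",
          "mailto:person@example.com")

# Function from the answer: mailto links pass through unchanged
domain <- function(x) {
  if (grepl("^mailto:.*@.*", x)) x
  else strsplit(gsub("http://|https://|www\\.", "", x), "/")[[c(1, 1)]]
}

# Assumed variant (not in the original): return the bare email address
domain_strip <- function(x) {
  if (grepl("^mailto:.*@.*", x)) sub("^mailto:[[:space:]]*", "", x)
  else domain(x)
}

# USE.NAMES = FALSE keeps the result a plain, unnamed character vector
kept     <- sapply(urls, domain, USE.NAMES = FALSE)
stripped <- sapply(urls, domain_strip, USE.NAMES = FALSE)
print(kept)
print(stripped)
```

One thing that may be worth checking in the real data: `[[c(1, 1)]]` raises "subscript out of bounds" whenever `strsplit` returns an empty vector, which is what it does for an empty string. If the column contains empty entries, or entries that become empty after the `gsub` call, they would probably need a similar guard.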