r, regex, filter, tidyverse, spam

How to filter a table based on email address suffix


I have a table of over 100K names and addresses. I would like to filter the table to keep only the emails that I think are not spam.

For example, I have addresses such as:

[email protected]
[email protected]
[email protected]

I would like to filter out the addresses that have only digits before the @ symbol, as well as those that have only digits after the @ but before the .com suffix.

I know I can extract them using str_split and grepl, but I can't fit them into a filter query to remove them from the table.

library(stringr)

pattern <- "[email protected]"
# Split the address at the @ symbol
str_split(pattern, '@')

# Split the domain name at the dot separating the suffix from the numbers
str_split(string = str_split(pattern, '@')[[1]][2], pattern = "\\.")

# Check whether the extracted string contains only numbers; if not, as.numeric() returns NA
as.numeric(str_split(string = str_split(pattern, '@')[[1]][2], pattern = "\\.")[[1]][1])

But how do I combine this in a tidyverse query?

Thanks

P.S. I know this is a far-fetched question, but is there some kind of spam filter for email addresses that one can use within R?


Solution

  • I think this pattern should help you identify the spam emails according to your conditions.

    ^\\d+@|@\\d+\\.com
    

    To use it in filter, you can use base R's grepl or str_detect from stringr.

    data %>% filter(grepl('^\\d+@|@\\d+\\.com', email))
    

    To get the rows that are not spam, negate the condition with !.

    data %>% filter(!grepl('^\\d+@|@\\d+\\.com', email))
    

    Example:

    x <- c('[email protected]', '[email protected]', '[email protected]', '[email protected]')
    grepl('^\\d+@|@\\d+\\.com', x)
    #[1]  TRUE  TRUE  TRUE FALSE
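
The answer mentions str_detect but only shows grepl, so here is a minimal sketch of the same filter written with stringr inside a dplyr pipeline. The data frame, the names, and all four addresses are made up for illustration; only the pattern and the `data` / `email` names come from the answer above.

```r
library(dplyr)
library(stringr)

# Hypothetical data: names and addresses are invented for illustration.
data <- tibble(
  name  = c("A", "B", "C", "D"),
  email = c("12345@example.com", "jane@98765.com",
            "777@888.com", "john.doe@example.com")
)

# Same pattern as in the answer: digits only before the @,
# or digits only between the @ and ".com".
spam_pattern <- "^\\d+@|@\\d+\\.com"

# Keep only the rows whose email does NOT match the pattern.
clean <- data %>% filter(!str_detect(email, spam_pattern))

print(clean$email)
```

With this made-up data, only john.doe@example.com is kept. One thing to be aware of: the second alternative is not anchored at the end, so an address such as jane@123.company.org would also be flagged. If that is not wanted, appending `$` (that is, `@\\d+\\.com$`) should restrict it to addresses that actually end in .com.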