Tags: r, web-scraping, nlp

Remove specific string or blank member from character vector


I am scraping https://www.transparency.org/news/pressreleases/year/2010 to retrieve the header and details from each page. However, along with the header and details, a telephone number and a blank string also show up in the retrieved vector for every page.

[1] "See our simple, animated definitions of types of corruption and the ways to challenge it."
[2] "Judiciary - Commenting on Justice Bean’s sentencing in the BAE Systems’ Tanzania case, Transparency International UK welcomed the Judge’s stringent remarks concerning BAE Systems’ past conduct."
[3] " "
[4] "+49 30 3438 20 666"

I have tried the following code, but it didn't work.

html %>% str_remove('+49 30 3438 20 666') %>% str_remove(' ')

How can these elements be removed?


Solution

  • In case you want to drop all lines that start with a + and end with a number:

    dd <- c(
     "See our simple, animated definitions of types of corruption and the ways to challenge it."
    , "Judiciary - Commenting on Justice Bean’s sentencing in the BAE Systems’ Tanzania case, Transparency International UK welcomed the Judge’s stringent remarks concerning BAE Systems’ past conduct."
    ," "
    , "+49 30 3438 20 666")
    
    # Keep only the elements that do not start with "+" and end with a digit.
    cleaned <- dd[!grepl("^\\+.*\\d$", dd)]
    

    To be on the safe side, if all numbers share the same format, you can match it exactly with \\s (a single whitespace character) and \\d{2} (exactly two digits). Note that you can also use the pattern in str_remove, but the matching element is then left behind as an empty string rather than dropped. grepl instead returns a logical vector that you can use to subset the character vector.

    If you also want to delete all empty lines:

    dd[!grepl("^\\s*$",dd)]
    

    Note that you can do both at the same time by using "|":

    dd[!grepl("^\\+.*\\d$|^\\s*$", dd)]
    

    You can get familiar with regex here: https://regex101.com/
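
For the exact-match variant mentioned above, here is a minimal sketch in base R. It assumes every phone number follows the same "+49 30 3438 20 666" layout (2, 2, 4, 2 and 3 digits separated by single spaces). The names phone_pattern, blank_pattern, cleaned and blanked are only illustrative, the second sample string is shortened, and sub() stands in for str_remove so the example runs without stringr.

```r
# Sample vector modelled on the question (second element shortened).
dd <- c(
  "See our simple, animated definitions of types of corruption and the ways to challenge it.",
  "Judiciary - Commenting on the sentencing in the Tanzania case.",
  " ",
  "+49 30 3438 20 666"
)

# Assumed fixed layout: "+", then digit groups of 2, 2, 4, 2 and 3,
# each separated by exactly one whitespace character.
phone_pattern <- "^\\+\\d{2}\\s\\d{2}\\s\\d{4}\\s\\d{2}\\s\\d{3}$"

# Empty or whitespace-only elements.
blank_pattern <- "^\\s*$"

# grepl() returns one TRUE/FALSE per element; negate it to keep the rest.
is_phone <- grepl(phone_pattern, dd)
is_blank <- grepl(blank_pattern, dd)
cleaned <- dd[!(is_phone | is_blank)]

# Replacing the match instead (what str_remove would do) keeps the element
# and leaves an empty string in its place.
blanked <- sub(phone_pattern, "", dd)

length(cleaned)  # 2
length(blanked)  # 4
```

The stricter pattern will not drop a regular sentence that merely starts with "+" and ends with a digit, at the cost of missing numbers written in a different layout. If the layout varies between pages, the looser pattern from the answer is the more practical choice.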