I am scraping https://www.transparency.org/news/pressreleases/year/2010 to retrieve the header and details from each page. Along with the header and details, however, a telephone number and a blank string also appear in the retrieved list for every page.
[1] "See our simple, animated definitions of types of corruption and the ways to challenge it."
[2] "Judiciary - Commenting on Justice Bean’s sentencing in the BAE Systems’ Tanzania case, Transparency International UK welcomed the Judge’s stringent remarks concerning BAE Systems’ past conduct."
[3] " "
[4] "+49 30 3438 20 666"
I tried the following code, but it did not work:
html %>% str_remove('+49 30 3438 20 666') %>% str_remove(' ')
How can these elements be removed?
If you want to drop every element that starts with a + and ends with a digit:
dd <- c(
  "See our simple, animated definitions of types of corruption and the ways to challenge it.",
  "Judiciary - Commenting on Justice Bean’s sentencing in the BAE Systems’ Tanzania case, Transparency International UK welcomed the Judge’s stringent remarks concerning BAE Systems’ past conduct.",
  " ",
  "+49 30 3438 20 666"
)
# \\d$ requires a digit at the end; \\d*$ would also match zero digits, i.e. any line starting with "+"
res <- dd[!grepl("^\\+.*\\d$", dd)]
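To be on the safe side, if all numbers share the same format, you can also use \\s (a single whitespace character) and \\d{2} (exactly two digits) to build an exact match. Note that you can use the same pattern in str_remove, but the + must be escaped as \\+ there as well, because an unescaped + is a regex quantifier rather than a literal character; this is why the attempt in the question fails. With str_remove the element stays in the vector as an empty string, whereas grepl returns a logical vector that you can use to subset your vector and drop the element entirely.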
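As a rough sketch of the exact-match idea, the pattern below spells out the layout of the sample number digit group by digit group. The shortened strings in dd are stand-ins for the scraped text, and the assumption that every number follows this 2-2-4-2-3 layout may not hold for other pages. It uses base R sub() instead of str_remove to show that removing the match leaves an empty string behind, while subsetting with grepl() drops the element.

```r
# Stand-in data with the same shape as the scraped vector in the question.
dd <- c("Header text", "Judiciary - details text", " ", "+49 30 3438 20 666")

# Layout of the sample number: "+", 2 digits, then groups of 2, 4, 2 and 3
# digits, each separated by exactly one whitespace character.
phone <- "^\\+\\d{2}\\s\\d{2}\\s\\d{4}\\s\\d{2}\\s\\d{3}$"

# Removing the match keeps the element in the vector as an empty string.
removed <- sub(phone, "", dd)

# Subsetting with grepl() drops the element altogether.
kept <- dd[!grepl(phone, dd)]
```

Here removed still has four elements, so the empty string would have to be filtered out in a second step, whereas kept has only three.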
If you also want to delete all empty or whitespace-only elements:
dd[!grepl("^\\s*$",dd)]
Note that you can do both at the same time by using "|":
dd[!grepl("^\\+.*\\d$|^\\s*$", dd)]
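If the same clean-up has to run on every scraped page, the two conditions could be wrapped in a small function. This is only a sketch: drop_noise() is a hypothetical name, the strings in dd are shortened stand-ins for the scraped text, and the phone pattern uses \\d$ so that an element really has to end in a digit to be dropped.

```r
# Stand-in data with the same shape as the scraped vector in the question.
dd <- c("Header text", "Judiciary - details text", " ", "+49 30 3438 20 666")

# drop_noise() is a hypothetical helper name, not part of any package.
drop_noise <- function(x) {
  is_phone <- grepl("^\\+.*\\d$", x)  # starts with "+" and ends with a digit
  is_blank <- grepl("^\\s*$", x)      # empty or whitespace only
  x[!(is_phone | is_blank)]
}

cleaned <- drop_noise(dd)

# An element that starts with "+" but ends in text is kept.
kept_text <- drop_noise("+ more text")
```

Keeping the two tests as separate logical vectors makes it easy to adjust one condition without touching the other.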
You can get familiar with regex here: https://regex101.com/