I am using a PDF tool to extract data from a scanned file by converting it to PNG first. Because the tool reads from the PNG, some stray punctuation marks show up in the output. I can remove most of them, except for "|".
Here is my data:
c("| January 2,310,501 2,342,654 + 14%", "| February 2,221,036 2,316,278 + 4.3%", )
I want my data to look like this:
c("January 2,310,501 2,342,654 + 14%", "February 2,221,036 2,316,278 + 4.3%",)
As you can see from the attached picture, the "|" has changed the structure of my data, so I cannot simply read the values from the second column. I want to remove the "|" element entirely so that the remaining elements shift forward. The file is also attached. Thank you for your help.
You could use lapply to remove the elements that are "|".
lapply(test2, function(x) x[x != '|'])
#[[1]]
#[1] "January" "test"
#[[2]]
#[1] "February" "2, 602,33"
Similarly, using map from purrr:
purrr::map(test2, ~.x[.x != '|'])
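For reference, here is a minimal self-contained sketch of the list case. It defines test2 up front and wraps the filter in a helper; the name drop_pipes is hypothetical, and the trimws call is an added assumption that an OCR'd "|" might come with surrounding spaces.

```r
# Sample list from this answer: each element is a character vector
# that may contain a stray "|" entry.
test2 <- list(c("|", "January", "test"), c("February", "2, 602,33", "|"))

# drop_pipes is a hypothetical helper name (not from any package).
# trimws() is an added assumption so that entries like " | " are dropped too.
drop_pipes <- function(x) x[trimws(x) != "|"]

# Apply the helper to every vector in the list.
result <- lapply(test2, drop_pipes)
print(result)
```

Because the "|" entries are dropped rather than blanked, the remaining elements of each vector move forward, which is the behaviour the question asks for.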
For the updated data, where each row is a single string, we can use gsub:
test <- trimws(gsub('\\|', '', test))
test
# [1] "January 2,310,501 2,342,654 + 14%" "February 2,221,036 2,316,278 + 4.3%"
# [3] "March 2,602,503 2,571,661 ( -1.2% )" "April 2,471,788 2,485,989 i 0.6%"
# [5] "May 2,418,547 2,512,922 + 3.9%" "June 2,412,882 2,430,232 + 0.7%"
# [7] "July 2,462,907 2,535,594 + 3.0%" "August 2,526,211 2,638,753 + 4.5%"
# [9] "September 2,434,132 2,480,466 * + 1.9%" "October 2,552,215 2,642,990 * + 3.6%"
#[11] "November 2,306,106 2,428,806 + 5.3%" "December _ 2,283,294 2,250,016 ( -1.5% )"
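The same idea can be checked on just the two strings shown in the question. This is a sketch, not the only way to do it: the first variant uses fixed = TRUE so that "|" is matched literally without escaping, and the second strips only a leading "|", on the assumption that the stray character always appears at the start of the string.

```r
# The two sample strings copied from the question.
test <- c("| January 2,310,501 2,342,654 + 14%",
          "| February 2,221,036 2,316,278 + 4.3%")

# fixed = TRUE treats "|" as a literal character, so no regex escaping is needed.
# trimws() then removes the space left behind at the start.
cleaned <- trimws(gsub("|", "", test, fixed = TRUE))

# Alternative: remove only a leading "|" together with the whitespace after it.
cleaned_leading <- sub("^\\|[[:space:]]*", "", test)

print(cleaned)
print(cleaned_leading)
```

The anchored version leaves any "|" that occurs later in a string untouched, which may or may not be what you want for the rest of the scan.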
data
test2 <- list(c('|', 'January', 'test'), c('February', '2, 602,33', '|'))