r tesseract pdftools

How to Remove "|" Without Leaving Space from the List in R


I am using pdftools to extract data from a scanned file, converting it to PNG first. Because the text is read from the PNG, some stray punctuation marks show up in the output. I can remove most of them, but not "|".

Here is my data:

c("| January 2,310,501 2,342,654 + 14%", "| February 2,221,036 2,316,278 + 4.3%")

I want my data to look like this:

c("January 2,310,501 2,342,654 + 14%", "February 2,221,036 2,316,278 + 4.3%")

As the attached picture shows, the "|" has changed the structure of my data, so I cannot simply read the values from the second column. I want to remove the "|" element entirely so that the remaining elements shift forward. The file is also attached. Thank you for your help.


Solution

  • You could use lapply to drop the elements that are exactly "|":

    lapply(test2, function(x) x[x != '|'])
    
    #[[1]]
    #[1] "January" "test"   
    
    #[[2]]
    #[1] "February"  "2, 602,33"
    

    Similarly, using map from purrr:

    purrr::map(test2,  ~.x[.x != '|'])
    

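    For the updated data, where "|" is part of each string rather than a separate element, we can use gsub to delete it and trimws to strip the leftover space: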

    test <- trimws(gsub('\\|', '', test))
    test
    
    # [1] "January 2,310,501 2,342,654 + 14%"        "February 2,221,036 2,316,278 + 4.3%"     
    # [3] "March 2,602,503 2,571,661 ( -1.2% )"      "April 2,471,788 2,485,989 i 0.6%"        
    # [5] "May 2,418,547 2,512,922 + 3.9%"           "June 2,412,882 2,430,232 + 0.7%"         
    # [7] "July 2,462,907 2,535,594 + 3.0%"          "August 2,526,211 2,638,753 + 4.5%"       
    # [9] "September 2,434,132 2,480,466 * + 1.9%"   "October 2,552,215 2,642,990 * + 3.6%"    
    #[11] "November 2,306,106 2,428,806 + 5.3%"      "December _ 2,283,294 2,250,016 ( -1.5% )"
    

    Data used for the lapply and map examples:

    test2 <- list(c('|', 'January', 'test'), c('February', '2, 602,33', '|'))
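
As a self-contained check of the gsub approach, here is a minimal sketch that uses only the two rows quoted in the question. It passes fixed = TRUE so that "|" is matched literally, which is an alternative to escaping it as '\\|'. The follow-up step that pulls out the second column is an assumption about what the asker wants to do next, and the names cleaned, parts, second_col and values are hypothetical.

```r
# Sample strings copied from the question (only the two rows shown there).
test <- c("| January 2,310,501 2,342,654 + 14%",
          "| February 2,221,036 2,316,278 + 4.3%")

# fixed = TRUE treats "|" as a literal character, so no regex escaping is needed.
# trimws() then removes the space that the deleted "|" leaves at the start.
cleaned <- trimws(gsub("|", "", test, fixed = TRUE))
print(cleaned)

# Assumed follow-up: split each row on whitespace and take the second field.
parts <- strsplit(cleaned, "\\s+")
second_col <- vapply(parts, function(p) p[2], character(1))

# Drop the thousands separators before converting to numbers.
values <- as.numeric(gsub(",", "", second_col, fixed = TRUE))
print(values)
```

Splitting on whitespace only works while each row has the same number of fields before the column of interest. Rows with extra OCR debris, such as the "_" in the December row above, would need further cleaning first.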