r pdf data-extraction

Unable to do a for loop in R


Hi, I have a number of PDF files saved in one folder. Every PDF file contains several currency values starting with $, and I want to extract the first currency value in each file. I am able to do this for a single file, but not when looping through several files, where I want to get the output from each file, like $xxx,xxx,xxx $xxx,xxx,xxx $xxx,xxx,xxx. This is the code I am using for a single file:

    library(pdftools)
    library(stringr)

    text_data <- pdf_text('Sample2.pdf')
    text_collapsed_data <- paste0(text_data, collapse = '\n')
    k=str_extract_all(text_collapsed_data, "\\$\\d+(?:,\\d+)(?:,\\d+)")[[1]]
    k[1]

This is the code I am using to loop over multiple files:

    files <- list.files(pattern = "pdf$")
    for (i in 1:length(files)){
      print(i)
      pdf_text(paste(str_extract_all("~filepath/desktop",files[i], "\\$\\d+(?:,\\d+)(?:,\\d+)")[[1]]))
    }

I am getting the error "subscript out of bounds". Please let me know where I am going wrong.


Solution

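  • The loop never reads the PDFs: it passes the file name to str_extract_all() as the pattern and wraps pdf_text() around the result, instead of calling pdf_text() on each file first and extracting from its text. Put your working single-file code in a function and apply it to every file:

    library(pdftools)  # pdf_text()
    library(stringr)   # str_extract_all()

    # Return the first currency value found in one PDF
    myextr <- function(pdffile) {
      text_data <- pdf_text(pdffile)
      text_collapsed_data <- paste0(text_data, collapse = '\n')
      k <- str_extract_all(text_collapsed_data, "\\$\\d+(?:,\\d+)(?:,\\d+)")[[1]]
      k[1]
    }
    files <- list.files(pattern = "pdf$")
    sapply(files, myextr)  # named character vector, one value per file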
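
    If you would rather keep an explicit for loop, a minimal sketch could look like the one below. The helper names first_currency and extract_all and the reader argument are hypothetical, introduced only for illustration; reader defaults to pdftools::pdf_text and exists so the loop can be tried on plain text without real PDFs. It uses str_extract(), which returns only the first match (or NA), in place of str_extract_all(...)[[1]][1].

```r
library(stringr)

# Pattern from the question: "$" followed by three comma-separated digit groups.
currency_pattern <- "\\$\\d+(?:,\\d+)(?:,\\d+)"

# Hypothetical helper: first currency value in one document.
# `pages` is a character vector with one element per page, as pdf_text() returns.
first_currency <- function(pages) {
  # str_extract() gives the first match, or NA when nothing matches
  str_extract(paste0(pages, collapse = "\n"), currency_pattern)
}

# Hypothetical helper: loop over the files and collect one value per file.
# `reader` is an assumption added for illustration; by default it is pdf_text().
extract_all <- function(files, reader = pdftools::pdf_text) {
  result <- character(length(files))  # preallocate the output
  names(result) <- files
  for (i in seq_along(files)) {       # seq_along() is safe when `files` is empty
    result[i] <- first_currency(reader(files[i]))
  }
  result
}

files <- list.files(pattern = "pdf$")
extract_all(files)
```

    Note that seq_along(files) is used instead of 1:length(files): when the folder contains no matching files, 1:length(files) evaluates to c(1, 0) and the loop body still runs, whereas seq_along(files) yields an empty sequence.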