
R - subscript out of bounds with stringr and grep in lapply function


My goal is to extract strings of digits with fixed length (8 digits) from multiple text files in several folders after matching a certain pattern.

I have spent the whole day trying to build an lapply function out of it, so that all the files (up to 20) in the subdirectories can be processed automatically, but I have failed. The workflow code runs, but because of my limited knowledge of R it only handles one file.

In between the lines with numbers there is one string per file, each different, which I want to extract. The output of the string extraction should be stored folderwise.

The strings have the following structure: String[one or two digits]_[eight digits] . For example, String1_20220101 or String12_20220108. I want to extract the part after the underscore.

The text files are structured as follows, each with over 10000 rows.

Example for file 1:

     X1  X2
1 1000 100
2 1050 100
3 1100 100
4 1150 100
5 1200 100
6 String1_20220101
7 1250 100
8 1300 100
9 1350 100
10 1400 100

x1 <- list(c(seq(1000,1400, by=50)))
[1] 1000 1050 1100 1150 1200 1250 1300 1350 1400

x2 <- list(c(rep(100, 9)))
[1] 100 100 100 100 100 100 100 100 100

File 2:

   x1     x2
1 2000  200
2 3000  200
3 4000  200
4 5000  200
5 6000  200
6 7000  200
7 String12_20220108
8 8000  200
9 9000  200
10 10000 200


x1 <- list(c(seq(1000,10000,by=1000)))
[1]  1000  2000  3000  4000  5000  6000  7000  8000  9000 10000

x2 <- list(c(rep(200, 9)))
[1] 200 200 200 200 200 200 200 200 200


The files lie in numbered folders and derive their name from the folder number and belong to one observation.

My code for folder 1:

library(stringr)

Folderno1 <- list.files(path = "path/to/file/1/",
pattern = "*.txt",
full.names = TRUE)

FUN <- function(Folder1) {
folder_input <- readLines(Folderno1)
string <- grep("String[0-9]_", folder_input, value = TRUE)
output <- capture.output(as.numeric(str_extract_all(string, "(?<=[0-9]{1,2}_)[0-9]+")[[1]]))
write(output, file="/pathtofile/String1.tex")
}

lapply(Folderno1, FUN)

Error in str_extract_all(string, "(?<=[0-9]{1,2}_)[0-9]+")[[1]] : 
subscript out of bounds

The above error message appears. The file String1.tex is still written despite the error, but it contains only one result:

[1] 20220101

The rerun with debug shows:

function (x) 
.Internal(withVisible(x))

Could you please guide me on how the workflow can be changed so that every file is processed? I cannot get my head around it.

Thank you.


Solution

  • You are overwriting the same file on every call of the function (write(output, file="/pathtofile/String1.tex")). You probably want to create a new .tex file for every .txt file.

    From the error message, I think some files do not contain the pattern we are looking for (String[0-9]_). String[0-9]_ does not match two-digit numbers such as String12_20220108, so for such a file grep returns nothing and [[1]] fails on the empty result. I have changed the pattern to String[0-9]+_. To be on the safe side, I have also added an if condition that checks the length of string before anything is written. Note that the function now reads its argument Folder1 rather than the whole vector Folderno1.

    Try this solution -

    Folderno1 <- list.files(path = "path/to/file/1/",
                            pattern = "\\.txt$",
                            full.names = TRUE)
    
    FUN <- function(Folder1) {
      #Read the file
      folder_input <- readLines(Folder1)
      #Extract the line which has "String" in it
      string <- grep("String[0-9]+_", folder_input, value = TRUE)
      #If such line exists
      if(length(string)) {
        #Remove everything till underscore to get 8-digit number
        output <- sub('.*_', '', string)
        #Remove everything after underscore to get "String1", "String12"
        out <- sub('_.*', '', string)
        #Write the output
        write(output, file= paste0('/pathtofile/', out, '.tex'))
      }
    }
    
    lapply(Folderno1, FUN)
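
The question also asks for several numbered folders, with the results stored per folder. The following is only a rough sketch of how that could look, not a tested drop-in for the real data. The helper extract_date(), the temporary directory tree, the file names a.txt, b.txt and c.txt, and the output name dates.tex are all made up for illustration. It uses base R only, returns NA for files without a match instead of indexing an empty result, and writes one output file per folder.

```r
# Hypothetical helper: return the 8 digits after "String<digits>_",
# or NA if the file contains no such line.
extract_date <- function(lines) {
  hit <- grep("String[0-9]+_[0-9]{8}", lines, value = TRUE)
  if (length(hit) == 0) return(NA_character_)  # guard against an empty result
  sub(".*String[0-9]+_([0-9]{8}).*", "\\1", hit[1])
}

# Throwaway directory tree that mimics numbered folders (assumed layout).
root <- file.path(tempdir(), "observations")
dir.create(file.path(root, "1"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(root, "2"), recursive = TRUE, showWarnings = FALSE)
writeLines(c("1000 100", "String1_20220101", "1250 100"),
           file.path(root, "1", "a.txt"))
writeLines(c("7000 200", "String12_20220108", "8000 200"),
           file.path(root, "2", "b.txt"))
writeLines(c("1 2", "3 4"), file.path(root, "2", "c.txt"))  # no match here

# Process every folder; keep one vector of extracted strings per folder.
folders <- list.dirs(root, recursive = FALSE)
results <- lapply(folders, function(folder) {
  files <- list.files(folder, pattern = "\\.txt$", full.names = TRUE)
  dates <- vapply(files, function(f) extract_date(readLines(f)), character(1))
  dates <- unname(dates[!is.na(dates)])              # drop files without a match
  writeLines(dates, file.path(folder, "dates.tex"))  # one output file per folder
  dates
})
names(results) <- basename(folders)
```

Here results is a list named after the folders, so results[["2"]] holds the strings found in folder 2. For the real data, root would point at the directory that contains the numbered folders, and the output path can be changed as needed.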