My goal is to extract fixed-length digit strings (8 digits) from multiple text files in several folders, after matching a certain pattern.
I have spent the whole day trying to build an lapply function for this, so that all the files (up to 20) in the subdirectories can be processed automatically, but without success. The workflow code runs, but due to my limited knowledge of R it only handles one file.
Among the lines of numbers, each file contains one string, different in every file, which I want to extract. The extracted strings should be stored per folder.
The strings have the following structure: String[one or two digits]_[eight digits]. For example, String1_20220101 or String12_20220108. I want to extract the part after the underscore.
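To make the target concrete, here is a minimal base R sketch of the extraction on the two example strings alone. The vector name `labels` and the anchored pattern are illustrative assumptions, not part of the workflow further down.

```r
# Illustrative input: the two example strings from the description
labels <- c("String1_20220101", "String12_20220108")

# Capture the eight digits after the underscore and drop the rest.
# The anchored pattern assumes the whole element is just the label.
dates <- sub("^String[0-9]{1,2}_([0-9]{8})$", "\\1", labels)
dates
```

The result is a character vector; wrap it in as.numeric() if numbers are needed.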
Each text file has over 10000 rows and is structured as follows.
Example for file 1:
X1 X2
1 1000 100
2 1050 100
3 1100 100
4 1150 100
5 1200 100
6 String1_20220101
7 1250 100
8 1300 100
9 1350 100
10 1400 100
x1 <- list(c(seq(1000,1400, by=50)))
[1] 1000 1050 1100 1150 1200 1250 1300 1350 1400
x2 <- list(c(rep(100, 9)))
[1] 100 100 100 100 100 100 100 100 100
File 2:
x1 x2
1 2000 200
2 3000 200
3 4000 200
4 5000 200
5 6000 200
6 7000 200
7 String12_20220108
8 8000 200
9 9000 200
10 10000 200
x1 <- list(c(seq(1000,10000,by=1000)))
[1] 1000 2000 3000 4000 5000 6000 7000 8000 9000 10000
x2 <- list(c(rep(200, 9)))
[1] 200 200 200 200 200 200 200 200 200
The files lie in numbered folders, take their names from the folder number, and belong to one observation.
My code for folder 1:
library(stringr)
Folderno1 <- list.files(path = "path/to/file/1/",
                        pattern = "*.txt",
                        full.names = TRUE)
FUN <- function(Folder1) {
  folder_input <- readLines(Folderno1)
  string <- grep("String[0-9]_", folder_input, value = TRUE)
  output <- capture.output(as.numeric(str_extract_all(string, "(?<=[0-9]{1,2}_)[0-9]+")[[1]]))
  write(output, file="/pathtofile/String1.tex")
}
lapply(Folderno1, FUN)
Error in str_extract_all(string, "(?<=[0-9]{1,2}_)[0-9]+")[[1]] :
subscript out of bounds
The above error message appears. The file String1.tex is still overwritten despite the error, but it contains only one result:
[1] 20220101
The rerun with debug shows:
function (x)
.Internal(withVisible(x))
Could you please guide me on how to change the workflow so that every file is processed? I cannot get my head around it.
Thank you.
You are overwriting the same file every time (write(output, file="/pathtofile/String1.tex")) in the function. Probably, you want to create a new .tex file for every .txt file.
Judging from the error message, some files do not contain the pattern being searched for (String[0-9]_). String[0-9]_ does not match two-digit labels such as String12_20220108, so I have changed it to String[0-9]+_. The function also has to read its own argument (readLines(Folder1)) rather than the whole vector Folderno1. To be on the safe side, I have also added an if condition that checks whether a matching line was found before writing.
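The difference between the two patterns, and why an empty match leads to the error, can be reproduced on a few made-up lines. This is a small sketch in base R only; the vector `lines` and the other variable names are assumptions for illustration.

```r
# Made-up lines resembling file 2
lines <- c("1000 100", "String12_20220108", "8000 200")

# "String[0-9]_" needs exactly one digit before the underscore,
# so the two-digit label "String12_" is not matched.
one_digit <- grep("String[0-9]_", lines, value = TRUE)
length(one_digit)

# "String[0-9]+_" accepts one or more digits.
any_digits <- grep("String[0-9]+_", lines, value = TRUE)
any_digits

# Taking [[1]] of an empty list raises an error, which is what happens
# when nothing was matched; tryCatch() captures the message here.
empty <- list()
res <- tryCatch(empty[[1]], error = function(e) conditionMessage(e))
res
```

Checking length() of the grep() result before extracting avoids indexing into an empty result.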
Try this solution:
# list.files() expects a regular expression, not a glob
Folderno1 <- list.files(path = "path/to/file/1/",
                        pattern = "\\.txt$",
                        full.names = TRUE)
FUN <- function(Folder1) {
  #Read the file
  folder_input <- readLines(Folder1)
  #Extract the line which has "String" in it
  string <- grep("String[0-9]+_", folder_input, value = TRUE)
  #If such line exists
  if(length(string)) {
    #Remove everything till underscore to get 8-digit number
    output <- sub('.*_', '', string)
    #Remove everything after underscore to get "String1", "String12"
    out <- sub('_.*', '', string)
    #Write the output
    write(output, file= paste0('/pathtofile/', out, '.tex'))
  }
}
lapply(Folderno1, FUN)
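If the results should be collected per folder rather than per file, the same idea can be wrapped in a second lapply over the subfolders. The following is only a sketch under assumptions: extract_id() and process_folders() are hypothetical helper names, the numbered folders are assumed to sit directly under one root directory, and one .tex file named after each folder is written to an existing output directory.

```r
# Hypothetical helper: returns the 8-digit strings found in one file
extract_id <- function(file) {
  lines <- readLines(file, warn = FALSE)
  # Keep only lines that contain "String<digits>_<8 digits>"
  hits <- grep("String[0-9]+_[0-9]{8}", lines, value = TRUE)
  sub(".*String[0-9]+_([0-9]{8}).*", "\\1", hits)
}

# Hypothetical helper: processes every .txt file in each direct subfolder
# of `root` and writes one result file per folder into `out_dir`
process_folders <- function(root, out_dir) {
  folders <- list.dirs(root, full.names = TRUE, recursive = FALSE)
  results <- lapply(folders, function(folder) {
    files <- list.files(folder, pattern = "\\.txt$", full.names = TRUE)
    ids <- unlist(lapply(files, extract_id))
    # Write only when at least one string was found in this folder
    if (length(ids)) {
      writeLines(ids, file.path(out_dir, paste0(basename(folder), ".tex")))
    }
    ids
  })
  names(results) <- basename(folders)
  results
}

# Usage with the placeholder paths from the question (out_dir must exist):
# results <- process_folders("path/to/file", "/pathtofile")
```

The function returns a named list with one element per folder, so the extracted strings can also be inspected in R without opening the written files.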