I have a manuscript and would like to extract all citations from it using regex. Working on a test sample from the manuscript, I've developed a regex (see here: Regex in R: How to extract citations from manuscript). It works flawlessly on the sample, called samp:
str_extract_all(samp, "\\([A-Za-z][^)]*\\d{4};|;\\s[A-Za-z][^)]*\\d{4}\\)|\\([A-Za-z][^)]*\\d{4}.*?\\)|\\b[A-Z][a-z].*\\([^A-Za-z)]\\w.*?\\)|\\b[A-Z][a-z].*\\(forthcoming\\)|\\b[A-Z][a-z].*\\(in preparation\\)|\\([A-Za-z][^);]*\\d{4}|(?<=;\\s)[A-Za-z][^);]*\\d{4}")
BUT: the regex does not work well on the actual manuscript (which is obviously larger and may feature a more complex internal structure than the sample) because, unlike the sample, I cannot convert the manuscript into a single, coherent character string.
I've tried to read in the document thus:
read.table([my path], header = F, sep = "\n", fill = F, stringsAsFactors = F, strip.white = T)
and I've used paste0 to fuse it all together:
paste0(manuscript$V1, collapse = "")
but the resulting object still has internal divisions that prevent the regex working seamlessly on the whole document.
So how can the manuscript be read in or post-processed so that it constitutes a single uninterrupted string of characters?
Help with this question is much appreciated.
We can use readLines to get the content of the file as a character vector with one element per line, which we in turn collapse into a single uninterrupted string.
manuscript <- paste0(readLines(path_to_file), collapse= "")
Depending on the content of the file, we may want to do some pre-processing before extracting the information. But this should get us the string in the same form as the sample in the question you linked in the post.
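As a rough illustration of the read-and-collapse step plus one possible bit of pre-processing, here is a minimal sketch. The temporary file and its two sentences are made up to stand in for the manuscript, and collapsing with a space rather than an empty string is my own assumption: it keeps the last word of one line from fusing with the first word of the next, which may or may not suit your regex.

```r
# Hypothetical stand-in for the manuscript: a small temp file with a
# trailing-space line and an empty line, as real documents often have.
path_to_file <- tempfile(fileext = ".txt")
writeLines(c("As shown by Smith (2010),  ",
             "",
             "the effect is robust (Jones 2012; Lee 2015)."),
           path_to_file)

# readLines() returns a character vector, one element per line
lines <- readLines(path_to_file)

# Collapse into one string; " " (an assumption, not the "" used above)
# keeps words at line boundaries apart
manuscript <- paste0(lines, collapse = " ")

# Optional pre-processing: squeeze runs of whitespace and trim the ends
manuscript <- trimws(gsub("\\s+", " ", manuscript))

length(manuscript)                      # a single string
grepl("\n", manuscript, fixed = TRUE)   # no line breaks left
```

The resulting manuscript object can then be passed to str_extract_all in place of samp.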