
How to convert a complex text document to a single character string


I have a manuscript and would like to extract all citations from it using regex. Working on a test sample from the manuscript, I've developed a regex (see the question "Regex in R: How to extract citations from manuscript"). It works flawlessly on the sample, called samp:

str_extract_all(samp, "\\([A-Za-z][^)]*\\d{4};|;\\s[A-Za-z][^)]*\\d{4}\\)|\\([A-Za-z][^)]*\\d{4}.*?\\)|\\b[A-Z][a-z].*\\([^A-Za-z)]\\w.*?\\)|\\b[A-Z][a-z].*\\(forthcoming\\)|\\b[A-Z][a-z].*\\(in preparation\\)|\\([A-Za-z][^);]*\\d{4}|(?<=;\\s)[A-Za-z][^);]*\\d{4}")

However, the regex does not work well on the actual manuscript (which is larger and may have a more complex internal structure than the sample) because, unlike the sample, I have not managed to convert the manuscript into a single, coherent character string.

I've tried to read in the document like this:

manuscript <- read.table([my path], header = FALSE, sep = "\n", fill = FALSE, stringsAsFactors = FALSE, strip.white = TRUE)

and I've used paste to fuse it all together:

paste0(manuscript$V1, collapse = "")

but the resulting object still has internal divisions that prevent the regex from working seamlessly on the whole document.

So how can the manuscript be read in or post-processed so that it forms a single uninterrupted string of characters?

Help with this question is much appreciated.


Solution

  • We can use readLines to get the content of the file as a character vector with one element per line, which we then collapse into a single uninterrupted string.

    manuscript <- paste0(readLines(path_to_file), collapse = "")
    

    Depending on the content of the file, we may want to do some pre-processing before extracting the information. But this should give us the string in the same form as the sample in the question linked in the post.
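As a rough illustration of the read-and-collapse step plus some light pre-processing, here is a minimal sketch. The temporary file and its contents are hypothetical stand-ins for the real manuscript. It also collapses with a single space rather than "", which is an assumption on my part: it keeps the last word of one line from being fused with the first word of the next.

```r
# Hypothetical stand-in for the manuscript: a small temp file with a blank
# line and a citation that wraps across two lines.
path_to_file <- tempfile(fileext = ".txt")
writeLines(c("As argued earlier (Smith",
             "2010; Jones 2012), the effect holds.",
             "",
             "  See also Miller (2015)."),
           path_to_file)

# readLines() returns a character vector with one element per line.
lines <- readLines(path_to_file, warn = FALSE)

# Collapse with a space (an assumption; the answer above uses "") so that
# words at line boundaries are not fused together.
manuscript <- paste(lines, collapse = " ")

# Optional pre-processing: squeeze runs of whitespace into one space
# and trim the ends.
manuscript <- trimws(gsub("\\s+", " ", manuscript))

# A single string is a character vector of length 1.
length(manuscript)
```

If length(manuscript) is 1, the object is one string, and it can be passed to str_extract_all in place of samp. Whether a space or an empty string is the right separator depends on how the lines in the actual file are broken.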