I extracted the text from PDF files and created a corpus object.
Within the texts, some lines end with "," or "-", and I would like to append the following line to each of them, because it belongs to the same sentence.
For instance, I have:
[1566] "this and other southeastern states (Eukerria saltensis,"
[1567] "Sparganophilus helenae, Sp. tennesseensis). In the"
And I would like to have this instead:
[1566] "this and other southeastern states (Eukerria saltensis, Sparganophilus helenae, Sp. tennesseensis). In the"
I tried things like replacing line breaks, but without success:
tm_map(myCorpus, content_transformer(gsub), pattern = ",$\n", replacement = "")
Any idea on how I can do this in R?
Thanks, it does work!
I had to put it in a function to make it work with tm_map, though:
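clean.X <- function(X) {
  # Collapse all lines of the document into one string, separated by newlines
  X2 <- paste0(X, collapse = "\n")
  # Comma at the end of a line: join with the next line, keeping one space
  X2 <- gsub(",\\n", ", ", X2)
  # Hyphen at the end of a line: join with the next line, without a space
  X2 <- gsub("\\-\\n", "-", X2)
  # Split the string back into a character vector of lines
  X2 <- unlist(strsplit(X2, "\\n"))
  return(X2)
}
# Apply the function to the content of every document in the corpus
txt2 <- tm_map(txt, content_transformer(clean.X))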
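
For anyone who wants to check the joining logic without a corpus, here is a minimal sketch in base R only. The helper name join_continued_lines and the third sample line are made up for illustration, and it uses fixed = TRUE instead of the escaped regular expressions above, which should be equivalent for these literal patterns.

```r
# Hypothetical helper: same idea as clean.X, restated so the snippet runs on its own.
join_continued_lines <- function(x) {
  # Collapse the lines into one string so line breaks become visible to gsub
  joined <- paste0(x, collapse = "\n")
  # Comma followed by a line break: join, keeping one space after the comma
  joined <- gsub(",\n", ", ", joined, fixed = TRUE)
  # Hyphen followed by a line break: join without adding a space
  joined <- gsub("-\n", "-", joined, fixed = TRUE)
  # Split back into one element per remaining line
  unlist(strsplit(joined, "\n", fixed = TRUE))
}

# The first two lines come from the question; the third is an invented filler line.
lines <- c(
  "this and other southeastern states (Eukerria saltensis,",
  "Sparganophilus helenae, Sp. tennesseensis). In the",
  "next sentence starts here."
)

result <- join_continued_lines(lines)
print(result)
```

The collapse step is what matters: in a character vector each line is a separate element, so a pattern containing "\n" has nothing to match until the lines are pasted into a single string.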