Tags: r, text-mining, tm

tm_map merging lines on condition


I extracted the text from PDF files and created a corpus object.

Within the texts, some lines end with "," or "-", and I would like to append the following line to each of them, because it belongs to the same sentence.

For instance I have

[1566] "this and other southeastern states (Eukerria saltensis,"      
[1567] "Sparganophilus helenae, Sp. tennesseensis). In the" 

And I would like to have instead

[1566] "this and other southeastern states (Eukerria saltensis, Sparganophilus helenae, Sp. tennesseensis). In the" 

I tried things like replacing line breaks, but without success:

tm_map(myCorpus, content_transformer(gsub), pattern =",$\n",replacement = "")

Any idea on how I can do this in R?


Solution

  • Thanks, it does work!

    I had to put it in a function to make it work with tm_map, though:

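    # Merge each line ending in "," or "-" with the line that follows it.
    clean.X <- function(X) {
      # Collapse all lines into one string so gsub() can see the line breaks
      X2 <- paste0(X, collapse = "\n")
      # Trailing comma: join the next line with a space
      X2 <- gsub(",\\n", ", ", X2)
      # Trailing hyphen: join the next line directly
      X2 <- gsub("-\\n", "-", X2)
      # Split back into one element per remaining line
      X2 <- unlist(strsplit(X2, "\\n"))
      return(X2)
    }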
    
    txt2 <- tm_map(txt, content_transformer(clean.X))
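
For anyone who wants to check the idea without a corpus, here is a minimal base-R sketch using the two sample lines from the question. The helper name `merge_continued` and the variable names are made up for illustration; it is just a compact restatement of the collapse, substitute, split approach above, not a tm function. It also hints at a likely reason the attempt in the question had no effect: when the content is a vector of lines, gsub runs on each line separately, and no single line contains a newline to match.

```r
# Hypothetical helper (not part of tm): collapse, substitute, split again.
merge_continued <- function(lines) {
  joined <- paste(lines, collapse = "\n")
  joined <- gsub(",\n", ", ", joined, fixed = TRUE)  # comma at end of a line
  joined <- gsub("-\n", "-", joined, fixed = TRUE)   # hyphen at end of a line
  unlist(strsplit(joined, "\n", fixed = TRUE))
}

sample_lines <- c(
  "this and other southeastern states (Eukerria saltensis,",
  "Sparganophilus helenae, Sp. tennesseensis). In the"
)

# Applied line by line, the pattern from the question never matches,
# because none of the elements contains a newline character.
unchanged <- gsub(",$\n", "", sample_lines)
identical(unchanged, sample_lines)  # TRUE

# After collapsing, the two lines are merged into one element.
merged <- merge_continued(sample_lines)
length(merged)                      # 1
print(merged)
```

Since the patterns here are literal strings, `fixed = TRUE` is used to avoid regex escaping; the regex version in the answer should behave the same way for these two cases.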