
Line breaks removal in R


I'm following the code from this tutorial: https://www.youtube.com/watch?v=JyMBwydhYR8

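library(pdftools)  # pdf_text()
library(tm)        # Corpus(), VectorSource(), removePunctuation()
library(stringr)   # str_squish(), str_split(), fixed(); also provides %>%

# Read the text of every PDF in the working directory
All_Files <- list.files(pattern = "pdf$")
All_opinions <- lapply(All_Files, pdf_text)

document <- Corpus(VectorSource(All_opinions))

social_sentences <- document %>%
    tolower() %>%
    paste0(collapse = " ") %>%
    stringr::str_squish() %>%
    stringr::str_split(fixed(".")) %>%
    unlist() %>%
    tm::removePunctuation()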

But after creating the vector 'social_sentences', the line breaks are still not removed.

Instead, after the punctuation is removed, only the letter 'n' is left behind, and it gets joined to the adjacent words.

This is visible even in the tutorial, for example in the word 'hilln'.

The 'str_squish()' function is already part of the code, and I even moved it to a different position in the pipe to see whether that would solve the problem. I also tried the 'gsub()' and 'str_replace_all()' functions.


Solution

  • It is true that stray n characters appear in the video. However, the pipeline itself does remove real \n characters completely.

    Try this:

    library(stringr)  # also provides %>% and fixed()

    text <- "\n\nString with excess,  trailing and: leading! white   space\n\n"
    text %>%
      tolower() %>%
      paste0(collapse = " ") %>%
      stringr::str_split(fixed(".")) %>%
      unlist() %>%
      tm::removePunctuation() %>%
      stringr::str_squish()
    

    The result is:

    [1] "string with excess trailing and leading white space"
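
    Because the pipeline behaves correctly on a plain character vector, one option is to keep the text as a character vector and skip the Corpus step altogether. The following is only a sketch under that assumption, not part of the original code: `pages` is made-up data standing in for the output of pdf_text(), and base R functions are swapped in for the stringr and tm calls so that it runs without extra packages.

    ```r
    # Hypothetical stand-in for pdf_text(): one string per page, with real newlines
    pages <- c("First sentence.\nSecond one\ncontinues here.",
               "Third, on page two.\n")

    # Lower-case and join the pages into a single string
    flat <- paste(tolower(pages), collapse = " ")

    # Real newlines count as whitespace, so collapsing whitespace removes them
    flat <- gsub("\\s+", " ", flat)

    # Split on literal full stops, then drop the remaining punctuation
    sentences <- unlist(strsplit(flat, ".", fixed = TRUE))
    sentences <- gsub("[[:punct:]]", "", sentences)

    # Trim surrounding spaces and discard empty pieces
    sentences <- trimws(sentences)
    sentences <- sentences[nzchar(sentences)]

    print(sentences)
    ```

    No stray n can appear here, because the text is never converted to anything other than a character vector before the punctuation is removed.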
    
    EDIT:

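    Just add str_replace_all("\\\\n", " ") to your pipe, before removePunctuation(). The pattern "\\\\n" matches a literal backslash followed by the letter n, not a real newline character: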

    > pdf_text("stack_1003.pdf") |>
    +   VectorSource()|>
    +   Corpus() |>
    +   tolower() |> 
    +   unlist() |>
    +   paste0(collapse= " ") |>
    +   str_split(fixed(".")) |>
    +   str_replace_all("\\\\n", " ") |> 
    +   removePunctuation() |> 
    +   str_squish() 
    [1] "c string with excess trailing and leading white space string with excess trailing and leading white space string with excess trailing and leading white space string with excess trailing and leading white space string with excess trailing and leading white space string with excess trailing and leading white space string with excess trailing and leading white space string with excess trailing and leading white space string with excess trailing and leading white space string with excess trailing and leading white space string with excess trailing and leading white space string with excess trailing and leading white space string with excess trailing and leading white space string with excess trailing and leading white space string with excess trailing and leading white space string with excess trailing and leading white space string with excess trailing and leading white space string with excess trailing and leading white space string with excess trailing and leading white space
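
    A likely explanation for why this replacement is needed: tolower() is a base function that calls as.character() on anything that is not already a character vector, and for a list-like object this deparses the contents, so each real newline turns into the two literal characters \ and n. Removing punctuation then strips the backslash and leaves the n. The sketch below reproduces the effect with a plain list standing in for the Corpus (an assumption; `pages` is invented test data) and base R gsub() in place of the stringr and tm functions.

    ```r
    # Hypothetical stand-in for the Corpus: a list holding one multi-page document
    pages <- list(c("hill\nroad", "second page"))

    # tolower() coerces the list with as.character(), which deparses it
    deparsed <- tolower(pages)

    # The real newline is gone; a literal backslash followed by n took its place
    has_real_newline <- grepl("\n", deparsed, fixed = TRUE)
    has_literal_backslash_n <- grepl("\\n", deparsed, fixed = TRUE)

    # Stripping punctuation removes the backslash and leaves the stray n
    no_punct <- gsub("[[:punct:]]", "", deparsed)

    # Replacing the literal backslash + n first avoids the stray n
    repaired <- gsub("\\\\n", " ", deparsed)
    repaired <- gsub("[[:punct:]]", "", repaired)

    print(has_real_newline)
    print(has_literal_backslash_n)
    print(no_punct)
    print(repaired)
    ```

    The same deparsing would also account for the leading "c" in the output above, since the deparsed text starts with c( and only the parenthesis is removed as punctuation.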