
Remove line breaks, paragraph breaks in csv file using R


I have a csv file that contains some line breaks or paragraph breaks. I know this because when I open the csv file in a word processor, I see the pilcrow symbol ¶ at the end of each paragraph, before the beginning of the next one. How do I strip these line breaks from the csv file in R? Any help is much appreciated.


PAST MEDICAL HISTORY

  1. Persistent atrial fibrillation with atrial flutter, status-post atrial flutter ablation line in October of 2002.
  2. Tachy/brady syndrome.
  3. Insulin-dependent diabetes. Has been diabetic for approximately 35 years.
  4. Hypertension, well

Solution

  • Here is a test case, assuming you just want to remove empty lines. This is the file test.txt (complete with misspellings). (Note: your example is clearly not a csv file.)

    some header text
    
    more text
     even omre text
    

    ------------------

     # Read the file line by line, then keep only the non-empty lines
     txt <- readLines("test.txt")
     newtext <- txt[nchar(txt) > 0]
     newtext
    #[1] "some header text" "more text"        " even omre text"
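
A line that contains only spaces or tabs has a non-zero length, so the `nchar(txt) > 0` filter keeps it. The following is a minimal sketch of one way to drop such lines as well, assuming that whitespace-only lines should count as empty. The temporary file and its contents are made up for illustration and stand in for test.txt.

```r
# Write a small sample file to a temporary path (hypothetical stand-in for test.txt).
path <- tempfile(fileext = ".txt")
writeLines(c("some header text", "", "more text", "   ", " even omre text", ""), path)

txt <- readLines(path)

# trimws() strips leading/trailing whitespace, so whitespace-only lines become "".
# nzchar() is TRUE for non-empty strings; the original lines are kept untrimmed.
newtext <- txt[nzchar(trimws(txt))]
print(newtext)
```

Only the filter looks at the trimmed text, so the leading space in " even omre text" is preserved in the result.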
    

    To strip the leading numbering (digits followed by a period) from each line, post-process that result with sub():

     txt <- "PAST MEDICAL HISTORY
    
     1. Persistent atrial fibrillation with atrial flutter, status-post atrial flutter ablation line in October of 2002.
     2. Tachy/brady syndrome.
     3. Insulin-dependent diabetes.  Has been diabetic for approximately 35 years.  
     4. Hypertension, well"
    
    
     # Split the string into lines, then drop any leading run of digits and periods
     newtxt <- readLines(textConnection(txt))
     sub("^[[:digit:].]+", "", newtxt)
    #------------------------
    [1] "PAST MEDICAL HISTORY"                                                                                             
    [2] ""                                                                                                                 
    [3] " Persistent atrial fibrillation with atrial flutter, status-post atrial flutter ablation line in October of 2002."
    [4] " Tachy/brady syndrome."                                                                                           
    [5] " Insulin-dependent diabetes.  Has been diabetic for approximately 35 years.  "                                    
    [6] " Hypertension, well"     
    

    > sub("^[[:digit:].]+", "", newtxt[nchar(newtxt)>0])
    [1] "PAST MEDICAL HISTORY"                                                                                             
    [2] " Persistent atrial fibrillation with atrial flutter, status-post atrial flutter ablation line in October of 2002."
    [3] " Tachy/brady syndrome."                                                                                           
    [4] " Insulin-dependent diabetes.  Has been diabetic for approximately 35 years.  "                                    
    [5] " Hypertension, well"
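
If the file really is a csv and the line breaks sit inside quoted fields, a different approach may fit better. This is a rough sketch under that assumption; the question does not show the file's structure, and the column names and sample rows below are hypothetical. It relies on read.csv() keeping embedded newlines inside quoted fields, then replaces them with a space using gsub().

```r
# Build a small sample csv whose first record has a line break inside a quoted field.
# (Hypothetical data; a real file would be read from its own path.)
path <- tempfile(fileext = ".csv")
writeLines(c('id,note',
             '1,"first line',
             'second line"',
             '2,"single line"'), path)

# Quoted fields may span several physical lines, so this yields two rows.
df <- read.csv(path, stringsAsFactors = FALSE)

# Replace any run of carriage returns / newlines with one space,
# in every character column.
is_chr <- vapply(df, is.character, logical(1))
df[is_chr] <- lapply(df[is_chr], function(x) gsub("[\r\n]+", " ", x))

print(df)
```

After cleaning, write.csv(df, "cleaned.csv", row.names = FALSE) would save the result; the output file name here is only a placeholder.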