I have a CSV file that contains some line breaks or paragraph breaks. I know this because when I open the file in a Word document I see the pilcrow symbol (¶) at the end of each paragraph, before the next paragraph begins. How do I strip these line breaks from the CSV file in R? Any help is much appreciated.
PAST MEDICAL HISTORY
Here is a test case. (Note: your example is clearly not a csv file.)
Assuming you just want to remove empty lines, this is the file test.txt
(complete with misspellings):
some header text
more text
even omre text
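# Read the file as a character vector, one element per line
txt <- readLines("test.txt")
# Keep only the lines that contain at least one character
newtext <- txt[nchar(txt) > 0]
newtext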
#[1] "some header text" "more text" " even omre text"
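The same filter can be tried without an existing file. The following is a minimal sketch, not part of the original answer: it writes a temporary stand-in for test.txt (its contents, including where the blank lines fall, are assumed), and it additionally treats whitespace-only lines as empty by using the base R functions trimws() and nzchar().

```r
# Build a throwaway file standing in for test.txt (contents are assumed).
path <- tempfile(fileext = ".txt")
writeLines(c("some header text", "", "more text", "   ", " even omre text"), path)

txt <- readLines(path)

# Keep only lines that still have characters after trimming whitespace.
# trimws() is used only for the test; the kept lines are left unmodified.
newtext <- txt[nzchar(trimws(txt))]
newtext

# Write the cleaned lines back out if a file is needed.
out <- tempfile(fileext = ".txt")
writeLines(newtext, out)
```

If whitespace-only lines should be kept, the plain nchar(txt) > 0 test from the answer is the one to use.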
To strip the numbering (leading digits followed by a period) from the start of each line, post-process that result with sub():
txt <- "PAST MEDICAL HISTORY

1. Persistent atrial fibrillation with atrial flutter, status-post atrial flutter ablation line in October of 2002.
2. Tachy/brady syndrome.
3. Insulin-dependent diabetes. Has been diabetic for approximately 35 years.
4. Hypertension, well"
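# Split the string into lines, as readLines() would for a file
newtxt <- readLines(textConnection(txt))
# Remove any leading run of digits and periods from each line
sub("^[[:digit:].]+", "", newtxt)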
#------------------------
[1] "PAST MEDICAL HISTORY"
[2] ""
[3] " Persistent atrial fibrillation with atrial flutter, status-post atrial flutter ablation line in October of 2002."
[4] " Tachy/brady syndrome."
[5] " Insulin-dependent diabetes. Has been diabetic for approximately 35 years. "
[6] " Hypertension, well"
Combining this with the empty-line filter drops the blank entry as well:
sub("^[[:digit:].]+", "", newtxt[nchar(newtxt)>0])
[1] "PAST MEDICAL HISTORY"
[2] " Persistent atrial fibrillation with atrial flutter, status-post atrial flutter ablation line in October of 2002."
[3] " Tachy/brady syndrome."
[4] " Insulin-dependent diabetes. Has been diabetic for approximately 35 years. "
[5] " Hypertension, well"
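If the file really is a CSV and the line breaks sit inside quoted fields, a different approach may fit better. The sketch below is an assumption about that situation rather than something taken from the question: the file, the column names id and note, and the data are all hypothetical. It relies on read.csv() reading a quoted field that spans several lines as a single value, and then uses gsub() to replace the embedded breaks with a space.

```r
# Hypothetical CSV with a line break inside a quoted field.
path <- tempfile(fileext = ".csv")
writeLines(c('id,note',
             '1,"first paragraph',
             'second paragraph"',
             '2,"single line"'), path)

dat <- read.csv(path, stringsAsFactors = FALSE)

# Replace any run of CR/LF characters in character columns with one space.
is_chr <- vapply(dat, is.character, logical(1))
dat[is_chr] <- lapply(dat[is_chr], function(x) gsub("[\r\n]+", " ", x))
dat$note

# Save the cleaned table; the output path here is a temporary file.
write.csv(dat, tempfile(fileext = ".csv"), row.names = FALSE)
```

Replacing with a space rather than an empty string keeps the words on either side of the break from running together.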