Remove certain lines (with ---- and empty lines) from txt file using readLines() or read_lines()


I have this text file called textdata.txt:

TREATMENT DATA

------------------------------------
A: Text1
B: Text2

C: Text3
D: Text4

E: Text5
F: Text6
G: Text7

I would like to remove the line of dashes (---------) and the empty lines using readLines() or read_lines().

When I use readLines("textdata.txt") I get:

 [1] "TREATMENT DATA"                      
 [2] ""                                    
 [3] "------------------------------------"
 [4] "A: Text1"                            
 [5] "B: Text2"                            
 [6] ""                                    
 [7] "C: Text3"                            
 [8] "D: Text4"                            
 [9] ""                                    
[10] "E: Text5"                            
[11] "F: Text6"                            
[12] "G: Text7"  

Expected output:

 [1] "TREATMENT DATA"                      
 [2] "A: Text1"                            
 [3] "B: Text2"                                                               
 [4] "C: Text3"                            
 [5] "D: Text4"                                                             
 [6] "E: Text5"                            
 [7] "F: Text6"                            
 [8] "G: Text7"                                                             

Background: I have practically no experience handling files in R. The goal is to get the .txt files into a clean format so that I can load multiple text files stored in a folder into one data frame.


Solution

    1) read.table If we can assume that the only occurrence of - is where shown in the question, and that ? does not occur anywhere in the file, then this reads in the data, treating every line as a single field and discarding the header line. Because - is set as the comment character, everything from a - onward is ignored, so lines containing only - are regarded as blank and are dropped along with the empty lines. This reads the file into a one-column data frame, and [[1]] returns that column as a character vector. To keep the header line (TREATMENT DATA), as in the expected output, omit header = TRUE.

    read.table("myfile", sep = "?", comment.char = "-", header = TRUE)[[1]]
    

    2) grep Another possibility is to read in the file and then remove lines that are empty or contain only - characters.

    grep("^-*$", readLines("myfile"), invert = TRUE, value = TRUE)
    

    3) pipe We could process the input using a filter and then pipe the result into R. On Windows, grep is found in C:\Rtools40\usr\bin if Rtools40 is installed; if it is not on your path, use the complete path, and if you don't have it at all, replace grep with findstr. On UNIX/Linux the escaping may vary according to which shell you are using.

    readLines(pipe('grep -v "^-*$" myfile'))
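
As a rough sketch of how approach 2 could serve the goal stated in the question (several text files in a folder loaded into one data frame), the example below wraps the grep call in a helper and applies it to every .txt file in a folder. The helper name clean_lines, the temporary folder, the file names textdata1.txt and textdata2.txt, and the two-column layout (file, line) are all assumptions made for illustration; they are not part of the original answer.

```r
# Hypothetical helper (not from the answer): read one file and drop lines
# that are empty or consist only of "-" characters, as in approach 2.
clean_lines <- function(file) {
  grep("^-*$", readLines(file), invert = TRUE, value = TRUE)
}

# Recreate the question's sample in a temporary folder; the folder and
# file names are placeholders for illustration only.
folder <- tempfile("textdata_")
dir.create(folder)
sample_lines <- c("TREATMENT DATA", "", "------------------------------------",
                  "A: Text1", "B: Text2", "", "C: Text3", "D: Text4", "",
                  "E: Text5", "F: Text6", "G: Text7")
writeLines(sample_lines, file.path(folder, "textdata1.txt"))
writeLines(sample_lines, file.path(folder, "textdata2.txt"))

# Clean a single file.
cleaned <- clean_lines(file.path(folder, "textdata1.txt"))
print(cleaned)

# Clean every .txt file in the folder and stack the results into one data
# frame: one row per remaining line, tagged with the source file name.
files <- list.files(folder, pattern = "\\.txt$", full.names = TRUE)
combined <- do.call(rbind, lapply(files, function(f) {
  data.frame(file = basename(f), line = clean_lines(f),
             stringsAsFactors = FALSE)
}))
str(combined)
```

Keeping the file name next to each line is only one possible layout; splitting each line at the colon into key and value columns would be a natural next step, but that depends on how the real files are structured.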