
In R, how to read a file with a custom end of line (eol)


I have a text file to read in R (and store in a data.frame). The file is organized in several rows and columns. Both "sep" and "eol" are customized.

Problem: the custom eol, i.e. "\t&nd" (without the quotation marks), can't be set in read.table(...) (or read.csv(...), read.csv2(...), ...) nor in fread(...), and I haven't been able to find a solution.

I have searched here ("[r] read eol" and other queries I don't remember) and haven't found a solution: the only suggestion was to preprocess the file and change the eol, which is not possible in my case because some fields contain things like \n, \r, \n\r, ", ... (which is the reason for the custom delimiters).

Thanks!


Solution

  • You could approach this in two different ways:

    A. If the file is not too wide, you can read the rows with scan(), split each row into columns with strsplit(), and then combine the pieces into a data.frame. Example:

    # Provide a reproducible example of the file ("raw.txt" here) you are starting with
    your_text <- "a~b~c!1~2~meh!4~5~wow"
    write(your_text, "raw.txt"); rm(your_text)

    eol_str <- "!" # the character the rows divide on (scan() accepts only one single-byte character)
    sep_str <- "~" # whatever character(s) the columns divide on

    # read and parse the text file
    # scan gives you a character vector of row strings (one string per row)
    # strsplit gives you a list of character vectors (one element per column)
    rows <- scan("raw.txt", what = character(), sep = eol_str)
    row_list <- strsplit(rows, split = sep_str, fixed = TRUE)

    # the first row holds the column names; the remaining rows hold the data
    df <- data.frame(do.call(rbind, row_list[-1]))
    names(df) <- row_list[[1]]

    df
    #   a b   c
    # 1 1 2 meh
    # 2 4 5 wow
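
    scan() accepts only a single one-byte character for sep, and it also ends a field at every newline unless the field is quoted, so it cannot be used directly with a multi-character terminator such as "\t&nd" or with newlines inside fields. A minimal sketch of the same split-and-bind idea for that case follows. It reads the whole file as one string and splits it with strsplit(fixed = TRUE). The temporary file, the "~" column separator, and the sample content are assumptions made up for illustration, and the whole file is assumed to fit in memory.

```r
# Hypothetical example file: rows end with "\t&nd", columns are split on "~"
# (the "~" separator is an assumption; the question does not state the real one).
path    <- tempfile(fileext = ".txt")
eol_str <- "\t&nd"
sep_str <- "~"

# One field ("me\nh") contains an embedded newline on purpose.
example <- "a~b~c\t&nd1~2~me\nh\t&nd4~5~wow\t&nd"
writeChar(example, path, eos = NULL)   # binary write, no newline translation

# Read the whole file as a single string (binary mode, byte count).
txt <- readChar(path, nchars = file.size(path), useBytes = TRUE)

# Split into rows on the literal terminator, then into fields on the separator.
rows   <- strsplit(txt, eol_str, fixed = TRUE)[[1]]
rows   <- rows[nzchar(rows)]                      # drop empty rows, if any
fields <- strsplit(rows, sep_str, fixed = TRUE)

# First row is the header; bind the remaining rows into a data.frame.
df <- as.data.frame(do.call(rbind, fields[-1]), stringsAsFactors = FALSE)
names(df) <- fields[[1]]

# Everything is character at this point; guess column types as read.table would.
df[] <- lapply(df, type.convert, as.is = TRUE)

unlink(path)
```

    Because the split is done with fixed = TRUE, characters such as \n, \r, or " inside a field are kept as ordinary data. Note that strsplit() drops a trailing empty field, so rows ending in the separator would need padding before rbind.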
    

    B. If A doesn't work, I agree with @BondedDust that you probably need an external utility, but you can invoke it from R with system() and do a find/replace to reformat your file for read.table. The invocation will be specific to your OS. Example: https://askubuntu.com/questions/20414/find-and-replace-text-within-a-file-using-commands . Since you note that your text already contains \n and \r\n, I recommend that you first find and replace them with temporary placeholders (perhaps quoted versions of themselves) and then convert them back after you have built your data.frame.
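
    The placeholder idea in B can also be sketched without leaving R, using gsub() in place of an external find/replace run through system(). This is an illustration under stated assumptions, not a tested recipe for the real file: the placeholder strings "<CRLF>", "<LF>", and "<CR>" are invented here and must not occur in the data, the column separator is assumed to be the single character "~", and the file is assumed to fit in memory.

```r
# Hypothetical example file with a "\t&nd" row terminator and "~" separator.
path    <- tempfile(fileext = ".txt")
eol_str <- "\t&nd"
sep_str <- "~"      # assumed; read.table() needs a single-character sep
writeChar("a~b~c\t&nd1~2~me\nh\t&nd4~5~wow\t&nd", path, eos = NULL)

txt <- readChar(path, nchars = file.size(path), useBytes = TRUE)

# 1. Protect line breaks that are part of the data with placeholders.
txt <- gsub("\r\n", "<CRLF>", txt, fixed = TRUE)
txt <- gsub("\n",   "<LF>",   txt, fixed = TRUE)
txt <- gsub("\r",   "<CR>",   txt, fixed = TRUE)

# 2. Turn the custom terminator into an ordinary newline.
txt <- gsub(eol_str, "\n", txt, fixed = TRUE)

# 3. Let read.table() parse it; quoting and comments are switched off so that
#    characters such as " and # inside fields are treated as plain data.
df <- read.table(text = txt, sep = sep_str, header = TRUE,
                 quote = "", comment.char = "", stringsAsFactors = FALSE)

# 4. Convert the placeholders back in the character columns.
restore <- function(x) {
  if (!is.character(x)) return(x)
  x <- gsub("<CRLF>", "\r\n", x, fixed = TRUE)
  x <- gsub("<LF>",   "\n",   x, fixed = TRUE)
  gsub("<CR>", "\r", x, fixed = TRUE)
}
df[] <- lapply(df, restore)

unlink(path)
```

    The order of the replacements matters: "\r\n" is handled before the single "\n" and "\r", and the data line breaks are protected before the custom terminator is turned into a newline.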