
How to find a Windows end-of-line (EOL) character


I have several hundred GB of data that I need to paste together using the Unix paste utility in Cygwin, but it won't work properly if the files contain Windows EOL characters. The data may or may not have Windows EOL characters, and I don't want to spend the time running dos2unix if I don't have to.

So my question is: in Cygwin, how can I figure out whether these files have Windows (CRLF) EOL characters?

I've tried creating some test data and running

sed -r 's/\r\n//' testdata.txt

But that appears to match regardless of whether dos2unix has been run or not.

Thanks.


Solution

  • The file(1) utility knows the difference:

    $ file * | grep ASCII
    2:                                       ASCII text
    3:                                       ASCII English text
    a:                                       ASCII C program text
    blah:                                    ASCII Java program text
    foo.js:                                  ASCII C++ program text
    openssh_5.5p1-4ubuntu5.dsc:              ASCII text, with very long lines
    windows:                                 ASCII text, with CRLF line terminators
    

    file(1) has been optimized to read as little of a file as possible, so with luck it will drastically reduce the amount of disk I/O you need to perform when finding and fixing the CRLF terminators.

    Note that some cases of CRLF should stay in place: captures of SMTP traffic, for example, will use CRLF. But that's up to you. :)
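
To run the file(1) check above over many files and convert only the ones that need it, something along these lines may help. It is a minimal sketch, not a tested recipe for Cygwin: the function names list_crlf_files and has_crlf are made up for illustration, and it assumes file(1) words its report as in the output above. Because file(1) looks only at the start of a file, it can miss a CRLF that first appears late, so a grep-based check is included as a slower but exhaustive alternative. As a side note, the sed attempt in the question cannot distinguish the two cases: sed strips the trailing newline before applying the expression, so \r\n never matches and every line is printed unchanged either way.

```shell
#!/bin/sh
# Hypothetical helpers; the names are illustrative, not part of any tool.

# Print the names of the given files that file(1) reports as using CRLF.
# -b drops the file name from the report so a name containing "CRLF"
# cannot cause a false match.
list_crlf_files() {
    for f in "$@"; do
        if file -b -- "$f" | grep -q 'CRLF'; then
            printf '%s\n' "$f"
        fi
    done
}

# Succeed if any line of the file ends in a carriage return.
# grep -q stops at the first match, so files that do contain CRLF
# are usually identified quickly; clean files are read to the end.
has_crlf() {
    cr=$(printf '\r')
    grep -q "${cr}\$" -- "$1"
}

# Example use (convert only the files that need it):
#   for f in *.txt; do has_crlf "$f" && dos2unix "$f"; done
```

The file(1) version is the cheaper first pass; has_crlf is the one to use when a missed CRLF deep inside a large file would break the paste step.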