Tags: bash, sorting, text, awk

Sort a text file by line length including spaces


I have a CSV file that looks like this

AS2345,ASDF1232, Mr. Plain Example, 110 Binary ave.,Atlantis,RI,12345,(999)123-5555,1.56
AS2345,ASDF1232, Mrs. Plain Example, 1121110 Ternary st.                                        110 Binary ave..,Atlantis,RI,12345,(999)123-5555,1.56
AS2345,ASDF1232, Mr. Plain Example, 110 Binary ave.,Liberty City,RI,12345,(999)123-5555,1.56
AS2345,ASDF1232, Mr. Plain Example, 110 Ternary ave.,Some City,RI,12345,(999)123-5555,1.56

I need to sort it by line length, including spaces. The following command doesn't include spaces. Is there a way to modify it so that it works for me?

cat $@ | awk '{ print length, $0 }' | sort -n | awk '{$1=""; print $0}'

Solution

  • Answer

    < testfile awk '{ print length, $0 }' | sort -n -s | cut -d" " -f2-
    

    Or, to keep your original (perhaps unintentional) sub-sorting of any equal-length lines, leave out -s:

    < testfile awk '{ print length, $0 }' | sort -n | cut -d" " -f2-
    

    In both cases, the stated problem is solved by using cut instead of awk for the final step: cut strips the length prefix without rebuilding the line, so the original spacing survives.
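If you would rather stay in awk for the last step, one possible variant (a sketch, not part of the original answer) is to strip the length prefix with sub() on $0 instead of assigning to $1. As far as I know, sub() edits the record text directly and does not trigger the rebuild described further down. The function name sort_by_length is made up for illustration.

```shell
# Hypothetical helper: reads lines on stdin, writes them sorted by length.
sort_by_length() {
  awk '{ print length, $0 }' |
    sort -n -s |
    # Remove the leading "<length> " from $0 itself; no field is assigned,
    # so awk does not rejoin the fields and the spacing is kept.
    awk '{ sub(/^[0-9]+ /, ""); print }'
}

printf '%s\n' 'bb   b' 'a' 'c  c' | sort_by_length
# a
# c  c
# bb   b
```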

    Lines of matching length - what to do in the case of a tie:

    The question did not specify whether or not further sorting was wanted for lines of matching length. I've assumed that this is unwanted and suggested the use of -s (--stable) to prevent such lines being sorted against each other, and keep them in the relative order in which they occur in the input.

    (Those who want more control of sorting these ties might look at sort's --key option.)
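As a rough illustration of that kind of control, the sketch below sorts by length first and then breaks ties by the text of the line, using -k (the short form of --key). The function name is hypothetical, and LC_ALL=C is an assumption added here so that ties are compared byte by byte rather than by the current locale's collation rules.

```shell
# Hypothetical helper: length first, then byte-wise text order among ties.
sort_by_length_then_text() {
  awk '{ print length, $0 }' |
    # -t ' '   fields are separated by single spaces
    # -k1,1n   first key: field 1 only (the length), compared numerically
    # -k2      second key: from field 2 to the end of the line, as text
    LC_ALL=C sort -t ' ' -k1,1n -k2 |
    cut -d ' ' -f2-
}

printf '%s\n' 'bb' 'a' 'aa' 'c' | sort_by_length_then_text
# a
# c
# aa
# bb
```

Changing the first key to -k1,1nr should put the longest lines first instead.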

    Why the question's attempted solution fails (awk line-rebuilding):

    It is interesting to note the difference between:

    echo "hello   awk   world" | awk '{print}'
    echo "hello   awk   world" | awk '{$1="hello"; print}'
    

    They yield respectively

    hello   awk   world
    hello awk world
    

    The relevant section of the gawk manual mentions only as an aside that awk rebuilds the whole of $0 (joining the fields with OFS, which is a single space by default) whenever you assign to a field. That is why the runs of spaces collapse in the second command. I guess it's not crazy behaviour. The manual has this:

    "Finally, there are times when it is convenient to force awk to rebuild the entire record, using the current value of the fields and OFS. To do this, use the seemingly innocuous assignment:"

     $1 = $1   # force record to be reconstituted
     print $0  # or whatever else with $0
    

    "This forces awk to rebuild the record."

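To make the rebuild easier to see, here is a small sketch that sets OFS to a hyphen (an arbitrary choice for illustration) so that the rejoined record is obviously different from the input. The variable names are made up.

```shell
line="hello   awk   world"

# No field is assigned, so $0 is printed exactly as it was read.
untouched=$(echo "$line" | awk -v OFS='-' '{ print }')

# Assigning $1 to itself changes no value, but awk still rebuilds $0,
# joining the fields with OFS.
rebuilt=$(echo "$line" | awk -v OFS='-' '{ $1 = $1; print }')

printf '%s\n' "$untouched" "$rebuilt"
# hello   awk   world
# hello-awk-world
```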
    Test input including some lines of equal length:

    aa A line   with     MORE    spaces
    bb The very longest line in the file
    ccb
    9   dd equal len.  Orig pos = 1
    500 dd equal len.  Orig pos = 2
    ccz
    cca
    ee A line with  some       spaces
    1   dd equal len.  Orig pos = 3
    ff
    5   dd equal len.  Orig pos = 4
    g
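One way to try both commands on this input is sketched below. It writes the test lines to a file in a temporary directory (the tmpdir variable and the use of mktemp are assumptions for illustration, not part of the original answer) and captures the output of each pipeline. With -s, the four "dd equal len." lines should come out in their original order, 1 to 4. Without it, sort falls back to comparing whole lines for ties, so those four lines may be reordered, and the exact order can depend on your locale.

```shell
# Work in a throwaway directory so nothing in the current one is overwritten.
tmpdir=$(mktemp -d)

cat > "$tmpdir/testfile" <<'EOF'
aa A line   with     MORE    spaces
bb The very longest line in the file
ccb
9   dd equal len.  Orig pos = 1
500 dd equal len.  Orig pos = 2
ccz
cca
ee A line with  some       spaces
1   dd equal len.  Orig pos = 3
ff
5   dd equal len.  Orig pos = 4
g
EOF

# Stable: equal-length lines keep their input order.
stable=$(awk '{ print length, $0 }' "$tmpdir/testfile" | sort -n -s | cut -d" " -f2-)

# Not stable: equal-length lines are additionally sorted against each other.
unstable=$(awk '{ print length, $0 }' "$tmpdir/testfile" | sort -n | cut -d" " -f2-)

printf '%s\n' "--- with -s ---" "$stable" "--- without -s ---" "$unstable"

rm -r "$tmpdir"
```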