regex awk multiline

How do I split this text into multiline records?


#!/usr/bin/bash

mailing_list="Jane Doe
123 Main Street
Anywhere, SE 12345-6789

John Smith
456 Tree-lined Avenue
Smallville, MW 98765-4321



Amir Faquer
C. de la Lusitania 98
08206 Sabadell
        
Amir Faquer w spaces before
C. de la Lusitania 98
08206 Sabadel
    
      
      
      
Wife w spaces before
C. de la Lusitania 98
08206 Sabadell
"
echo "$mailing_list"|awk -v RS='' -v FS='\n' '/.*/ 
END {print "The number of records is "NR"."}'

echo "$mailing_list"|awk -v RS='\n\n+' -v FS='\n' '/.*/ 
END {print "The number of records is "NR"."}'

echo "$mailing_list"|awk -v RS='\n *\n+' -v FS='\n' '/.*/ 
END {print "The number of records is "NR"."}'


How do I split this mailing list into multi-line records when the separator is not always a run of empty lines, i.e. when RS='\n\n+' is not enough? The last command in my code informs me that the number of records is seven, which is not correct: there are just five records. I also want the blank lines that contain arbitrary amounts of whitespace to act as RS. How might I accomplish that?


Solution

  • You can put any awk into "paragraph mode" by setting RS to null. In that mode awk will treat any sequence of 1 or more empty lines as the record separator:

    $ printf '  foo\n\tbar\n\n    etc\n'
      foo
            bar
    
        etc
    $ printf '  foo\n\tbar\n\n    etc\n' |
        awk -v RS= '{print NR, "<"$0">"}'
    1 <  foo
            bar>
    2 <    etc>
    

    I'm including white space at the start of lines to ensure that the solution proposed won't treat them as part of the RS.
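
    Since the question also sets FS='\n', it may help to see that in paragraph mode each line of a record is then available as its own field. A minimal sketch with made-up two-record input; the helper name list_records is hypothetical and only there for illustration:

```shell
# In paragraph mode (RS set to the empty string) a newline always separates
# fields, so with FS='\n' each line of a record becomes exactly one field.
list_records() {
    awk -v RS= -v FS='\n' '{ print NR ": " $1 " (" NF " lines)" }'
}

# Two records separated by one empty line.
printf 'Jane Doe\n123 Main Street\n\nJohn Smith\n456 Tree-lined Avenue\nSmallville\n' |
    list_records
```

    This should print "1: Jane Doe (2 lines)" and "2: John Smith (3 lines)".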

    Paragraph mode doesn't do everything you want, though: you also want lines that contain only white space to be treated as part of the record separator, and the above won't do that:

    $ printf '  foo\n\tbar\n    \n    etc\n' |
        awk -v RS= '{print NR, "<"$0">"}'
    1 <  foo
            bar
    
        etc>
    

    To include lines of all white space in the RS you need to write a regexp to do that. POSIX only defines the behavior of RS when it is a single character (or null, for paragraph mode); the result of setting a multi-char RS is unspecified. With GNU awk and a couple of others now, though, you CAN use a multi-char regexp as the separator, and in GNU awk \s can be used as shorthand for [[:space:]]:

    $ printf '  foo\n\tbar\n    \n    etc\n' |
        awk -v RS='\n((\\s*\n)|$)' '{print NR, "<"$0">"}'
    1 <  foo
            bar>
    2 <    etc>
    

    The |$ is necessary so that the newline at the end of the file doesn't become part of the last record:

    $ printf '  foo\n\tbar\n    \n    etc\n' |
        awk -v RS='\n(\\s*\n)' '{print NR, "<"$0">"}'
    1 <  foo
            bar>
    2 <    etc
    >
    

    Note that you don't need to double the \ in \n as it's one of the specific escape sequences defined at https://www.gnu.org/software/gawk/manual/gawk.html#Escape-Sequences. You do need to double the backslash in \s, and in anything else not defined at that link, because when the string is converted to a regexp to be used as the record separator one layer of backslashes gets consumed.
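
    The doubling rule can be checked without any GNU extension. A minimal sketch, assuming only a POSIX awk; the variable name re, the helper name check_dot and the sample strings are made up for illustration:

```shell
# The -v assignment processes escape sequences once, so 'a\\.b' is stored as
# the four-character string a\.b. Used as a dynamic regexp, that string
# matches a literal dot rather than "any character".
check_dot() {
    awk -v re='a\\.b' 'BEGIN { print ("axb" ~ re), ("a.b" ~ re) }'
}

check_dot
```

    This should print "0 1": the doubled backslash survives string processing as a single backslash, so only the input with a real dot matches.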

    That doesn't completely solve the problem though as you may need to consider blank lines at the start or end of your input.

    $ printf '\n  foo\n\tbar\n    \n    etc\n\n' |
        awk -v RS='\n(\\s*\n)' '{print NR, "<"$0">"}'
    1 <
      foo
            bar>
    2 <    etc>
    

    The blank line at the end is being ignored which is correct behavior since multiple blank lines are your RS, but the blank line at the start should be considered as either:

    1. the end of a preceding record which is empty, or
    2. something to be ignored (as is done in "paragraph mode").

    If you want "1" above then:

    $ printf '\n  foo\n\tbar\n    \n    etc\n\n' |
        awk -v RS='(^|\n)((\\s*\n)|$)' '{print NR, "<"$0">"}'
    1 <>
    2 <  foo
            bar>
    3 <    etc>
    

    but if you want "2" above (to emulate what "paragraph mode" does) then:

    $ printf '\n  foo\n\tbar\n    \n    etc\n\n' |
        awk -v RS='(^|\n)((\\s*\n)|$)' '(NR==1) && /^\s*$/{NR--; next} {print NR, "<"$0">"}'
    1 <  foo
            bar>
    2 <    etc>
    

    and if you want the same behavior with a POSIX awk then:

    $ printf '\n  foo\n\tbar\n    \n    etc\n\n' |
        awk '
            /^[[:space:]]*$/ { $0=rec; if ($0 != "") print ++nr, "<"$0">"; rec=""; next }
            { rec = ( rec == "" ? "" : rec ORS ) $0 }
            END { $0=rec; if ($0 != "") print ++nr, "<"$0">" }
        '
    1 <  foo
            bar>
    2 <    etc>
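
    Applied to the mailing list from the question, the same line-by-line idea should report five records. This is a sketch rather than a drop-in replacement: count_records is a hypothetical helper, it only counts records instead of printing them, and the list is rebuilt with printf so the whitespace-only lines are visible.

```shell
# Hypothetical helper: counts records separated by empty or whitespace-only
# lines, using only POSIX awk features.
count_records() {
    awk '
        /^[[:space:]]*$/ { if (rec != "") nr++; rec = ""; next }  # separator line closes a record
        { rec = rec $0 ORS }                                       # accumulate a non-blank line
        END { if (rec != "") nr++; print "The number of records is " (nr+0) "." }
    '
}

# The mailing list from the question, one argument per line.
printf '%s\n' \
    'Jane Doe' '123 Main Street' 'Anywhere, SE 12345-6789' '' \
    'John Smith' '456 Tree-lined Avenue' 'Smallville, MW 98765-4321' '' '' '' \
    'Amir Faquer' 'C. de la Lusitania 98' '08206 Sabadell' '        ' \
    'Amir Faquer w spaces before' 'C. de la Lusitania 98' '08206 Sabadel' \
    '    ' '      ' '      ' '      ' \
    'Wife w spaces before' 'C. de la Lusitania 98' '08206 Sabadell' '' |
    count_records
```

    This should print "The number of records is 5."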
    

    There may be cases I haven't thought through above, e.g. no input or all-blank input. If the above doesn't do what you want for those cases, they are left as an exercise :-).
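
    For what it's worth, the two cases just mentioned can be checked against the POSIX awk program above. A small sketch, where split_records is a hypothetical wrapper around that program:

```shell
# The POSIX awk program from above, wrapped in a hypothetical helper function.
split_records() {
    awk '
        /^[[:space:]]*$/ { $0=rec; if ($0 != "") print ++nr, "<"$0">"; rec=""; next }
        { rec = ( rec == "" ? "" : rec ORS ) $0 }
        END { $0=rec; if ($0 != "") print ++nr, "<"$0">" }
    '
}

# Each check prints its message only if the program produced no output at all.
[ -z "$(printf '' | split_records)" ] && echo "no input: no records"
[ -z "$(printf '\n  \n\t\n' | split_records)" ] && echo "all-blank input: no records"
```

    Both messages should be printed, i.e. neither input yields a record, which matches what paragraph mode does.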