#!/usr/bin/bash
mailing_list="Jane Doe
123 Main Street
Anywhere, SE 12345-6789
John Smith
456 Tree-lined Avenue
Smallville, MW 98765-4321
Amir Faquer
C. de la Lusitania 98
08206 Sabadell
Amir Faquer w spaces before
C. de la Lusitania 98
08206 Sabadel
Wife w spaces before
C. de la Lusitania 98
08206 Sabadell
"
echo "$mailing_list"|awk -v RS='' -v FS='\n' '/.*/
END {print "The number of records is "NR"."}'
echo "$mailing_list"|awk -v RS='\n\n+' -v FS='\n' '/.*/
END {print "The number of records is "NR"."}'
echo "$mailing_list"|awk -v RS='\n *\n+' -v FS='\n' '/.*/
END {print "The number of records is "NR"."}'
How do I split this mailing list into multi-line records, and not only where the records are separated by truly empty lines, as with RS='\n\n+'? The last line of my code informs me that the number of records is seven, which is not correct: there are just five records. I also want the blank lines that contain arbitrary amounts of whitespace to act as RS. How might I accomplish that?
You can put any awk into "paragraph mode" by setting RS to null. In that mode awk will treat any sequence of 1 or more empty lines as the record separator:
$ printf ' foo\n\tbar\n\n etc\n'
foo
bar
etc
$ printf ' foo\n\tbar\n\n etc\n' |
awk -v RS= '{print NR, "<"$0">"}'
1 < foo
bar>
2 < etc>
I'm including white space at the start of lines to ensure that the proposed solution won't treat it as part of the RS.
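As a sketch of how paragraph mode applies to address-style input like the question's: with RS set to null, a newline always acts as a field separator, so setting FS='\n' makes each line of a record one field. The two-record sample and the function name summarize below are hypothetical stand-ins, not the real list.

```shell
# Hypothetical, shortened sample: two addresses separated by one empty line.
sample='Jane Doe
123 Main Street

John Smith
456 Tree-lined Avenue'

# RS= selects paragraph mode; FS='\n' makes every line of a record a field.
summarize() {
    awk -v RS= -v FS='\n' '
        { print NR ": " $1 " (" NF " lines)" }
        END { print "The number of records is " NR "." }
    '
}

printf '%s\n' "$sample" | summarize
```

This should report two records, each with two lines, with the name in $1.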
That doesn't do everything you want, though, as you also want lines that contain only white space to be treated as part of the record separator, and paragraph mode won't do that:
$ printf ' foo\n\tbar\n \n etc\n' |
awk -v RS= '{print NR, "<"$0">"}'
1 < foo
bar
etc>
To include lines of all white space in the RS you need to write a regexp to do that. POSIX only defines the behavior of a single-character RS (the result of a multi-char RS is unspecified), but GNU awk and a couple of others now treat a multi-char RS as a regexp, and in GNU awk \s can be used as shorthand for [[:space:]]:
$ printf ' foo\n\tbar\n \n etc\n' |
awk -v RS='\n((\\s*\n)|$)' '{print NR, "<"$0">"}'
1 < foo
bar>
2 < etc>
The |$ is necessary so that the newline at the end of the file doesn't become part of the last record:
$ printf ' foo\n\tbar\n \n etc\n' |
awk -v RS='\n(\\s*\n)' '{print NR, "<"$0">"}'
1 < foo
bar>
2 < etc
>
Note that you don't need to double the \ in \n as it's one of the specific escape sequences defined at https://www.gnu.org/software/gawk/manual/gawk.html#Escape-Sequences. You do need to double the \ in \s, and in anything else not defined in that link, because when the string is converted to a regexp to be used as the record separator, one layer of \s gets consumed.
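A quick way to see one layer of backslashes being consumed is to print a variable assigned with -v, since escape sequences in such assignments are processed as in an awk string literal. The function name show and the variable name s are only for illustration.

```shell
# The -v assignment is processed like an awk string literal:
# "\\" collapses to a single backslash and "\n" becomes a real newline.
show() {
    awk -v s="$1" 'BEGIN { print length(s) ": " s }'
}

show 'a\\sb'   # 4 characters remain: a, \, s, b
show 'a\nb'    # 3 characters remain: a, newline, b
```

The first call prints `4: a\sb`, which is the string awk then turns into the regexp.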
That doesn't completely solve the problem, though, as you may need to consider blank lines at the start or end of your input:
$ printf '\n foo\n\tbar\n \n etc\n\n' |
awk -v RS='\n(\\s*\n)' '{print NR, "<"$0">"}'
1 <
foo
bar>
2 < etc>
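The blank line at the end is being ignored, which is correct behavior since multiple blank lines are your RS, but the blank line at the start should be considered as either:
1. a record separator that terminates an empty first record, or
2. leading blank input to be skipped, as "paragraph mode" does.
If you want "1" above then: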
$ printf '\n foo\n\tbar\n \n etc\n\n' |
awk -v RS='(^|\n)((\\s*\n)|$)' '{print NR, "<"$0">"}'
1 <>
2 < foo
bar>
3 < etc>
but if you want "2" above (to emulate what "paragraph mode" does) then:
$ printf '\n foo\n\tbar\n \n etc\n\n' |
awk -v RS='(^|\n)((\\s*\n)|$)' '(NR==1) && /^\s*$/{NR--; next} {print NR, "<"$0">"}'
1 < foo
bar>
2 < etc>
and if you want the same behavior with a POSIX awk then:
$ printf '\n foo\n\tbar\n \n etc\n\n' |
awk '
/^[[:space:]]*$/ { $0=rec; if ($0 != "") print ++nr, "<"$0">"; rec=""; next }
{ rec = ( rec == "" ? "" : rec ORS ) $0 }
END { $0=rec; if ($0 != "") print ++nr, "<"$0">" }
'
1 < foo
bar>
2 < etc>
There may be cases I haven't thought through above, e.g. no input or all-blank input. If the above doesn't do what you want for those cases, they are left as an exercise :-).
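For what it's worth, the two edge cases named above can be checked against the POSIX script directly. The wrapper function split_records is a hypothetical name used only to avoid repeating the script. Both calls should print nothing, since no non-empty record is ever accumulated.

```shell
# Same POSIX awk logic as above, wrapped in a function (name is illustrative).
split_records() {
    awk '
        /^[[:space:]]*$/ { $0=rec; if ($0 != "") print ++nr, "<"$0">"; rec=""; next }
        { rec = ( rec == "" ? "" : rec ORS ) $0 }
        END { $0=rec; if ($0 != "") print ++nr, "<"$0">" }
    '
}

printf ''          | split_records   # no input
printf '\n \n\t\n' | split_records   # only empty and white-space-only lines
```

Whether printing nothing is the right answer for those inputs depends on what you do with the records afterwards.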