I'd like to know whether the following behavior is reliable:
echo 😄 | gawk '{split($0,array,"."); print array[1] length(array);}'
Output is: 😄1
vs
echo 😄 | gawk '{patsplit($0,array,"."); print array[1] length(array);}'
Output is: �4
patsplit is working on bytes, whereas split is working on characters, but I haven't found this documented or discussed anywhere. The question is: can I count on this behavior, and where?
This is because split and patsplit are fundamentally different functions.
split divides a string into fields by a field separator, i.e. what's between the fields, while patsplit divides a string into fields by matching the fields themselves with a field pattern.
All gawk functions, including split and patsplit, work on locale-dependent characters, not bytes, per the documentation.
Also, a single-character string such as "." used as a field separator is treated literally rather than as a regex pattern (see the documentation on FS).
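A quick way to check the literal treatment of a single-character separator is to split an input that actually contains dots. The sample string a.b.c and the variable names below are assumptions of mine, and the second command is only there for contrast: as far as I can tell, passing a regex constant instead of a string makes gawk treat the dot as a regex, so I would expect a different count there.

```shell
# A one-character STRING separator is taken literally, so only the real dots split.
literal=$(printf 'a.b.c\n' | gawk '{ print split($0, parts, ".") }')
echo "$literal"   # 3

# For contrast: a regex CONSTANT, where the dot should match any character.
regex=$(printf 'a.b.c\n' | gawk '{ print split($0, parts, /./) }')
echo "$regex"
```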
Since there is no . in the input string 😄, when you call split with "." as the field separator, split sees only 1 field.
And since 😄 consists of 4 bytes in UTF-8, and presumably you have set your locale to a byte-based one such as C, when you call patsplit with "." as the field pattern, each . matches one byte of 😄, producing an array of size 4.