
Shell, IFS, read and tab characters


I'm trying to read a TSV file in a shell script, and found that when IFS is set to a tab, read skips empty fields. An example is worth a thousand words:

$ echo -e "a\tb\tc" | while IFS=$'\t' read v1 v2 v3; do echo "$v1 - $v2 - $v3"; done
a - b - c

This works as expected.

$ echo -e "a\t\tc" | while IFS=$'\t' read v1 v2 v3; do echo "$v1 - $v2 - $v3"; done
a - c - 

I would have expected $v2 to be empty and $v3 to be "c".

$ echo -e "a||c" | while IFS=$'|' read v1 v2 v3; do echo "$v1 - $v2 - $v3"; done
a -  - c

With | as the delimiter, $v2 gets an empty value and $v3 gets the value "c", as I expect.

Does anyone have an explanation for the different behavior between | and \t? And is there a way to make \t behave like |?


Solution

  • Does anyone have an explanation for the different behavior between | and \t?

    From the POSIX specification of read:

    The line shall be split into fields as in the shell (see Field Splitting); the first field shall be assigned to the first variable var, the second field to the second variable var, and so on. If there are fewer var operands specified than there are fields, the leftover fields and their intervening separators shall be assigned to the last var. If there are fewer fields than vars, the remaining vars shall be set to empty strings.

    So let's look at POSIX shell field splitting (emphasis mine):

    The shell shall treat each character of the IFS as a delimiter and use the delimiters to split the results of parameter expansion and command substitution into fields.

    1. If the value of IFS is a <space>, <tab>, and <newline>, or if it is unset, ... [doesn't apply here]
    2. If the value of IFS is null, ... [also doesn't apply here]
    3. Otherwise, the following rules shall be applied in sequence. The term "IFS white space" is used to mean any sequence (zero or more instances) of white space characters that are in the IFS value (for example, if IFS contains <space>/ <comma>/ <tab>, any sequence of <space>s and <tab>s is considered IFS white space).
      1. IFS white space shall be ignored at the beginning and end of the input.
      2. Each occurrence in the input of an IFS character that is not IFS white space, along with any adjacent IFS white space, shall delimit a field, as described previously.
      3. Non-zero-length IFS white space shall delimit a field.

    When IFS consists only of whitespace characters, any run of consecutive IFS whitespace in the input is treated as a single delimiter when splitting fields (that is the "non-zero-length" in rule 3).

    So echo -e "a\t\tc" | IFS=$'\t' read v1 v2 v3 is equivalent to echo -e "a\t\t\t\t\tc" | IFS=$'\t' read v1 v2 v3. Because there "are fewer fields than vars" (2 vs 3), v3 is set to the empty string.

    But when IFS is set to anything other than whitespace, each occurrence of an IFS character delimits a field, so two adjacent delimiters produce an empty field.

    Yet another funny corner case, where whitespace characters are treated specially.
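The difference can be checked with a minimal sketch. The helper name split3 is hypothetical (it is not part of the question or the answer); it splits one line into three variables with a given IFS value and prints them the same way the question's examples do. The tab is built with printf so that the sketch also runs in plain sh.

```shell
# Hypothetical helper: split "$2" into three fields using "$1" as IFS
# and print them joined by " - ", like the examples in the question.
split3() {
    printf '%s\n' "$2" | {
        IFS=$1 read -r v1 v2 v3
        printf '%s - %s - %s\n' "$v1" "$v2" "$v3"
    }
}

tab=$(printf '\t')

# Tab is IFS white space: a run of tabs counts as one delimiter,
# so two tabs and three tabs give the same two fields.
split3 "$tab" "a${tab}${tab}c"         # prints "a - c - "
split3 "$tab" "a${tab}${tab}${tab}c"   # prints "a - c - "

# "|" is not white space: each occurrence delimits a field,
# so the empty field between the two bars is kept.
split3 '|' 'a||c'                      # prints "a -  - c"
```

Only the delimiter changes between the calls, which isolates the whitespace rule as the cause of the missing empty field.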

  • And is there a way to make \t behave like |?

    In bash, replace the tab with a unique non-whitespace character before reading. I like to use the 0x01 byte:

    echo -e "a\t\tc" |
        tr '\t' $'\x01' |
        while IFS=$'\x01' read -r v1 v2 v3; do echo "$v1 - $v2 - $v3"; done
    

    Remember to use read -r, so that backslashes in the input are kept literally instead of being treated as escape characters.
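As a rough sketch of the same workaround applied to several rows, the hypothetical function print_tsv_fields below reads TSV from standard input and prints three fields per row in brackets. It assumes every row has three columns, and it swaps in the ASCII unit separator (0x1f, octal 037) instead of 0x01 purely as an illustrative choice; any byte that is not whitespace and never occurs in the data should work the same way.

```shell
# Hypothetical helper: read TSV rows from stdin and print three fields
# per row in brackets, keeping empty fields.
# Assumption: the data never contains the byte 0x1f.
print_tsv_fields() {
    sep=$(printf '\037')          # ASCII unit separator, not IFS white space
    tr '\t' '\037' |
        while IFS=$sep read -r v1 v2 v3; do
            printf '[%s] [%s] [%s]\n' "$v1" "$v2" "$v3"
        done
}

# Two sample rows: one with an empty middle field, one with an empty first field.
printf 'a\t\tc\n\tb\tc\n' | print_tsv_fields
# prints:
# [a] [] [c]
# [] [b] [c]
```

Because the separator is no longer whitespace, both the empty middle field and the empty leading field survive the split.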