Why can't you use cat to read a file line by line where each line has delimiters?


I have a text file that contains something like this:

abc 123, comma
the quick brown fox
jumped over the lazy dog
comma, comma

I wrote a script

for i in `cat file`
do
   echo $i
done

For some reason, the script doesn't print the file line by line but breaks it at the commas as well as at the newlines. Why is cat or for blah in `cat xyz` doing this, and how can I make it NOT do this? I know I can use a while loop:

while read line
do
   blah blah blah
done < file

but I want to know why cat or the for var in is doing this to further my understanding of Unix commands. cat's man page didn't help me and looking at for or looping in the bash manual didn't yield any answers (http://www.gnu.org/software/bash/manual/bashref.html). Thanks in advance for your help.


Solution

  • The problem is not in cat, nor in the for loop per se; it is in the use of back quotes. When you write either:

    for i in `cat file`
    

    or (better):

    for i in $(cat file)
    

    or (in ksh, zsh or bash¹):

    for i in $(<file)
    

    the shell executes the command, captures its output as a string, removes trailing newline characters (and, in bash, all NUL bytes), splits the result into words at the characters in $IFS and (except in zsh) performs globbing, also known as filename generation or pathname expansion, on the resulting words. By default $IFS contains space, tab and newline, so every whitespace-separated word becomes a separate value of $i; the commas in the sample file are not separators, the breaks come from the spaces next to them.

    If you want lines in $i, you either have to fiddle with IFS or use the while loop. The while loop is better if there's any danger that the files processed will be large: it doesn't have to read the whole file into memory at once, it doesn't perform globbing and, unlike the versions using $(...), it doesn't skip empty lines.
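
    To see the default splitting in action, here is a minimal sketch. It assumes a temporary copy of the question's sample file (the `file` and `count` variables are made up for the illustration) and uses `$(cat ...)` so that it also runs in shells without `$(<file)`.

    ```shell
    # Hypothetical copy of the question's sample file.
    file=$(mktemp)
    printf '%s\n' 'abc 123, comma' 'the quick brown fox' \
        'jumped over the lazy dog' 'comma, comma' > "$file"

    # Default IFS (space, tab, newline): the unquoted substitution is split
    # into words, so the loop body runs once per word, not once per line.
    count=0
    for i in $(cat "$file"); do
        count=$((count + 1))
        printf '%s: %s\n' "$count" "$i"
    done
    rm -f "$file"
    ```

    The four lines yield 14 iterations, and the commas stay attached to their words (`123,` and `comma,`). Restricting IFS to a newline and turning off globbing, as in the loop that follows, gives one iteration per non-empty line instead.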

    IFS='
    '
    set -o noglob # disable globbing
    for i in $(<file)
    do printf '%s\n' "$i"
    done
    

    The quotes around "$i" are generally a good idea. In this context, with the modified $IFS and globbing disabled, they aren't actually critical, but good habits are good habits even so. printf is better than echo: depending on the echo implementation and/or environment, echo outputs nothing or an empty line for input lines such as -n, -nene or -eee, and may mangle backslashes. The quotes and printf do matter in the following script:

    old="$IFS"
    IFS='
    '
    set -o noglob
    for i in $(<file)
    do
       (
       IFS="$old"
       set +o noglob
       printf '%s\n' "$i"
       )
    done
    

    when the data file contains tabs or multiple spaces (both of which are in the default value of $IFS), wildcards, or leading or trailing whitespace:

    $ cat file
    abc                  123
      foo
    -Enee
    /e* /b*
    $ 
    

    Output:

    $ sh bq.sh
    abc                  123
      foo
    -Enee
    /e* /b*
    $
    

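    With echo and without the double quotes, the runs of spaces collapse, the leading whitespace is lost, the -Enee line disappears (echo takes it as options) and the wildcards are expanded: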

    $ cat bq.sh
    old="$IFS"
    IFS='
    '
    set -o noglob
    for i in $(<file)
    do
       (
       IFS="$old"
       set +o noglob
       echo $i
       )
    done
    $ sh bq.sh
    abc 123
    foo
    /etc /bin /boot
    $
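
    The point made earlier, that the $(...) versions skip empty lines even with a newline-only $IFS, can be checked with a short sketch. The sample data and the names `nl`, `seen`, `star` and `total` are assumptions for the illustration; the newline is built with printf rather than a literal line break so the snippet survives re-indentation.

    ```shell
    # Hypothetical data: four lines, the second one empty.
    file=$(mktemp)
    printf '%s\n' 'first line' '' 'second * line' 'third' > "$file"

    # Build a newline without relying on a literal line break in the source.
    nl=$(printf '\nx'); nl=${nl%x}

    old_ifs=$IFS
    IFS=$nl
    set -o noglob            # keep the * literal
    seen=0
    star=lost
    for i in $(cat "$file"); do
        seen=$((seen + 1))
        if [ "$i" = 'second * line' ]; then
            star=kept
        fi
    done
    set +o noglob
    IFS=$old_ifs

    total=$(grep -c '' "$file")   # counts every line, empty ones included
    printf 'lines in file: %s, loop iterations: %s, star: %s\n' \
        "$total" "$seen" "$star"
    rm -f "$file"
    ```

    The file has 4 lines but the loop runs only 3 times: newline is an IFS whitespace character, so consecutive newlines count as a single separator and the empty line never reaches the loop.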
    

    For the while read loop, the syntax should be:

    while IFS= read -r line
    do
       printf '%s\n' "$line"
    done < file
    
    • without -r, read would mangle backslashes
    • without IFS=, read would remove leading and trailing spaces and tabs (assuming the default value of $IFS).
    • printf should be used instead of echo, and $line quoted for the same reasons as above.
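
    The effect of -r and IFS= can be seen on a single line. This is a small sketch, assuming a made-up sample line with leading spaces, a backslash and trailing spaces; the variable names `exact`, `no_r` and `no_ifs` are hypothetical.

    ```shell
    # Hypothetical sample line: two leading spaces, a backslash, two trailing spaces.
    file=$(mktemp)
    printf '%s\n' '  a\tb  ' > "$file"

    # Recommended form: the line arrives unchanged.
    IFS= read -r exact < "$file"

    # Without -r: the backslash is treated as an escape character and removed.
    IFS= read no_r < "$file"

    # Without IFS=: leading and trailing spaces are stripped.
    read -r no_ifs < "$file"

    # Brackets make the surrounding whitespace visible.
    printf '[%s]\n' "$exact" "$no_r" "$no_ifs"
    rm -f "$file"
    ```

    Only the first variable holds the line exactly as stored in the file; the second has lost its backslash and the third its surrounding spaces.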

    ¹ Though in bash it's much less of an optimisation as bash still forks a child process to perform the expansion.