Linux IFS environment variable is not used by gcc wordexp Posix C library function for splitting words


Environment

OS: Ubuntu 20.04, CentOS 8, macOS Catalina 10.15.7
Language: C, C++
Compiler: gcc (most recent versions for each OS)

Issue

I am using the POSIX library function wordexp to get shell-like expansion of strings.
The expansion works fine with one exception: when I set the $IFS environment variable to something other than whitespace, for example ':', word splitting is unaffected and continues to happen on whitespace only, regardless of the IFS value.

bash test

The Linux man page for wordexp (https://man7.org/linux/man-pages/man3/wordexp.3.html) states:

  1. "The function wordexp() performs a shell-like expansion of the string..."
  2. "Field splitting is done using the environment variable $IFS. If it is not set, the field separators are space, tab and newline."

This is why I expected wordexp to behave the same way as bash in this respect.
On all the listed OSes, bash gave the same correct, expected result when I changed the character set used for splitting.
Using the default (IFS is not set):

    read -a words <<<"1 2:3 4:5"
    for word in "${words[@]}"; do echo "$word";  done

correctly splits on space and produces the result:

    1
    2:3
    4:5

while setting IFS to ':'

    IFS=':' read -a words <<<"1 2:3 4:5"
    for word in "${words[@]}"; do echo "$word";  done

correctly splits on ':' and produces the result:

    1 2
    3 4
    5

C code test

But running the code below yields the same result regardless of whether the IFS environment variable is set:

C Code:

    #include <stdio.h>
    #include <wordexp.h>
    #include <stdlib.h>
    
    static void expand(char const *title, char const *str)
    {
        printf("%s input: %s\n", title, str);
        wordexp_t exp;
        int rcode = 0;
        if ((rcode = wordexp(str, &exp, WRDE_NOCMD)) == 0) {
            printf("output:\n");
            for (size_t i = 0; i < exp.we_wordc; i++)
                printf("%s\n", exp.we_wordv[i]);
            wordfree(&exp);
        } else {
            printf("expand failed %d\n", rcode);
        }
    }
    
    int main()
    {
        char const *str = "1 2:3 4:5";
        
        expand("No IFS", str);
    
        int rcode = setenv("IFS", ":", 1);
        if ( rcode != 0 ) {
            perror("setenv IFS failed");  /* perror() appends ": <error text>" itself */
            return 1;
        }
    
        expand("IFS=':'", str);
    
        return 0;
    }

The result in all OSes is the same:

    No IFS input: 1 2:3 4:5
    output:
    1
    2:3
    4:5
    IFS=':' input: 1 2:3 4:5
    output:
    1
    2:3
    4:5

As a note, the snippet above was created for this post. I also tested with more complex code that verified that the environment variable was indeed set properly.

Source code review

I looked at the source code of the wordexp implementation available at https://code.woboq.org/userspace/glibc/posix/wordexp.c.html, and it appears that it does use $IFS, but perhaps inconsistently, or maybe this is a bug.
Specifically:
In the body of wordexp, which starts on line 2229, it does get the IFS environment variable value and processes it:
lines 2273 - 2276:

     /* Find out what the field separators are.
       * There are two types: whitespace and non-whitespace.
       */
      ifs = getenv ("IFS");

But later in the function it does not seem to use the $IFS value for word separation.
This looks like a bug unless "field separators" on line 2273 and "word separator" on line 2396 mean different things.
lines 2395 - 2398:

          default:
            /* Is it a word separator? */
            if (strchr (" \t", words[words_offset]) == NULL)
            {

But in any case the code seems to use only space or tab as a separator, unlike bash, which respects the separators set in IFS.

Questions

  1. Am I missing something, and is there a way to get wordexp to split on characters other than whitespace?
  2. If the split is only on whitespace, is this a bug in
    • the C library implementation, or
    • the Linux man page for wordexp, which claims that $IFS can be used to define separators?

Many thanks in advance for all your comments and insights!

Answers Summary and workaround

The accepted answer hints at how to split on non-whitespace characters from $IFS: set $IFS, store the string that you want to split in a temporary environment variable, and then call wordexp on an expansion of that temporary variable. This is demonstrated in the updated code below.
While this behavior, which is visible in the source code, may not actually be a bug, it definitely looks like a questionable design decision to me…
Updated code:

    #include <stdio.h>
    #include <wordexp.h>
    #include <stdlib.h>
    
    static void expand(char const *title, char const *str)
    {
        printf("%s input: %s\n", title, str);
        wordexp_t exp;
        int rcode = 0;
        if ((rcode = wordexp(str, &exp, WRDE_NOCMD)) == 0) {
            printf("output:\n");
            for (size_t i = 0; i < exp.we_wordc; i++)
                printf("%s\n", exp.we_wordv[i]);
            wordfree(&exp);
        } else {
            printf("expand failed %d\n", rcode);
        }
    }
    
    int main()
    {
        char const *str = "1 2:3 4:5";
        
        expand("No IFS", str);
    
        int rcode = setenv("IFS", ":", 1);
        if ( rcode != 0 ) {
            perror("setenv IFS failed");  /* perror() appends ": <error text>" itself */
            return 1;
        }
    
        expand("IFS=':'", str);
        
        rcode = setenv("FAKE", str, 1);
        if ( rcode != 0 ) {
            perror("setenv FAKE failed");  /* perror() appends ": <error text>" itself */
            return 2;
        }
    
        expand("FAKE", "${FAKE}");    
    
        return 0;
    }

which produces the result:

    No IFS input: 1 2:3 4:5
    output:
    1
    2:3
    4:5
    IFS=':' input: 1 2:3 4:5
    output:
    1
    2:3
    4:5
    FAKE input: ${FAKE}
    output:
    1 2
    3 4
    5

Solution

  • You're comparing apples to oranges. wordexp() splits a string up into individual tokens the same way the shell does. The shell builtin read doesn't follow the same algorithm; it just does word splitting. You should be comparing wordexp() to how the arguments to a script or shell function are parsed:

    #!/bin/sh
    
    printwords() {
        for arg in "$@"; do
            printf "%s\n" "$arg"
        done
    }
    
    echo "No IFS input: 1 2:3 4:5"
    printwords 1 2:3 4:5
    echo "IFS=':' input: 1 2:3 4:5"
    IFS=:
    printwords 1 2:3 4:5
    

    This produces

    No IFS input: 1 2:3 4:5
    1
    2:3
    4:5
    IFS=':' input: 1 2:3 4:5
    1
    2:3
    4:5
    

    just like the C program.


    Now, for the interesting bit. I couldn't find it explicitly mentioned as such in the POSIX documentation with a quick scan, but the bash manual has this to say about word splitting:

    Note that if no expansion occurs, no splitting is performed.

    Let's try a version that does parameter expansion in its arguments:

    #!/bin/sh
    
    printwords() {
        for arg in "$@"; do
            printf "%s\n" "$arg"
        done
    }
    
    foo=2:3
    printf "foo = %s\n" "$foo"
    printf "No IFS input: 1 \$foo 4:5\n"
    printwords 1 $foo 4:5
    printf "IFS=':' input: 1 \$foo 4:5\n"
    IFS=:
    printwords 1 $foo 4:5
    

    which, when run via shells like dash, ksh93 or bash (but not zsh unless you turn on the SH_WORD_SPLIT option), produces

    foo = 2:3
    No IFS input: 1 $foo 4:5
    1
    2:3
    4:5
    IFS=':' input: 1 $foo 4:5
    1
    2
    3
    4:5
    

    As you can see, the argument that has a parameter was subject to field splitting, but not the literal one. Making the same change to the string in your C program and running foo=2:3 ./wordexp prints out the same thing.