visual-studio sed cygwin

Why can't sed match more than one character at a time in this file?


I want to use sed to process a bunch of files that Visual Studio produced. There seems to be something about these files that makes sed behave differently, even when it is given what look like identical strings:

Two commands that print the same string:

$ echo "#endif    // not APSTUDIO_INVOKED"
#endif    // not APSTUDIO_INVOKED

$ cat Version.rc.in | tail -n 3 | head -n 1
#endif    // not APSTUDIO_INVOKED

In either case, I can substitute one character at a time:

$ echo "#endif    // not APSTUDIO_INVOKED" | sed 's/A/B/'
#endif    // not BPSTUDIO_INVOKED

$ cat Version.rc.in | tail -n 3 | head -n 1 | sed 's/A/B/'
#endif    // not BPSTUDIO_INVOKED

But when I try to substitute more than one character, it fails for the file output but succeeds for the echo output:

$ echo "#endif    // not APSTUDIO_INVOKED" | sed 's/AP/B/'
#endif    // not BSTUDIO_INVOKED

$ cat Version.rc.in | tail -n 3 | head -n 1 | sed 's/AP/B/'
#endif    // not APSTUDIO_INVOKED

Further tinkering has convinced me that the limitation has to do with sed's ability to match patterns that are more than one character long. For example, 's/A/XXX/' works but 's/AP/BB/' does not.

Why?

I am using Cygwin on Windows Server 2012:

$ uname -a
CYGWIN_NT-6.3 MattsWinBox 2.3.1(0.291/5/3) 2015-11-14 12:44 x86_64 Cygwin

Solution

  • Just a guess: the file from Visual Studio might be encoded as UTF-16, which uses two bytes for each of these characters, and sed might not be aware of it. You can run the following commands to check:

    echo "#endif    // not APSTUDIO_INVOKED" | od -c
    cat Version.rc.in | tail -n 3 | head -n 1 | od -c
    

    od -c dumps the input byte by byte, printing each non-printable character as a backslash escape or an octal code.

    For the first command, I get the following output on Linux:

    0000000   #   e   n   d   i   f                   /   /       n   o   t
    0000020       A   P   S   T   U   D   I   O   _   I   N   V   O   K   E
    0000040   D  \n
    0000042
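
If the od output for the file shows a \0 after every character, that would support the UTF-16 guess: sed sees the bytes A \0 P \0, so 's/A/B/' still matches but 's/AP/B/' does not. The following is a minimal sketch of that effect and of one possible fix, converting with iconv before running sed. It assumes UTF-16LE without a byte order mark, and it builds its own sample in a temporary file (the `tmp` variable is hypothetical) rather than reading the real Version.rc.in.

```shell
# Sample line taken from the question.
line='#endif    // not APSTUDIO_INVOKED'

# Assumption: the file is UTF-16LE. Simulate it by encoding the sample,
# so every ASCII character is followed by a NUL byte.
tmp=$(mktemp)
printf '%s\n' "$line" | iconv -f UTF-8 -t UTF-16LE > "$tmp"

# sed on the raw bytes: "AP" is stored as A \0 P \0, so the pattern never matches.
# The result is decoded afterwards only so it can be printed as text.
unconverted=$(sed 's/AP/B/' "$tmp" | iconv -f UTF-16LE -t UTF-8)

# Decode to UTF-8 first, then let sed do the substitution.
converted=$(iconv -f UTF-16LE -t UTF-8 "$tmp" | sed 's/AP/B/')

echo "$unconverted"   # line is unchanged: ... not APSTUDIO_INVOKED
echo "$converted"     # substitution applied: ... not BSTUDIO_INVOKED

rm -f "$tmp"
```

If the result has to go back to Visual Studio in its original encoding, the sed output could be piped through `iconv -f UTF-8 -t UTF-16LE` again. Whether the real file carries a byte order mark is not known from the question, so `-f UTF-16` may be the better source encoding in that case.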