Tags: string, powershell, sed, escaping

sed giving error when repetition is used in powershell


I am writing a PowerShell script and need to use sed to extract part of the output of another command. It looks something like this:

echo "d6121090" | sed -E "s/^d6(.*)10.*/\\1/"
12

But if I replace .* in the sed script with .{0,} or with .{0,2}, sed gives me an error:

echo "d6121090" | sed -E "s/^d6(.{0,})10.*/\\1/"
/usr/bin/sed: can't read s/^d6(.)10.*/\1/: No such file or directory
echo "d6121090" | sed -E "s/^d6(.{0,2})10.*/\\1/"
/usr/bin/sed: can't read s/^d6(.2)10.*/\1/: No such file or directory

I'm not sure if the error has something to do with sed or powershell.

I'm using sed provided by Cygwin (4.9-1), version is (GNU sed) 4.9.

If there is a better way to extract a part of string then please mention that also.


Solution

  • tl;dr

    Workarounds, in descending order of preference (note that echo "d6121090" was simplified to just 'd6121090', using PowerShell's implicit output behavior - see this answer):

    • Use '...' quoting embedded inside "...":

      'd6121090' | sed -E "'s/^d6(.{0,2})10.*/\1/'"
      
      • Note: This takes advantage of the fact that - unusually - the Cygwin-provided executables also recognize '...' quoting on their command lines.
        • Using '"..."' instead (quotes swapped) works as-is, surprisingly, but only up to PowerShell v7.2.x; it should never have worked, but did due to a long-standing bug in how arguments with embedded " characters were passed to external programs. This was (mostly) fixed in v7.3, where you can opt into the old, broken behavior with $PSNativeCommandArgumentPassing = 'Legacy'; see this answer for details.
    • Use an extra space to force PowerShell to quote the sed script on the process command line built behind the scenes (see next section).

      'd6121090' | sed -E 's/^d6(.{0,2})10.*/\1/ ' # <- Note the trailing space
      
    • Call via cmd.exe, which allows you to control quoting explicitly (the spaces around " are just for readability).

      'd6121090' | cmd /c " sed -E 's/^d6(.{0,2})10.*/\1/' "
      
    • Use --%, the stop-parsing token, to pass the remainder of the command line as-is, but note its fundamental limitations, discussed in the bottom section of this answer.

      'd6121090' | sed -E --% 's/^d6(.{0,2})10.*/\1/'
      

    Background information:

    You're seeing the confluence of two surprising behaviors:

    • When you invoke a Cygwin utility such as sed from a Windows shell such as PowerShell, the command line is interpreted as if it had been submitted from a POSIX-compatible shell such as Bash.

      • On the plus side, this allows you to use '...' quoting even when calling from cmd.exe, even though CLIs on Windows normally only understand "..." quoting.
    • PowerShell - of necessity (on Windows) - rebuilds the command line it uses to launch child processes behind the scenes, because its own command-line syntax - notably the ability to use '...' strings - cannot be expected to be understood by outside CLIs (which on Windows are expected to recognize "..." strings only).

      • However, PowerShell employs on-demand double-quoting when it rebuilds the command line, based solely on whether an argument contains spaces. Therefore, what was verbatim "s/^d6(.{0,2})10.*/\\1/" on the original command line is placed as verbatim s/^d6(.{0,2})10.*/\\1/ - without quoting! - on the process command line.

      • This is normally not a problem, given that most CLIs use their arguments verbatim rather than subjecting them to shell-like interpretation.

        • It is, however, also a problem with cmd.exe, both in direct cmd /c calls and indirectly, when calling batch files; GitHub issue #15143, among other suggested improvements, proposed modifying PowerShell to accommodate this quirk and double-quote even space-less arguments if they contain cmd.exe metacharacters, but it looks like such improvements won't be implemented.

    Therefore, the command line that PowerShell actually submits is the following - note the absence of quoting:

    sed -E s/^d6(.{0,2})10.*/\\1/
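
The on-demand quoting rule described above can be sketched as a small function. This is a simplified, hypothetical model (the name ConvertTo-ModelCommandLine is made up for illustration) and not PowerShell's actual implementation; it assumes the only rule is "double-quote an argument if it contains a space" and ignores the handling of embedded " characters.

```powershell
# Simplified, hypothetical model of on-demand quoting; NOT PowerShell's real code.
# Assumption: an argument is double-quoted only if it contains a space.
function ConvertTo-ModelCommandLine {
    param(
        [string]   $Executable,
        [string[]] $Arguments
    )
    $parts = foreach ($arg in $Arguments) {
        if ($arg -match ' ') { '"{0}"' -f $arg } else { $arg }
    }
    # Join executable and (possibly quoted) arguments with single spaces.
    (@($Executable) + @($parts)) -join ' '
}

# No space in the script: it ends up unquoted on the command line.
ConvertTo-ModelCommandLine -Executable sed -Arguments @('-E', 's/^d6(.{0,2})10.*/\1/')
# -> sed -E s/^d6(.{0,2})10.*/\1/

# Trailing space in the script: it gets enclosed in "...".
ConvertTo-ModelCommandLine -Executable sed -Arguments @('-E', 's/^d6(.{0,2})10.*/\1/ ')
# -> sed -E "s/^d6(.{0,2})10.*/\1/ "
```

Under this model, the second call shows why the trailing-space workaround protects the { and } characters from Cygwin's shell-like interpretation.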
    

    The lack of quoting around the sed script causes Cygwin to treat {0,2} as a Bash-style brace expansion expression, which expands the script to multiple arguments, with the extra argument getting interpreted as a (non-existent) filename.

    You can verify this as follows, using Cygwin's printf.exe:

    # From PowerShell
    printf '%s\n' 's/^d6(.{0,2})10.*/\1/'
    

    Output:

    s/^d6(.0)10.*/1/
    s/^d6(.2)10.*/1/
    

    That is, the brace expansion turned the single argument containing .{0,2} into two arguments, one containing .0 and one containing .2, each with the surrounding prefix and suffix string. In the sed call, the second argument was then interpreted as a filename.


    There are several workarounds, as shown above. In cases where adding an extra space would interfere with the intended functionality, such as when passing a space-less search pattern to grep, use the "'...'" technique shown at the top. In this case, however, the simplest workaround is to append a space to your sed script, which doesn't interfere with the script's function, but forces PowerShell to enclose the script in "..." behind the scenes:

    'd6121090' | sed -E 's/^d6(.{0,2})10.*/\1/ ' # <- Note the trailing space
    

    Note:

    • '...' quoting is used on the PowerShell side, which is generally preferable when you're dealing with verbatim (literal) values.

    • The \ in \1 is not escaped (not doubled); to PowerShell itself, \ is never special.

      • See the bottom section for more information about PowerShell string literals.
    • Even though the '...' quoting gets translated to "..." quoting behind the scenes, \ also does not require escaping on the Cygwin side (though doing so would also work). This, along with the fact that an unquoted argument such as s/^d6(.{0,2})10.*/\1/ would cause a syntax error on the Bash command line - due to the unescaped ( and ) - suggests that Cygwin employs some kind of hybrid approach to parsing the command line (presumably built into each and every .exe that Cygwin comes with).


    "If there is a better way to extract a part of string then please mention that also."

    PowerShell has great regex support built in, and its -replace operator allows you to do what sed's s/// function does, only more efficiently, because the operation is performed in-process:

    # From PowerShell
    'd6121090' -replace '^d6(.{0,2})10.*', '$1' # -> '12'
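
If the goal is only to extract the captured part, a variation on the above is to match rather than replace. The following is a minimal sketch, assuming the same sample input; the variable name $inputString is chosen for illustration. It uses the -match operator with the automatic $Matches variable, and alternatively the .NET [regex]::Match() method.

```powershell
$inputString = 'd6121090'

# With a single string on the left, -match returns $true or $false and,
# on success, fills the automatic $Matches variable; index 1 is capture group 1.
if ($inputString -match '^d6(.{0,2})10') {
    $Matches[1]   # -> 12
}

# The same extraction via the .NET regex API.
[regex]::Match($inputString, '^d6(.{0,2})10').Groups[1].Value   # -> 12
```

One difference worth noting: -replace returns the input string unchanged when the regex doesn't match, whereas -match lets you detect that case explicitly.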
    

    PowerShell's string literals and escaping:

    • In PowerShell's expandable (double-quoted) strings ("..."), \ has no special meaning (and neither do { and }).

      • PowerShell's escape character is `, the so-called backtick, and inside "..." only it and $ have special meaning (the latter for referencing variables and subexpressions to be expanded (interpolated)) and therefore need escaping with ` if meant to be used verbatim.

      • Similar to POSIX-compatible shells such as Bash, PowerShell also has verbatim (single-quoted) strings ('...'); '...' is generally preferable when expressing regexes or substitution expressions.
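
The quoting rules above can be checked with a few short expressions. This is a small illustrative sketch; the variable $n is introduced only for the demonstration.

```powershell
$n = 1

# Verbatim string: nothing is expanded; \ and $ are kept as-is.
'\$n costs $n'        # -> \$n costs $n

# Expandable string: $n is interpolated, but \ is still an ordinary character.
"\1 and $n"           # -> \1 and 1

# A backtick escapes $ inside "...", so it is used verbatim.
"`$n is $n"           # -> $n is 1

# Because \ is not special, "\\1" from the question is two backslashes plus 1.
"\\1".Length          # -> 3
```

The last line shows why the question's "\\1" passed a doubled backslash on to sed, which is unnecessary from PowerShell's point of view.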