Decoding octal escape sequences with awk


Suppose you have octal escape sequences in a stream:

backslash \134 is escaped as \134134
single quote ' and double quote \042
linefeed `\012` and carriage return `\015`
%s &
etc...

Note: in my input, the escaped characters are limited to 0x01-0x1F, 0x22, 0x5C, and 0x7F.

How can you convert those escape sequences back to their corresponding characters with awk?

While awk understands these escapes out of the box when they appear in a string literal or in an argument, I can't find a way to leverage this capability when the escape sequence is part of the data. For now I'm using one gsub per escape sequence, but that doesn't feel efficient.
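
The gap described above can be reproduced with a minimal sketch. The sample string `quote: \042` and the shell variable names are only illustrations, not part of the original input: awk expands the octal escape in a string literal and in a `-v` assignment, but leaves the same characters untouched when they arrive as input data.

```shell
# Escape inside an awk string literal: expanded when awk parses the program.
from_literal=$(awk 'BEGIN { print "quote: \042" }')

# Escape inside a -v assignment: also expanded by awk.
from_assignment=$(awk -v s='quote: \042' 'BEGIN { print s }')

# Escape inside the data stream: passed through unchanged.
from_data=$(printf '%s\n' 'quote: \042' | awk '{ print }')

printf '%s\n' "$from_literal" "$from_assignment" "$from_data"
```

The first two lines printed end in a real double quote, while the third still ends in the four characters `\042`.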

Here's the expected output for the given sample:

backslash \ is escaped as \134
single quote ' and double quote "
linefeed `
` and carriage return `
%s &
etc...

Solution

  • Using GNU awk for strtonum() and lots of meaningfully-named variables to show what each step does:

    $ cat tst.awk
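    function octs2chars(str,        head,tail,oct,dec,char) {
        head = ""                                        # text already decoded
        tail = str                                       # text still to be scanned
        while ( match(tail,/\\[0-7]{3}/) ) {             # next backslash + 3 octal digits
            oct  = substr(tail,RSTART+1,RLENGTH-1)       # the 3 digits, without the backslash
            dec  = strtonum(0 oct)                       # leading 0 makes strtonum() read octal
            char = sprintf("%c", dec)                    # numeric code -> character
            head = head substr(tail,1,RSTART-1) char     # text before the escape, then the character
            tail = substr(tail,RSTART+RLENGTH)           # continue after the escape
        }
        return head tail
    }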
    { print octs2chars($0) }
    

    $ awk -f tst.awk file
    backslash \ is escaped as \134
    single quote ' and double quote "
    linefeed `
    ` and carriage return `
    %s &
    etc...
    

    If you don't have GNU awk, write a small function to convert octal to decimal, e.g. oct2dec() below, and call it instead of strtonum():

    $ cat tst2.awk
    function oct2dec(oct,   dec) {
        dec =  substr(oct,1,1) * 8 * 8    # first digit: 64s place
        dec += substr(oct,2,1) * 8        # second digit: 8s place
        dec += substr(oct,3,1)            # third digit: 1s place
        return dec
    }
    
    function octs2chars(str,        head,tail,oct,dec,char) {
        head = ""
        tail = str
        while ( match(tail,/\\[0-7]{3}/) ) {
            oct  = substr(tail,RSTART+1,RLENGTH-1)
            dec  = oct2dec(oct)        # replaced "strtonum(0 oct)"
            char = sprintf("%c", dec)
            head = head substr(tail,1,RSTART-1) char
            tail = substr(tail,RSTART+RLENGTH)
        }
        return head tail
    }
    { print octs2chars($0) }
    

    $ awk -f tst2.awk file
    backslash \ is escaped as \134
    single quote ' and double quote "
    linefeed `
    ` and carriage return `
    %s &
    etc...
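
As a quick check of the arithmetic in oct2dec(), the escape `\134` should come out as 1*64 + 3*8 + 4 = 92, which is 0x5C, the backslash. The sketch below repeats that calculation with plain awk; the shell variable names are only illustrative.

```shell
# Same positional arithmetic as oct2dec(), applied to the digits "134".
dec=$(awk 'BEGIN {
    oct = "134"
    print substr(oct,1,1) * 64 + substr(oct,2,1) * 8 + substr(oct,3,1)
}')

# printf "%c" with a number turns the character code back into a character.
char=$(awk 'BEGIN { printf "%c", 92 }')

printf '%s -> %s\n' "$dec" "$char"
```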
    

    The above assumes that, as discussed in the comments, every backslash in the input starts a three-digit octal escape, as in the provided sample input.
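
To make that assumption concrete, here is a small sketch of what happens on either side of it. The `decode_octal` shell wrapper and the `C:\101pps` input are hypothetical, and the regexp is spelled `[0-7][0-7][0-7]` rather than `{3}` on the assumption that some older awks do not accept interval expressions; otherwise the logic follows tst2.awk.

```shell
# Hypothetical wrapper around a portable variant of octs2chars().
decode_octal() {
    awk '
    function octs2chars(str,    head, tail, oct, dec) {
        head = ""
        tail = str
        # three explicit bracket expressions instead of an interval expression
        while ( match(tail, /\\[0-7][0-7][0-7]/) ) {
            oct  = substr(tail, RSTART+1, 3)
            dec  = substr(oct,1,1) * 64 + substr(oct,2,1) * 8 + substr(oct,3,1)
            head = head substr(tail, 1, RSTART-1) sprintf("%c", dec)
            tail = substr(tail, RSTART+RLENGTH)
        }
        return head tail
    }
    { print octs2chars($0) }'
}

# Escaped backslash followed by digits: decoded once, the result is not rescanned.
printf '%s\n' '\134134' | decode_octal     # prints \134

# Unescaped backslash that happens to precede three octal digits: misread as an escape.
printf '%s\n' 'C:\101pps' | decode_octal   # prints C:Apps
```

Because decoded text moves into `head` and only `tail` is searched again, a decoded backslash can never combine with the digits that follow it. The second call shows the flip side: a backslash that was never escaped in the first place is silently consumed.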