awk, text-processing, mawk

Need to remove duplicate lines using mawk (specifically)


I have a gawk command that works fine, but one of my machines only has mawk installed, and installing gawk there fails with broken-dependency errors. I would like to convert this line to syntax that mawk accepts.

awk -F '[|]{3}' 'BEGIN {OFS="|||"} !seen[$4]++ {print $4,$7,$3,$5,$6,$8,$9,$10,$11}' $1

Input file: fields are delimited by three pipes (|||)

A|||B|||C|||D|||E|||F|||G|||H|||I|||J|||K||||L|||M|||N|||O|||P|||Q|||R|||S||||T|||U
1|||2|||3|||4|||5|||6|||7|||8|||9|||10|||11|||12|||13|||14|||15|||16|||17|||18|||19

Solution

  • POSIX awk uses extended regular expressions (EREs), which can express repetition of a character with an interval expression such as {m,n}:

    When an ERE matching a single character or an ERE enclosed in parentheses is followed by an interval expression of the format {m}, {m,}, or {m,n}, together with that interval expression it shall match what repeated consecutive occurrences of the ERE would match. The values of m and n are decimal integers in the range 0 <= m <= n <= {RE_DUP_MAX}, where m specifies the exact or minimum number of occurrences and n specifies the maximum number of occurrences. The expression {m} matches exactly m occurrences of the preceding ERE, {m,} matches at least m occurrences, and {m,n} matches any number of occurrences between m and n, inclusive.

    source: POSIX Regular Expressions

    Unfortunately, mawk does not support this interval syntax (at least in older releases such as 1.3.3), as can be read from its manual (Section 3, Regular Expressions).

    So instead of defining the field separator FS with -F '[|]{3}', spell out the repetition yourself: use -F '[|][|][|]' or -F "\\|\\|\\|".
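Applied to the command from the question, the change might look like the sketch below. The sample file and its contents are made up for illustration (they are not the asker's data), and `awk` is assumed to resolve to mawk on the machine in question; the same program should also run unchanged under gawk.

```shell
# Hypothetical sample data: lines 1 and 2 share the same 4th field ("k1").
sample=$(mktemp)
printf '%s\n' \
    'a|||b|||c|||k1|||e|||f|||g|||h|||i|||j|||k' \
    'A|||B|||C|||k1|||E|||F|||G|||H|||I|||J|||K' \
    'a|||b|||c|||k2|||e|||f|||g|||h|||i|||j|||k' > "$sample"

# Same program as in the question; only the field separator is spelled out
# as three bracket expressions instead of the interval expression [|]{3}.
out=$(awk -F '[|][|][|]' 'BEGIN { OFS = "|||" }
    !seen[$4]++ { print $4, $7, $3, $5, $6, $8, $9, $10, $11 }' "$sample")

printf '%s\n' "$out"
rm -f "$sample"
```

Only the first line seen for each value of the 4th field is printed, so the second sample line is dropped. In the original script, replacing "$sample" with "$1" keeps the command reading the file passed as the first argument.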