awk unicode grep byte-order-mark

Why does awk not remove the BOM from the middle of a line?


I am trying to use awk to remove all byte order marks from a file (it contains many of them):

awk '{sub(/\xEF\xBB\xBF/,"")}{print}' f1.txt > f2.txt

It seems to remove the BOMs at the beginning of a line, but those in the middle of a line are not removed. I can verify that with:

grep -U $'\xEF\xBB\xBF' f2.txt

grep returns one line where a BOM is in the middle.


Solution

  • As mentioned, sub() replaces only the leftmost matching substring, so if a global replacement is what you're after, use gsub() or, even better, gensub() (a GNU awk extension).

    sub(regexp, replacement [, target])

    Search target, which is treated as a string, for the leftmost, longest substring matched by the regular expression regexp. Modify the entire string by replacing the matched text with replacement. The modified string becomes the new value of target. Return the number of substitutions made (zero or one).

    gsub(regexp, replacement [, target])

    Search target for all of the longest, leftmost, nonoverlapping matching substrings it can find and replace them with replacement. The ‘g’ in gsub() stands for “global,” which means replace everywhere.

    gensub(regexp, replacement, how [, target])

    Search the target string target for matches of the regular expression regexp. If how is a string beginning with ‘g’ or ‘G’ (short for “global”), then replace all matches of regexp with replacement. Otherwise, how is treated as a number indicating which match of regexp to replace. gensub() is a general substitution function. Its purpose is to provide more features than the standard sub() and gsub() functions. Unlike sub() and gsub(), it returns the modified string as its result and leaves the original target unchanged.

    There's plenty more helpful information, with examples, in:

    The GNU Awk User's Guide: String Functions / 9.1.3 String-Manipulation Functions
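
To make the difference concrete, here is a minimal sketch that applies sub() and gsub() to the same input. The sample line, the temporary directory, and the file name f2_sub.txt are made up for illustration. The BOM is built with printf octal escapes and passed in with -v rather than written as \xEF\xBB\xBF, on the assumption that this is more portable across awk implementations, and LC_ALL=C is set so that awk works on raw bytes.

```shell
# Hypothetical sample data: one line with a BOM at the start and one mid-line.
tmp=$(mktemp -d)
bom=$(printf '\357\273\277')   # the three UTF-8 BOM bytes EF BB BF, in octal
printf '%sfoo%sbar\n' "$bom" "$bom" > "$tmp/f1.txt"

# sub() replaces only the first match on each line, so the mid-line BOM survives.
LC_ALL=C awk -v bom="$bom" '{ sub(bom, "") } 1' "$tmp/f1.txt" > "$tmp/f2_sub.txt"

# gsub() replaces every match on each line.
LC_ALL=C awk -v bom="$bom" '{ gsub(bom, "") } 1' "$tmp/f1.txt" > "$tmp/f2.txt"

# Compare sizes: each BOM is 3 bytes.
wc -c < "$tmp/f1.txt"       # 13 bytes: two BOMs + "foobar" + newline
wc -c < "$tmp/f2_sub.txt"   # 10 bytes: one BOM still present
wc -c < "$tmp/f2.txt"       # 7 bytes: no BOM left
```

Applied to the command from the question, the only change needed is sub to gsub. With gawk, gensub(bom, "", "g") would also work, but its return value has to be assigned back (for example to $0) because it does not modify its target.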