
Awk Script for fastText vectors - Error: "unexpected character 0xe2" when there's no such character


I ran the following Awk script to get fastText vectors on Ubuntu 22.04.2 LTS (Jammy Jellyfish). However, I always get the same error: awk: lines 5 and 13: unexpected character 0xe2

The Awk script that combines a .txt wordlist into a file with vectors:

$ awk -f combine.awk 

BEGIN{

 infile = "adjectives.txt"
 while (getline < infile > 0) {
   INCLUDE[$1]=1
 } 
 close(infile)
 
 infile = "cc.en.300.vec"
 outfile = "fasttextvectors_adjectives.txt"
 system("rm " outfile)
 while (getline < infile > 0) {
   if ($1 in INCLUDE) print >> outfile
 } 
 close(infile)
 close(outfile)
 


}

I suspect there is something wrong in the Awk script itself, but I have seen someone run the same script on their Mac without problems. Is it something about Ubuntu?

I've already tried:

  • Making sure the word list doesn't contain any words with special characters;
  • Changing the .txt list UTF-8 encoding for Mac, Linux, Windows;
  • Making sure the file names also do not contain special characters.

Still, I always get the same error:

awk: lines 5 and 13: unexpected character 0xe2

There are no special characters in the word list itself

These are lines 5 and 13 of the awk script (maybe the special character is '$'?):

INCLUDE[$1]=1
if ($1 in INCLUDE) print >> outfile

Any help would be greatly appreciated. Also, I am a student and just a beginner with word embeddings and vectors.

Thank Youuu!


Solution

  • The problem isn't unexpected characters in your input; it's unexpected characters in your script, probably DOS line endings added by whatever editor you used to create it. See "Why does my tool output overwrite itself and how do I fix it?" for how to identify and handle those. And no, $ is not an unexpected character.
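One way to check a script for stray bytes is sketched below. A carriage return is byte 0x0d, while 0xe2 is the first byte of several multi-byte UTF-8 characters (curly quotes and some Unicode spaces, for example), so it is worth looking for both. The file name demo.awk and its contents are made up for illustration; substitute your own script (e.g. combine.awk) in the last two commands.

```shell
# Hypothetical demo script: line 2 ends with a UTF-8 narrow no-break space
# (bytes e2 80 af, written here in octal) and line 3 ends with a carriage return.
printf 'BEGIN {\n  x = 1\342\200\257\n  y = 2\r\n}\n' > demo.awk

# Print the number of every line containing a byte that is neither a tab
# nor printable ASCII. LC_ALL=C makes the bracket expression byte-based.
LC_ALL=C awk '/[^\t -~]/ { print FNR }' demo.awk

# Dump the raw bytes so the offending characters can be identified.
od -c demo.awk
```

Once the offending lines are known, retyping them in a plain-text editor (or deleting the stray bytes) should make the "unexpected character" error go away.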

    Aside from the problem you're asking about, your awk script has multiple issues:

    BEGIN{                                     # 1
    
     infile = "adjectives.txt"
     while (getline < infile > 0) {            # 2
       INCLUDE[$1]=1                           # 3
     } 
     close(infile)
     
     infile = "cc.en.300.vec"
     outfile = "fasttextvectors_adjectives.txt"
     system("rm " outfile)                      # 4
     while (getline < infile > 0) {             # 2 again
       if ($1 in INCLUDE) print >> outfile      # 5
     } 
     close(infile)
     close(outfile)
    }
    

    The issues with the above are:

    1. The main issue is that you wrote your script as if you were writing a C program, using while-getline loops when awk already has that functionality built in, so you're missing a large part of the reason to use awk. Written in idiomatic awk, your whole script should just be:
    awk 'NR==FNR{a[$1]; next} $1 in a' adjectives.txt cc.en.300.vec > fasttextvectors_adjectives.txt
    
    2. while (getline < infile > 0) is ambiguous: different awks could read that as while ((getline < infile) > 0) or while (getline < (infile > 0)) or something else. You MUST write a while-getline loop as while ((getline < infile) > 0) so you're guaranteed to be testing the return code from getline.
    3. INCLUDE[$1]=1 - don't use all-upper-case names for user-defined variables. That way they can't clash with builtin variable names, and they don't obfuscate your code by looking like builtin variables. You also don't need to set the array content to 1 for how you're using it; just INCLUDE[$1] would be enough.
    4. system("rm " outfile) is passing the value stored in outfile to the shell unquoted. It should be system("rm \047" outfile "\047") so what the shell sees is rm 'fasttextvectors_adjectives.txt'. If you just want to empty a file, though, consider doing printf "" > outfile instead to avoid needing to create a subshell with system().
    5. awk is not shell, and >> in awk does not mean the same as it does in shell (see the awk man page). Change >> to > and then you won't need to remove or otherwise initialize outfile before the loop, unless you're worried about not overwriting a previous output file if there's no input this time.
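As a rough sketch of how the points above fit together, the following runs the idiomatic one-liner and a getline version with the listed fixes applied (parenthesized getline test, lower-case array name, > instead of >>, no rm) and compares their output. The file names and sample data are made up for illustration and stand in for adjectives.txt and cc.en.300.vec; the array name wanted is likewise just an example.

```shell
# Hypothetical sample data: a word list and a tiny "vectors" file.
printf 'big\nred\n' > words_demo.txt
printf 'big 0.1 0.2\ncat 0.3 0.4\nred 0.5 0.6\n' > vec_demo.txt

# Idiomatic version: while reading the first file (NR==FNR) store each word
# as an array key; for the second file print lines whose first field is a key.
awk 'NR==FNR { a[$1]; next } $1 in a' words_demo.txt vec_demo.txt > idiomatic_demo.txt

# Simulate output left over from an earlier run.
echo 'stale line from an earlier run' > getline_demo.txt

# getline version with the fixes applied.
awk 'BEGIN {
    infile = "words_demo.txt"
    while ((getline < infile) > 0) {   # parentheses: test the getline return code
        wanted[$1]                     # referencing the element creates the key
    }
    close(infile)

    infile  = "vec_demo.txt"
    outfile = "getline_demo.txt"
    while ((getline < infile) > 0) {
        # ">" truncates the file once, when it is first opened, then appends
        if ($1 in wanted) print > outfile
    }
    close(infile)
    close(outfile)
}'

cat getline_demo.txt
cmp idiomatic_demo.txt getline_demo.txt && echo "outputs match"
```

Note that the stale line is only overwritten here because at least one line matches; with no matches the output file is never opened, which is the caveat mentioned in the last point.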