I ran the following Awk script to get fastText vectors on Ubuntu 22.04.2 LTS (Jammy Jellyfish), but I always get the same error: awk: lines 5 and 13: unexpected character 0xe2
$ awk -f combine.awk
BEGIN{
infile = "adjectives.txt"
while (getline < infile > 0) {
INCLUDE[$1]=1
}
close(infile)
infile = "cc.en.300.vec"
outfile = "fasttextvectors_adjectives.txt"
system("rm " outfile)
while (getline < infile > 0) {
if ($1 in INCLUDE) print >> outfile
}
close(infile)
close(outfile)
}
**I suspect there is something wrong in the Awk script itself, but I have seen someone run the same script on their Mac without any problem. Is it something about Ubuntu?
I've already tried:**
Still, I always get the same error:
INCLUDE[$1]=1
if ($1 in INCLUDE) print >> outfile
Any help would be greatly appreciated. Also, I am a student and just a beginner with word embeddings and vectors.
Thank Youuu!
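The problem isn't unexpected characters in your input, it's unexpected characters in your script, possibly DOS line endings added by whatever editor you used to create it. See "Why does my tool output overwrite itself and how do I fix it?" for how to identify and handle those. Bear in mind, though, that a carriage return is byte 0x0d, while 0xe2 is the first byte of many multi-byte UTF-8 characters (curly quotes, typographic spaces and the like), so also check the lines awk reports for characters like those, e.g. indentation picked up by copy/pasting the script from a web page. And no, `$` is not an unexpected character.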
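As a rough sketch of how you might look for stray bytes in the script itself: the commands below build a throwaway `demo.awk` (a hypothetical file, not your `combine.awk`) whose second line is indented with U+2003 (bytes e2 80 83) and ends in a carriage return, then locate and strip those bytes. It assumes `grep` accepts `-a`, as the GNU grep on Ubuntu does.

```shell
# Throwaway demo script: line 2 starts with U+2003 (e2 80 83) and ends with CR.
tmp=$(mktemp -d)
printf 'BEGIN {\n\342\200\203x = 1\r\n}\n' > "$tmp/demo.awk"

# Line numbers containing bytes that are neither printable ASCII nor whitespace.
# LC_ALL=C makes grep work byte by byte; -a forces it to treat the file as text.
bad_lines=$(LC_ALL=C grep -an '[^[:print:][:space:]]' "$tmp/demo.awk" | cut -d: -f1)
echo "lines with non-ASCII bytes: $bad_lines"

# Hex dump of the offending lines, to see exactly which bytes are there.
LC_ALL=C grep -a '[^[:print:][:space:]]' "$tmp/demo.awk" | od -An -tx1

# Number of lines containing a carriage return (DOS line endings).
cr_count=$(grep -c "$(printf '\r')" "$tmp/demo.awk")
echo "lines with CR: $cr_count"

# Write a cleaned copy with CRs and all bytes >= 0x80 deleted.
# Note this would also delete non-ASCII characters inside string literals.
LC_ALL=C tr -d '\r\200-\377' < "$tmp/demo.awk" > "$tmp/clean.awk"
```

To try this on your own file, point the `grep` and `tr` commands at `combine.awk` instead, and re-indent the reported lines with ordinary spaces or tabs if the indentation turns out to be the culprit.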
Aside from the problem you're asking about, your awk script has multiple issues:
BEGIN{ # 1
infile = "adjectives.txt"
while (getline < infile > 0) { # 2
INCLUDE[$1]=1 # 3
}
close(infile)
infile = "cc.en.300.vec"
outfile = "fasttextvectors_adjectives.txt"
system("rm " outfile) # 4
while (getline < infile > 0) { # 2 again
if ($1 in INCLUDE) print >> outfile # 5
}
close(infile)
close(outfile)
}
The issues with the above are:
1. `BEGIN{`: the script does all of its work in a `BEGIN` section with explicit `getline` loops instead of letting awk read the input files itself. The whole script could be written as just:

       awk 'NR==FNR{a[$1]; next} $1 in a' adjectives.txt cc.en.300.vec > fasttextvectors_adjectives.txt

2. `while (getline < infile > 0)` is ambiguous: different awks could read that as `while ((getline < infile) > 0)` or `while (getline < (infile > 0))` or something else. You MUST write a while-getline loop as `while ((getline < infile) > 0)` so you're guaranteed to be testing the return code from `getline`.
3. `INCLUDE[$1]=1`: don't use all-upper-case names for user-defined variables, so they don't clash with builtin variable names and don't look like builtin variable names, which obfuscates your code. You also don't need to set the array content to `1` for how you're using it; just `INCLUDE[$1]` would be enough.
4. `system("rm " outfile)` is passing the value stored in `outfile` to the shell unquoted; it should be `system("rm \047" outfile "\047")` so what the shell sees is `rm 'fasttextvectors_adjectives.txt'`. If you just want to empty a file, though, consider doing `printf "" > outfile` instead, to avoid needing to create a subshell with `system()`.
5. `>>` in awk does not mean the same as it does in shell (see the awk man page). Change `>>` to `>` and then you won't need to remove or otherwise initialize `outfile` before the loop, unless you're worried about not overwriting a previous output file if there's no input this time.
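For illustration, here is a sketch of the original BEGIN/getline script with points 2 to 5 applied, run next to the one-liner from point 1 on a tiny made-up data set. The sample file contents and the lower-case array name `include` are assumptions for the demo; the real `cc.en.300.vec` is of course much larger.

```shell
# Hypothetical sample data standing in for the real input files.
tmp=$(mktemp -d)
cd "$tmp" || exit 1
printf 'big\nsmall\n' > adjectives.txt
printf 'big 0.1 0.2\ncat 0.3 0.4\nsmall 0.5 0.6\n' > cc.en.300.vec

# BEGIN/getline version with issues 2-5 addressed.
cat > combine.awk <<'EOF'
BEGIN {
    infile = "adjectives.txt"
    while ((getline < infile) > 0) {    # parenthesized: tests getline's return code
        include[$1]                     # referencing the element creates it
    }
    close(infile)

    infile  = "cc.en.300.vec"
    outfile = "fasttextvectors_adjectives.txt"
    printf "" > outfile                 # empty the output file without a subshell
    while ((getline < infile) > 0) {
        if ($1 in include) print > outfile   # ">" truncates only on first open
    }
    close(infile)
    close(outfile)
}
EOF
awk -f combine.awk

# Idiomatic version from point 1, written to a separate file for comparison.
awk 'NR==FNR{a[$1]; next} $1 in a' adjectives.txt cc.en.300.vec > oneliner.txt

# Both approaches should select the same lines.
cmp fasttextvectors_adjectives.txt oneliner.txt && echo "outputs match"
cat fasttextvectors_adjectives.txt
```

With this sample data both versions keep the `big` and `small` lines and drop `cat`. The one-liner is still the version to prefer, since awk then handles opening, reading and closing the files for you.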