I have 1M word vectors in fastText format (ignoring the first line, which contains the vocabulary size and dimension). Every line is a word followed by 300 numbers, all space-separated, e.g.
Word 1.00 0.50 -2.30
WORD 0.90 0.40 -2.20
How can I keep only the first line on which a word appears, ignoring case, and remove all later lines for that word? For example, since Word appeared first, the line with WORD is deleted and the output is
Word 1.00 0.50 -2.30
I can use tr '[:upper:]' '[:lower:]' < wiki-news-300d-1M.vec to convert all words to lowercase, but that destroys the original casing of the words. I know how to remove duplicate lines when the entire line, numbers included, matches, but that is not useful here. My Python solution would be to keep a dict keyed on the lowercase form of each word and check each line's word against it, but I am curious about an awk/sed (or even grep) solution.
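Use tolower($1) as the key in an array in awk. The expression !a[tolower($1)]++ is true only the first time a given lowercased word is seen, so only that line is printed, with its original casing intact: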
awk '!a[tolower($1)]++' wiki-news-300d-1M.vec
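If the one-liner is too terse, here is a rough long-form sketch of the same idea. The file name sample.vec, the array name seen, and the third sample line are made up for illustration, and the vectors are shortened to three numbers as in the question.

```shell
# Hypothetical sample file (sample.vec is an illustrative name, not from the question).
printf '%s\n' 'Word 1.00 0.50 -2.30' 'WORD 0.90 0.40 -2.20' 'other 0.10 0.20 0.30' > sample.vec

# Long form of the one-liner: seen[] records each lowercased word already encountered.
result=$(awk '{
    key = tolower($1)        # case-insensitive key; the line itself is left untouched
    if (!(key in seen)) {    # first occurrence of this word in any casing
        print                # emit the original line, original casing preserved
    }
    seen[key] = 1            # remember the word for all later lines
}' sample.vec)

printf '%s\n' "$result"
# Word 1.00 0.50 -2.30
# other 0.10 0.20 0.30
```

The array ends up with one entry per distinct lowercased word, so memory grows with the vocabulary rather than with the full file. If the header line is still present in the file, putting a rule such as NR == 1 { print; next } in front would pass it through unchanged, though the vocabulary count on that line would no longer match the deduplicated output.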