
Removing lines based on duplicate first word, ignoring case


I have 1M word vectors in fasttext format (ignoring the first line containing vocab size and dim). Every line is a word followed by 300 numbers, all space-separated, e.g.

Word 1.00 0.50 -2.30
WORD 0.90 0.40 -2.20

How can I keep the first line a word appears in, ignoring case, and remove all further lines? For example, since Word appeared first, the line with WORD is deleted and the output is

Word 1.00 0.50 -2.30

I can use tr '[:upper:]' '[:lower:]' < wiki-news-300d-1M.vec to convert all words to lowercase, but that destroys the original casing of the words. I know how to remove duplicate lines when the entire line, including the numbers, matches, but that is not useful here. My Python solution would be to keep a dict storing the lowercase form of each word and check each line's word against that dict, but I am curious about an awk/sed (or even grep) solution.


Solution

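  • Use tolower($1) as the key in an array in awk. The expression !a[tolower($1)]++ is true only the first time a given lowercased first field is seen, so awk prints that line with its original case and skips every later line with the same word.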

    awk '!a[tolower($1)]++' wiki-news-300d-1M.vec
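
For readers who find the one-liner terse, here is a longer sketch of the same idea with the steps spelled out. The function name dedupe_first_word and the array name seen are made up for illustration, and the third sample line is invented test data. It assumes a POSIX awk; how tolower treats non-ASCII letters can depend on the awk implementation and locale, so it is worth checking if the vocabulary contains accented words.

```shell
# Hypothetical helper: keep only the first line for each first word,
# comparing words case-insensitively. Reads files given as arguments,
# or standard input when none are given.
dedupe_first_word() {
    awk '{
        key = tolower($1)       # case-folded copy of the first field
        if (!(key in seen)) {   # this word has not appeared before
            seen[key] = 1       # remember it
            print               # print the original line, case intact
        }
    }' "$@"
}

# Small demonstration on sample data (third line is invented).
printf '%s\n' \
    'Word 1.00 0.50 -2.30' \
    'WORD 0.90 0.40 -2.20' \
    'other 0.10 0.20 0.30' | dedupe_first_word
```

On the sample input this should print the Word line and the other line, dropping WORD. The array holds one entry per distinct lowercased word, so memory grows with the vocabulary size rather than with the full file.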