Automated awk command to remove duplicated fields in a VCF's INFO column keeping the first occurrence

Duplicated fields in my VCF files are causing problems for other programs. A VCF file is a tab-separated file. One cell of the INFO column is given below. The structure of the cell is:


I need a script that removes every field repeated after its first occurrence and replaces the cell with:


The actual cell...



  • For the example data:


    This assumes that the first instance of each name is always the one you want to keep.

    echo "info1=x;info2=y;xyz=abc;info1=othervalue;info2=." | sed -e 's/;/\n/g' | awk -F= '{ if ($1 in words == 0) words[$1]=$2 } END { for (w in words) printf "%s=%s;", w, words[w] }'

    the result:


    You will need to adapt this to your data.
    I used gawk v4.0.2.

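    1. sed splits the string into lines by replacing each semicolon with a newline (GNU sed accepts \n in the replacement),
    2. the -e option introduces the sed script (expression) that follows; it is optional when there is only one script,
    3. -F= tells awk to split each line into fields at the = sign,
    4. { if ($1 in words == 0) words[$1]=$2 } checks, for each line, whether field 1 is already a key in the words array and, if it is not, adds it with field 2 as its value,
    5. END {...} runs after the last line has been read; it loops over all elements of the words array, and printf prints each key and value in the expected format. Note that for (w in words) visits the keys in an unspecified order, so the fields may not come out in their original order, and the output ends with a trailing semicolon.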
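If the original field order matters, or the whole file has to be processed rather than a single cell, the same idea can be applied directly to the INFO column. The following is only a sketch, not a tested tool: it assumes a standard VCF layout in which INFO is the 8th tab-separated column and header lines start with #. The function name dedup_info is made up for this example.

```shell
# dedup_info: hypothetical helper; reads VCF text on stdin, writes it to stdout.
# Assumption: INFO is column 8 and header lines start with "#".
dedup_info() {
  awk 'BEGIN { FS = OFS = "\t" }
  /^#/ || NF < 8 { print; next }   # pass headers and short lines through unchanged
  {
    n = split($8, parts, ";")      # split the INFO cell into its fields
    split("", seen)                # clear the set of keys seen on this line
    out = ""
    for (i = 1; i <= n; i++) {
      key = parts[i]
      sub(/=.*/, "", key)          # key is everything before the first "="
      if (key in seen) continue    # drop later occurrences of the same key
      seen[key] = 1
      out = (out == "" ? "" : out ";") parts[i]
    }
    $8 = out                       # put the deduplicated cell back
    print
  }'
}
```

It could be called as dedup_info < input.vcf > output.vcf (both file names are placeholders). Because the loop walks the fields by index instead of using for (w in words), the first occurrences stay in their original order, values that contain an = sign are kept whole, and flag fields without a value are left as they are.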