Is there a Linux command for string subtraction between columns?


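I'm converting some SNP columns into VCF format. For each row, I need to remove the ref allele from the ALT column and join the remaining alleles with commas.

The input columns are as follows: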

ref     ALT 
A       A G 
A       A T 
T       C T 
G       G T 
A       A G 
C       C G T 
G       A G 
T       C T 
T       A G T

Expected output:

ref     ALT
A       G
A       T
T       C
G       T
A       G
C       G,T
G       A
T       C
T       A,G

Solution

  • $ awk 'BEGIN{FS=OFS="\t"} NR>1{sub($1," ",$2); gsub(/^ +| +$/,"",$2); gsub(/ +/,",",$2)} 1' file
    ref     ALT
    A       G
    A       T
    T       C
    G       T
    A       G
    C       G,T
    G       A
    T       C
    T       A,G
    

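    The above only works when $1 contains no regex metacharacters and is not a substring of any other allele in $2, because sub() treats $1 as a regular expression and replaces its first match anywhere in $2. It also assumes the two columns are separated by a tab.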
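
If either of those caveats applies to your data, one possible workaround is to split $2 into tokens and compare each one to $1 as a plain string instead of using sub(). The following is only a sketch under a few assumptions: the columns are tab-separated, the ALT alleles are space-separated, and `subtract_ref` is a hypothetical helper name that is not part of the original answer.

```shell
# subtract_ref is a hypothetical helper name; it reads tab-separated rows on stdin.
subtract_ref() {
    awk 'BEGIN { FS = OFS = "\t" }
    NR > 1 {
        n = split($2, alleles, / +/)    # ALT alleles are space-separated
        out = ""
        for (i = 1; i <= n; i++) {
            # keep non-empty tokens that differ from the ref allele (string comparison)
            if (alleles[i] != "" && (alleles[i] "") != ($1 "")) {
                out = (out == "" ? "" : out ",") alleles[i]
            }
        }
        $2 = out
    }
    1'
}

# Usage: a ref allele that is a substring of another allele is left alone.
printf 'ref\tALT\nA\tAT A\nC\tC G T\n' | subtract_ref
# prints:
# ref     ALT
# A       AT
# C       G,T
```

Because the comparison is an exact string match per token, regex metacharacters in $1 have no special meaning here, and empty tokens produced by leading or trailing spaces are skipped.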