Search code examples
bashunixbioinformaticsvcf-variant-call-format

pull string from a vcf file using awk


i'm running the following code to manipulate numeric data in a vcf table.

 cat inputfile | while read row; do
                echo $row > tmp
                originalProb= `awk '{print $1}' tmp`
                probabilityHom1=`awk '{print $2}' tmp`
                probabilityHom2=`awk '{print $4}' tmp`
                numCols=`awk '{print NF}' tmp`

                if [ $numCols -gt 4 ]; then
                        echo "${originalProb}" >> currentRowGenotypes
                elif [ "$probabilityHom1" -gt "$probabilityHom2" ]; then
                        echo "1/1" >> currentRowGenotypes
                elif [ "$probabilityHom1" -lt "$probabilityHom2" ]; then
                        echo "0/0" >> currentRowGenotypes
                elif [ "$probabilityHom1" -eq "$probabilityHom2" ] && [ "$probabilityHom1" -eq 0 ]; then
                        echo "${originalProb}" >> currentRowGenotypes
                else                    
                        echo "het" >> currentRowGenotypes
                fi

        done

        cat tmpHeaders currentRowGenotypes > currentFullCol

the input file looks like this

1/1     255     231     0
0/1     255     0       152
0/1     255     0       82
0/1     255     0       151
0/1     239     0       31
0/1     255     0       255

for some reason the awk command doesn't recognize the first column. any suggestions ?


Solution

  • It is not a good idea to create a temporary file just to make awk split the line into columns because:

    • It causes an overhead to create a temporary file line by line.
    • It spawns child processes multiple times to invoke awk.
    • The difference of the syntax between bash and awk may be the cause of mistake.

    You can do it without using awk. Please try the following:

    while read -ra row; do
        originalProb="${row[0]}"
        probabilityHom1="${row[1]}"
        probabilityHom2="${row[3]}"
        numCols="${#row}"
    
        if (( numCols > 4 )); then
            echo "$originalProb" >> currentRowGenotypes
        elif (( probabilityHom1 > probabilityHom2 )); then
            echo "1/1" >> currentRowGenotypes
        elif (( probabilityHom1 < probabilityHom2 )); then
            echo "0/0" >> currentRowGenotypes
        elif (( probabilityHom1 == probabilityHom2 &&  probabilityHom1 == 0 )); then
            echo "$originalProb" >> currentRowGenotypes
        else
            echo "het" >> currentRowGenotypes
        fi
    done < inputfile
    
    cat tmpHeaders currentRowGenotypes > currentFullCol
    

    As others repeatedly suggest, a better way will be to write with awk:

    awk '{
        originalProb = $1
        probabilityHom1 = $2
        probabilityHom2 = $4
        numCols = NF
    
        if ( numCols > 4 )
            print originalProb >> "currentRowGenotypes"
        else if ( probabilityHom1 > probabilityHom2 )
            print "1/1" >> "currentRowGenotypes"
        else if ( probabilityHom1 < probabilityHom2 )
            print "0/0" >> "currentRowGenotypes"
        else if ( probabilityHom1 == probabilityHom2 && probabilityHom1 == 0 )
            print originalProb >> "currentRowGenotypes"
        else
            print "het" >> "currentRowGenotypes"
    }' inputfile
    
    cat tmpHeaders currentRowGenotypes > currentFullCol
    

    Hope this helps.