Tags: regex, bash, large-data

How to extract a pattern but fill missing values in bash?


I have a large tab-delimited file (dummy.vcf) in which one column holds ';'-delimited variables. For example:

AF_female=0.00000e+00;non_topmed_AF_female=0.00000e+00;control_AF_female=0.00000e+00
control_AF_female=0.00000e+00;non_topmed_AF_female=0.00000e+00
AF_female=0.00008e+00;non_topmed_AF_female=0.00000e+00

I would like to extract the "AF_female=X" string from each row and fill in a placeholder where it is missing, so the new file has the same number of lines as the original. For example:

AF_female=0.00000e+00  
NA  
AF_female=0.00008e+00 

I have tried:

grep -o ';AF_female=[0-9].[0-9]*..[0-9]*' dummy.vcf

However, this prints nothing for rows where the pattern is not matched, so the output has fewer rows than the input.

Any help will be very much appreciated!


Solution

  • If awk is acceptable, the following should work:

    awk -F';' '
    {
      val=""
      for(i=1;i<=NF;i++){
         if($i ~ /^AF_female=[0-9]+/){
             val=(val?val OFS $i:$i)
         }
      }
      if(val!=""){
         print val
      }
      else{
         print "NA"
      }
    }'  Input_file
    

    It collects every field on a line that starts with AF_female= followed by digits, and prints NA when a line has no such field.

    For the sample input, the output is:

    AF_female=0.00000e+00
    NA
    AF_female=0.00008e+00
    

    Explanation: the same command, annotated line by line.

    awk -F';' '                           ##Start the awk program; fields are split on semicolons.
    {
      val=""                              ##Reset val for each line.
      for(i=1;i<=NF;i++){                 ##Loop over every field; NF is the number of fields in the current line.
         if($i ~ /^AF_female=[0-9]+/){    ##If the field starts with AF_female= followed by at least one digit:
             val=(val?val OFS $i:$i)      ##append it to val, separated by OFS when val already holds a match.
         }
      }
      if(val!=""){                        ##After the loop, if at least one field matched:
         print val                        ##print the collected match(es).
      }
      else{                               ##Otherwise:
         print "NA"                       ##print NA so the output keeps one line per input line.
      }
    }' Input_file                         ##Input_file is the file to read.
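
One caveat: the question describes dummy.vcf as tab delimited, with the ';' list in just one column. With -F';' alone, the first ';'-separated piece of each line also carries the preceding tab-separated columns, so an AF_female entry at the very start of that column would not match the ^ anchor. A minimal sketch of one possible workaround, assuming it is acceptable to split on both tabs and semicolons; the sample rows and the leading chr1/position columns are made up for illustration, and the regex is relaxed to match any AF_female= value:

```shell
# Hypothetical sample: two leading tab-separated columns, then the ';' list.
tmpfile=$(mktemp)
printf 'chr1\t100\tAF_female=0.00000e+00;non_topmed_AF_female=0.00000e+00\n' > "$tmpfile"
printf 'chr1\t200\tcontrol_AF_female=0.00000e+00;non_topmed_AF_female=0.00000e+00\n' >> "$tmpfile"
printf 'chr1\t300\tnon_topmed_AF_female=0.00000e+00;AF_female=0.00008e+00\n' >> "$tmpfile"

# Split on tab OR semicolon so every key=value pair becomes its own field.
result=$(awk -F'[\t;]' '
{
  val = ""
  for (i = 1; i <= NF; i++)
    if ($i ~ /^AF_female=/)              # anchored, so non_topmed_AF_female is skipped
      val = (val == "" ? $i : val OFS $i)
  print (val == "" ? "NA" : val)         # always one output line per input line
}' "$tmpfile")

printf '%s\n' "$result"
rm -f "$tmpfile"
```

Each input row yields exactly one output row, and the second sample row, which has no AF_female entry, comes out as NA. If a line ever held more than one matching field, the matches would be joined with a space (the default OFS), as in the original command.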