I have a large tab delimited file (dummy.vcf) with a column of ';' delimited variables. For example:
AF_female=0.00000e+00;non_topmed_AF_female=0.00000e+00;control_AF_female=0.00000e+00
control_AF_female=0.00000e+00;non_topmed_AF_female=0.00000e+00
AF_female=0.00008e+00;non_topmed_AF_female=0.00000e+00
I would like to extract the "AF_female=X" string for each row with missing values filled in, so the new file is the same length as the original. For example:
AF_female=0.00000e+00
NA
AF_female=0.00008e+00
I have tried:
grep -o ';AF_female=[0-9].[0-9]*..[0-9]*' dummy.vcf
However, this does not add rows for when the pattern is not matched.
Any help will be very much appreciated!
Could you please try the following, if you are OK with awk:
awk -F';' '
{
val=""
for(i=1;i<=NF;i++){
if($i ~ /^AF_female=[0-9]+/){
val=(val?val OFS $i:$i)
}
}
if(val){
print val
}
else{
print "NA"
}
}' Input_file
It checks every field on a line for AF_female= followed by digits and prints each match. If a line has no match, it prints NA instead, so the output has one line per input line.
Output will be as follows.
AF_female=0.00000e+00
NA
AF_female=0.00008e+00
Explanation: the same command with comments added.
awk -F';' ' ##Start the awk program with semicolon as the field separator.
{
val="" ##Reset val to empty for each line.
for(i=1;i<=NF;i++){ ##Loop over fields 1 to NF, where NF is the number of fields in the current line.
if($i ~ /^AF_female=[0-9]+/){ ##If the field starts with AF_female= followed by digits, then:
val=(val?val OFS $i:$i) ##Append the field to val, separated by OFS if val already holds a match.
}
}
if(val!=""){ ##After the loop, if val is not empty, then:
print val ##Print the collected match(es).
}
else{ ##Otherwise no field matched on this line.
print "NA" ##Print NA as the placeholder.
}
}' Input_file ##Name of the input file.
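One caveat: the command above splits only on ';', so it assumes each line consists of the ';' delimited list alone. If that list is one column of a tab delimited line, as the question describes, the first key is glued to the preceding columns and the ^ anchor will not match it. A minimal sketch of a variant that splits on either a tab or a semicolon follows. The sample rows (chr1, 100, and so on) and the temporary file are made up for illustration, and the pattern is relaxed to /^AF_female=/.

```shell
# Hypothetical sample: three tab-delimited columns, with the ';' list in column 3.
tmp=$(mktemp)
printf 'chr1\t100\tAF_female=0.00000e+00;non_topmed_AF_female=0.00000e+00;control_AF_female=0.00000e+00\n' > "$tmp"
printf 'chr1\t200\tcontrol_AF_female=0.00000e+00;non_topmed_AF_female=0.00000e+00\n' >> "$tmp"
printf 'chr1\t300\tAF_female=0.00008e+00;non_topmed_AF_female=0.00000e+00\n' >> "$tmp"

# Split on tab OR semicolon, so a key at the start of the column is its own field.
result=$(awk -F'[;\t]' '
{
  val = ""                                  # reset for each line
  for (i = 1; i <= NF; i++)
    if ($i ~ /^AF_female=/)                 # anchored, so non_topmed_AF_female is skipped
      val = (val == "" ? $i : val OFS $i)   # collect matches, OFS-separated
  print (val == "" ? "NA" : val)            # one output line per input line
}' "$tmp")

printf '%s\n' "$result"
rm -f "$tmp"
```

Because every input line produces exactly one print, the output keeps the same number of lines as the input.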