Bash: text processing command


I have been able to do what I want with one command per line, but I know there must be a more elegant way to do it. Please tell me what your methods are. I would like to learn more sophisticated ways of processing text files.

The original file is a VCF file that looks like this:

##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##fileDate=20180307
##source=PLINKv1.90
##contig=<ID=1,length=249214117>
##contig=<ID=2,length=242842533>
##contig=<ID=3,length=197896741>
...
...
...
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT
22  16258171    22:16258171:D:3 A   .   .   .   .   GT
22  16258174    22:16258174:T:C T   .   .   .   .   GT
22  16258183    22:16258183:A:T A   .   .   .   .   GT
22  16258189    22:16258189:G:T G   .   .   .   .   GT

My goal is to generate a file that looks like this:

22  16258171  16258171  D  3
22  16258174  16258174  T  C
22  16258183  16258183  A  T
22  16258189  16258189  G  T
22  16258211  16258211  A  G
22  16258211  16258211  A  T
22  16258220  16258220  T  G
22  16258221  16258221  C  T
22  16258224  16258224  C  T
22  16258227  16258227  G  A

I used the following steps to reach that goal, but they are cumbersome and ugly:

#remove comments
sed '/^[[:blank:]]*#/d;s/#.*//' chr22.vcf > no_comment_chr22.vcf

#take out the third column for splitting
cut -d $'\t' -f 3 no_comment_chr22.vcf > no_comment_chr22.col3_to_split.txt

#Split string by delimiter and get N-th element, use as col4
cut -d':' -f3 no_comment_chr22.col3_to_split.txt > chr22_as_col4.txt

#Split string by delimiter and get N-th element, use as col5
cut -d':' -f4 no_comment_chr22.col3_to_split.txt > chr22_as_col5.txt

#get first 2 columns
cut -d $'\t' -f 1-2 no_comment_chr22.vcf > no_comment_chr22.col1to2.txt

#get the second column as col3 
cut -d $'\t' -f 2 no_comment_chr22.vcf > no_comment_chr22.ascol3.txt

#Combine files column-wise
paste no_comment_chr22.col1to2.txt no_comment_chr22.ascol3.txt chr22_as_col4.txt chr22_as_col5.txt | column -s $'\t' -t  > chr22_input_5cols.txt

I was able to get what I need, but this is so ugly. Please tell me what people do to advance their text-processing skills and how to improve things like this. Thank you!


Solution

  • Using awk:

    awk -F'(:| +)' '/^#/ {next} {print $1,$2,$4,$5,$6}' sample.vcf
    
    
    22 16258171 16258171 D 3
    22 16258174 16258174 T C
    22 16258183 16258183 A T
    22 16258189 16258189 G T
    

    This specifies a regular expression as the field delimiter (-F), so each line is split on either a colon or a run of spaces. Comment lines (^#) are skipped, and fields 1, 2, 4, 5 and 6 are printed for every other line.
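
    The sample in the question is displayed with spaces, but the cut -d $'\t' calls there suggest the real file is tab-delimited, in which case the ' +' part of the delimiter would not match the column separators. Below is a minimal sketch under that assumption: it splits on tabs and then breaks the ID column apart with awk's split(). The file name sample_tab.vcf and the function name vcf_to_5cols are hypothetical, used only for illustration.

```shell
# Build a tiny tab-delimited sample (hypothetical file name: sample_tab.vcf).
printf '##fileformat=VCFv4.2\n' > sample_tab.vcf
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\n' >> sample_tab.vcf
printf '22\t16258171\t22:16258171:D:3\tA\t.\t.\t.\t.\tGT\n' >> sample_tab.vcf
printf '22\t16258174\t22:16258174:T:C\tT\t.\t.\t.\t.\tGT\n' >> sample_tab.vcf

# Hypothetical helper: print chrom, pos, pos, and the two alleles from the ID.
# Assumes the ID column (field 3) has the form chrom:pos:allele1:allele2.
vcf_to_5cols() {
    awk -F'\t' '
        /^#/ { next }                  # skip meta and header lines
        {
            split($3, id, ":")         # id[3] and id[4] hold the two alleles
            print $1, $2, $2, id[3], id[4]
        }' "$1"
}

vcf_to_5cols sample_tab.vcf
# 22 16258171 16258171 D 3
# 22 16258174 16258174 T C
```

    The output is separated by single spaces (awk's default OFS). Piping it through column -t, as in the question, would align the columns.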