Search code examples
awkbioinformaticssamtoolsbam

Split a SAM file in Awk keeping N number of lines as header


I have a very big Sequence Alignment Map (SAM) file as depicted below

 @X   YYYYYY ZZZZZ\
 @X   ssssss ddddd\
 @X   CCCCCC LLLLL
 
> FFFFFF    117 ch1 16448   0   *   =   16448   0   TCTTGCACTGATCTGATGGACAGCATTGATGACATAACACGGAGACTGTTGCTAAAAACATCCGATAAAACTCGTGCTCAGACACCAAATACTCAAGAAG    JJFEJDDDBDJJJHJDDDHDJJEFJDJJCDFDJEJCEHHFDDDJDJEHEEJFJJJHDIFJJJJJDJDDHHJCDDJJFJFJEJFEDJJJDH##########    MC:Z:55S22M23S  RG:Z:Sample_POP1    AS:i:0  XS:i:0
> FFFFFF    117 ch6 16448   0   *   =   16448   0   TCTTGCACTGATCTGATGGACAGCATTGATGACATAACACGGAGACTGTTGCTAAAAACATCCGATAAAACTCGTGCTCAGACACCAAATACTCAAGAAG    JJFEJDDDBDJJJHJDDDHDJJEFJDJJCDFDJEJCEHHFDDDJDJEHEEJFJJJHDIFJJJJJDJDDHHJCDDJJFJFJEJFEDJJJDH##########    MC:Z:55S22M23S  RG:Z:Sample_POP1    AS:i:0  XS:i:0
> FFFFFF    117 ch2 16448   0   *   =   16448   0   TCTTGCACTGATCTGATGGACAGCATTGATGACATAACACGGAGACTGTTGCTAAAAACATCCGATAAAACTCGTGCTCAGACACCAAATACTCAAGAAG    JJFEJDDDBDJJJHJDDDHDJJEFJDJJCDFDJEJCEHHFDDDJDJEHEEJFJJJHDIFJJJJJDJDDHHJCDDJJFJFJEJFEDJJJDH##########    MC:Z:55S22M23S  RG:Z:Sample_POP1    AS:i:0  XS:i:0
> FFFFFF    117 ch5 16448   0   *   =   16448   0   TCTTGCACTGATCTGATGGACAGCATTGATGACATAACACGGAGACTGTTGCTAAAAACATCCGATAAAACTCGTGCTCAGACACCAAATACTCAAGAAG    JJFEJDDDBDJJJHJDDDHDJJEFJDJJCDFDJEJCEHHFDDDJDJEHEEJFJJJHDIFJJJJJDJDDHHJCDDJJFJFJEJFEDJJJDH##########    MC:Z:55S22M23S  RG:Z:Sample_POP1    AS:i:0  XS:i:0
> FFFFFF    117 ch1 16448   0   *   =   16448   0   TCTTGCACTGATCTGATGGACAGCATTGATGACATAACACGGAGACTGTTGCTAAAAACATCCGATAAAACTCGTGCTCAGACACCAAATACTCAAGAAG    JJFEJDDDBDJJJHJDDDHDJJEFJDJJCDFDJEJCEHHFDDDJDJEHEEJFJJJHDIFJJJJJDJDDHHJCDDJJFJFJEJFEDJJJDH##########    MC:Z:55S22M23S  RG:Z:Sample_POP1    AS:i:0  XS:i:0

I want to split the file based on column 3 so I can do awk '{print > $3}' file.txt which is working fine. Now I want to keep the lines

 @X   YYYYYY ZZZZZ\
 @X   ssssss ddddd\
 @X   CCCCCC LLLLL

as header on top of all the splitted files, how can I do that?

I tried this:

awk '$1 ~ /^@/ {print > $3}'  file.txt

Solution

  • You have to keep track of whether the file is one you have seen before, and if not, write the header before you write to it for the first time.

    awk '$1 ~ /^@/ { header = header $0 ORS; next }
       !seen[$3]++ { printf "%s", header >$3 }
       { print > $3 }' file.txt
    

    The internal variable ORS usually contains a newline but it's customary to use the variable so that you only need to change the string in one place if you want to use a different output record separator.

    This can run out of file handles if you have more than a couple of dozen distinct values in $3 but if your script otherwise works, I guess that's not a problem in your case.

    (The brute-force fix is to close and reopen the file after each write, which makes the script much slower. A better fix if you have the memory is to collect all the results into RAM and only write when you have read all the data. A more sophisticated approach would keep a buffer of, say, 20 open file handles, and close the least recently used when you need to write to a file which isn't among them.)