Search code examples
rbashsplitgziplarge-data

How to split a vcf.gz file based on the first column, keeping the header in each subset and save back to vcf.gz files


I have a large vcf.gz file (40GB) that I have to split to be able to load into R and run a script on each of the subset. I want to split it by the first column which worked with

zcat large_data.vcf.gz | cut -f1,2-5,8- | awk '{ print | ("gzip -c > " $1".vcf.gz") }'

But I would like to have the header saved in each subset. The header is not saved into a splitted data (what I thought it would do). It might be because the header starts with a #

#col1  col2  col3  col4  col5  col6  col7  col8

I tried

zcat large_data1.vcf.gz | cut -f1,2-5,8- | 
    awk 'NR == 1{header = $0; next} 
    !($1 in filename){ print header | (“gzip -c > “ $1 ".vcf.gz") } 
    NR > 1 { print $0 | (“gzip -c > “ $1 ".vcf.gz"); filename[$1] }' file

But something is wrong somewhere...

Any idea?
PS: -- filter is not a recognized option

Edit: an example of the data

#col1  col2  col3  col4  col5  col6  col7  col8
1  100  100  100  1000  110  100  110
1  110  100  110  500  200  150  160
2  140  120  100  1000  110  160  210
2  110  180  170  700  220  150  120

Desired data 1-

#col1  col2  col3  col4  col5  col6  col7  col8
1  100  100  100  1000  110  100  110
1  110  100  110  500  200  150  160

and 2-

#col1  col2  col3  col4  col5  col6  col7  col8
2  140  120  100  1000  110  160  210
2  110  180  170  700  220  150  120

Solution

  • You might have to save the header and the filenames in the awk program:

    zcat large_data.vcf.gz |
    cut -f1,2-5,8- |
    awk '
        BEGIN{ getline header }
        {
            filename = $1".vcf.gz"
            if (!seen[$1]) {
                print header | ("gzip -c > " filename)
                seen[$1]++
            }
            print | ("gzip -c > " filename)
        }
    '
    

    remark: why getline? because using NR==1 and NR>1 for a 40GB file is unnecessarily slower