Tags: bash, gnu-parallel, vcf-variant-call-format

GNU parallel read from several files


I am trying to use GNU parallel to convert individual files with a bioinformatics tool called vcf2maf.

My command looks something like this:

${parallel} --link "perl ${vcf2maf} --input-vcf ${1} \
                                    --output-maf ${maf_dir}/${2}.maf \
                                    --tumor-id ${3} \
                                    --tmp-dir ${vcf_dir} \
                                    --vep-path ${vep_script} \
                                    --vep-data ${vep_data} \
                                    --ref-fasta ${fasta} \
                                    --filter-vcf ${filter_vcf}" :::: ${VCF_files} ${results} ${tumor_ids}

VCF_files, results and tumor_ids contain one entry per line and correspond to one another.

When I run the command, I get the following error for every file:

ERROR: Both input-vcf and output-maf must be defined!

This confused me because the program works as intended when I run the command manually, so I don't think the input/output paths are wrong. To confirm this, I also ran

${parallel} --link "cat ${1}" :::: ${VCF_files} ${results} ${tumor_ids}

which correctly prints the contents of the VCF files whose paths are listed in VCF_files.

I am really confused about what I did wrong. If anyone could help me out, I'd be very thankful!

Thanks!


Solution

  • For a command this long I would normally define a function:

    doit() {
      ...
    }
    export -f doit
    

    Then test this on a single input.

    When it works:

    parallel --link doit :::: ${VCF_files} ${results} ${tumor_ids}
    

    But if you want to use a single command it will look something like:

    ${parallel} --link "perl ${vcf2maf} --input-vcf {1} \
                                    --output-maf ${maf_dir}/{2}.maf \
                                    --tumor-id {3} \
                                    --tmp-dir ${vcf_dir} \
                                    --vep-path ${vep_script} \
                                    --vep-data ${vep_data} \
                                    --ref-fasta ${fasta} \
                                    --filter-vcf ${filter_vcf}" :::: ${VCF_files} ${results} ${tumor_ids}
    

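    GNU Parallel's replacement strings are {1}, {2}, and {3}, not ${1}, ${2}, and ${3}. Inside double quotes the shell expands ${1}, ${2}, and ${3} as its own positional parameters (most likely empty here) before GNU Parallel ever sees the command, which is why vcf2maf complains that input-vcf and output-maf are not defined. The cat test only appeared to work: when a command contains no replacement string, GNU Parallel appends the arguments to the end of it.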

    --dryrun is your friend when GNU Parallel does not do what you expect it to do: it prints each command that would be run instead of running it.
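
The difference between ${1} and {1} can be seen without running GNU Parallel at all. The following is a minimal sketch, assuming the script has no positional parameters set; the shortened command string and the out/ directory are hypothetical stand-ins for the real vcf2maf call.

```shell
# Clear the positional parameters, as in a script called without arguments.
set --

# Inside double quotes the shell expands ${1} and ${2} immediately,
# so the placeholders are gone before GNU Parallel would see the string.
broken="perl vcf2maf.pl --input-vcf ${1} --output-maf out/${2}.maf"

# {1} and {2} mean nothing to the shell, so they survive unchanged
# and GNU Parallel can substitute them per job.
fixed="perl vcf2maf.pl --input-vcf {1} --output-maf out/{2}.maf"

printf 'broken: %s\n' "$broken"
printf 'fixed:  %s\n' "$fixed"

# With GNU Parallel installed, the fixed template can be previewed with:
#   parallel --dryrun --link "$fixed" ::: a.vcf b.vcf ::: a b
```

The "broken" line comes out with an empty --input-vcf value and an output path of out/.maf, which matches the error vcf2maf reported in the question.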