I am trying to use GNU Parallel to convert individual files with a bioinformatics tool called vcf2maf.
My command looks something like this:
${parallel} --link "perl ${vcf2maf} --input-vcf ${1} \
--output-maf ${maf_dir}/${2}.maf \
--tumor-id ${3} \
--tmp-dir ${vcf_dir} \
--vep-path ${vep_script} \
--vep-data ${vep_data} \
--ref-fasta ${fasta} \
--filter-vcf ${filter_vcf}" :::: ${VCF_files} ${results} ${tumor_ids}
VCF_files, results and tumor_ids each contain one entry per line, and the entries correspond to one another.
When I try and run the command I get the following error for every file:
ERROR: Both input-vcf and output-maf must be defined!
This confused me, because if I run the command manually, the program works as intended, so I don't think that the input/output paths are wrong. To confirm this, I also ran
${parallel} --link "cat ${1}" :::: ${VCF_files} ${results} ${tumor_ids}
which correctly prints the contents of the VCF files whose paths are listed in VCF_files.
I am really confused about what I did wrong. If anyone could help me out, I'd be very thankful!
Thanks!
For a command this long I would normally define a function:
doit() {
...
}
export -f doit
Then test this on a single input.
When it works:
parallel --link doit :::: ${VCF_files} ${results} ${tumor_ids}
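As a rough sketch of what such a function might look like: the paths below are made-up placeholders, and the leading echo is my addition so that the function only prints the command while you test it. Note that inside a function $1, $2 and $3 really are the shell's positional parameters, and that any variable the function uses has to be exported, because GNU Parallel runs the exported function in a new shell.

```shell
#!/usr/bin/env bash
# Hypothetical placeholder values; substitute your real paths.
# They are exported so the shells started by GNU Parallel can see them.
export vcf2maf=vcf2maf.pl maf_dir=maf vcf_dir=vcf
export vep_script=vep vep_data=vep_data fasta=ref.fa filter_vcf=filter.vcf

doit() {
  # $1 = VCF path, $2 = output base name, $3 = tumor ID
  local vcf="$1" name="$2" tumor="$3"
  # echo only prints the command; remove it once the output looks right
  echo perl "$vcf2maf" --input-vcf "$vcf" \
    --output-maf "$maf_dir/$name.maf" \
    --tumor-id "$tumor" \
    --tmp-dir "$vcf_dir" \
    --vep-path "$vep_script" \
    --vep-data "$vep_data" \
    --ref-fasta "$fasta" \
    --filter-vcf "$filter_vcf"
}
export -f doit

# Test on a single (hypothetical) input before handing it to parallel
doit sample1.vcf sample1 TUMOR_01
```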
But if you want to use a single command it will look something like:
${parallel} --link "perl ${vcf2maf} --input-vcf {1} \
--output-maf ${maf_dir}/{2}.maf \
--tumor-id {3} \
--tmp-dir ${vcf_dir} \
--vep-path ${vep_script} \
--vep-data ${vep_data} \
--ref-fasta ${fasta} \
--filter-vcf ${filter_vcf}" :::: ${VCF_files} ${results} ${tumor_ids}
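GNU Parallel's replacement strings are {1}, {2}, and {3}, not ${1}, ${2}, and ${3}. Inside double quotes the shell expands ${1}, ${2}, and ${3} as its own positional parameters (usually empty) before GNU Parallel ever sees the command.
--dryrun is your friend when GNU Parallel does not do what you expect it to do.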
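To see why the original command fails, here is a small sketch of what the shell does to the two spellings before GNU Parallel receives the string. The script name vcf2maf.pl and the out/ directory are placeholders, and `set --` just mimics a shell with no positional parameters:

```shell
# Clear the positional parameters, as in a typical interactive shell
set --

# ${1} and ${2} are expanded by the shell (to nothing) inside double quotes
wrong="perl vcf2maf.pl --input-vcf ${1} --output-maf out/${2}.maf"

# {1} and {2} mean nothing to the shell, so they survive for GNU Parallel
right="perl vcf2maf.pl --input-vcf {1} --output-maf out/{2}.maf"

printf '%s\n' "$wrong"
printf '%s\n' "$right"
```

The first string reaches vcf2maf with no input file and an output of out/.maf, which fits the "must be defined" error from the question.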
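For example, something along these lines should print the commands GNU Parallel would run without running them. The input files and their contents are invented for the illustration, and -k is only there to keep the output in input order:

```shell
# Hypothetical input lists, one entry per line
tmp=$(mktemp -d)
printf '%s\n' a.vcf b.vcf > "$tmp/vcfs"
printf '%s\n' a b > "$tmp/names"
printf '%s\n' T1 T2 > "$tmp/ids"

if command -v parallel >/dev/null 2>&1; then
  # Single quotes keep the shell away from the command template;
  # --dryrun prints each generated command instead of executing it
  parallel -k --dryrun --link \
    'perl vcf2maf.pl --input-vcf {1} --output-maf maf/{2}.maf --tumor-id {3}' \
    :::: "$tmp/vcfs" "$tmp/names" "$tmp/ids"
fi

rm -r "$tmp"
```

If the printed lines show empty arguments or unexpected paths, the problem is in the command template rather than in vcf2maf.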