I am trying to integrate a bash script into my nextflow pipeline. The bash script reads bed files from two directories and performs intersect on them. My directory structure before running the pipeline is as follows:
results
├── extract_gimme
│ ├── gimme_scan_Ss1
│ ├── gimme_scan_Ss2
│ ├── gimme_scan_Ss3
│ ├── gimme_scan_Ss4
│ └── gimme_scan_Ss5
├── gimme_scan
├── motifs_nr_coord
My bash script is as follows:
# get list of extract_gimme/ & write to txt
ls $1 > dir_list.txt
echo "intersect starts here!"
echo "------------------------"
echo " "
#now run loop for each dir inside dir_list.txt
while read DIR;
do
# declare the two directories for intersect
bed_dir="$1$DIR"
ls $bed_dir > bed_files.txt
coord_dir="results/motifs_nr_coord"
#create directory to save outputs for each DIR
DIR_A="results/motifs_loc"
DIR_B=$(echo $DIR| cut -d'_' -f 3) #this will get Ss[12345]
motif_dir="$DIR_A"/"$DIR_B"
mkdir -p $motif_dir # to output results
# process each sub dir using loop & run intersect
while read BED;
do
bedtools intersect -a $bed_dir"/"$BED -b $coord_dir"/"$BED -wa | sort | uniq > $motif_dir"/"$BED"_intersected.bed"
done < bed_files.txt
rm bed_files.txt
echo "sample processing for $DIR_B done!!"
echo " "
echo "**********************"
done < dir_list.txt
rm dir_list.txt
below is my very dry nextflow pipeline:
#!/usr/bin/env nextflow
nextflow.enable.dsl=2
params.gimme_scan_dir = "results/extract_gimme"
process INTERSECT {
publishDir "results", mode: 'copy', overwrite: false
input:
path scan_dir
output:
path '*'
script:
"""
$baseDir/intersect_scan_coord.sh $scan_dir
"""
}
workflow {
scan_dir = Channel.fromPath(params.gimme_scan_dir, type: 'dir')
INTERSECT(scan_dir)
}
I use this command to run the bash on my terminal:
. myscript.sh results/extract_gimme
The script runs fine on terminal and outputs intersect of bed files from each of the sub dir in extract_gimme (i.e. Ss1, Ss2...) with motifs_nr_coord.
But when I use the same script inside the nextflow pipeline, it doesn't output the intersected files, and rather creates the empty but expected directories.
nextflow run test.nf results/extract_gimme
Here is the directory structure after I run the nextflow pipeline.
results
├── extract_gimme
│ ├── gimme_scan_Ss1
│ ├── gimme_scan_Ss2
│ ├── gimme_scan_Ss3
│ ├── gimme_scan_Ss4
│ └── gimme_scan_Ss5
├── gimme_scan
├── motifs_nr_coord
└── results
└── motifs_loc
├── Ss1
├── Ss2
├── Ss3
├── Ss4
└── Ss5
I am assuming need to include bed files into input channel but don't know how as it involves dir and sub dir.
I could use some pointers on how to implement the bash script into my nextflow pipeline.
thanks for your help.
I think a simpler solution is to just pass in the gimme_scan BED files and coordinate BED files and create a channel for each. Note that we can use the fromFilePairs
factory method to try to extract the sample/scan ID from the parent directory. I've just used a shell
block here to embed the script, but you could replace this for a script block that runs a shell script like what you had originally:
params.gimme_scan_bed_files = './results/extract_gimme/*/*.bed'
params.coord_bed_files = './results/motifs_nr_coord/*.bed'
params.outdir = './outdir'
process INTERSECT {
tag { scan_id }
publishDir "${params.outdir}/intersect", mode: 'copy'
input:
tuple val(scan_id), path('bed_dir/*')
path 'coord_dir/*'
output:
tuple val(scan_id), path("${scan_id}/*")
shell:
'''
mkdir "!{scan_id}"
for bed in bed_dir/*; do
bedtools intersect \\
-a "${bed}" \\
-b "coord_dir/$(basename "${bed}")" \\
-wa |
sort \\
-u \\
> "!{scan_id}/$(basename "${bed}" '.bed')_intersected.bed"
done
'''
}
workflow {
Channel
.fromFilePairs( params.gimme_scan_bed_files, size: -1) {
it.parent.name.substring(it.parent.name.lastIndexOf('_') + 1)
}
.set { gimme_scan_bed_files }
Channel
.fromPath( params.coord_bed_files )
.collect()
.set { coord_bed_files }
INTERSECT( gimme_scan_bed_files, coord_bed_files )
INTERSECT.out.view()
}