Search code examples
bashpipelinenextflow

How to call bash script in nextflow and manage multiple inputs


I am trying to integrate a bash script into my nextflow pipeline. The bash script reads bed files from two directories and performs intersect on them. My directory structure before running the pipeline is as follows:

results
├── extract_gimme
│   ├── gimme_scan_Ss1
│   ├── gimme_scan_Ss2
│   ├── gimme_scan_Ss3
│   ├── gimme_scan_Ss4
│   └── gimme_scan_Ss5
├── gimme_scan
├── motifs_nr_coord

My bash script is as follows:

# get list of extract_gimme/ & write to txt
ls $1 > dir_list.txt

echo "intersect starts here!"
echo "------------------------"
echo " "

#now run loop for each dir inside dir_list.txt
while read DIR;
do

# declare the two directories for intersect
bed_dir="$1$DIR"
ls $bed_dir > bed_files.txt
coord_dir="results/motifs_nr_coord"

#create directory to save outputs for each DIR
DIR_A="results/motifs_loc"
DIR_B=$(echo $DIR| cut -d'_' -f 3) #this will  get Ss[12345]
motif_dir="$DIR_A"/"$DIR_B"
mkdir -p $motif_dir # to output results

# process each sub dir using loop & run intersect
  while read BED;
  do
   bedtools intersect -a $bed_dir"/"$BED -b $coord_dir"/"$BED -wa | sort | uniq > $motif_dir"/"$BED"_intersected.bed"
  done < bed_files.txt

rm bed_files.txt
echo "sample processing for $DIR_B done!!"
echo " "
echo "**********************"

done  < dir_list.txt
rm dir_list.txt

below is my very dry nextflow pipeline:

 #!/usr/bin/env nextflow
    nextflow.enable.dsl=2
    
    params.gimme_scan_dir = "results/extract_gimme"
    
    process INTERSECT {
        publishDir "results", mode: 'copy', overwrite: false
    
        input:
        path scan_dir
        
        output:
        path '*'
    
        script:
        """
        $baseDir/intersect_scan_coord.sh $scan_dir
        """
    }
    
    workflow {
        scan_dir = Channel.fromPath(params.gimme_scan_dir, type: 'dir')
        INTERSECT(scan_dir)
    
    }

I use this command to run the bash on my terminal:

. myscript.sh results/extract_gimme

The script runs fine on terminal and outputs intersect of bed files from each of the sub dir in extract_gimme (i.e. Ss1, Ss2...) with motifs_nr_coord.

But when I use the same script inside the nextflow pipeline, it doesn't output the intersected files, and rather creates the empty but expected directories.

nextflow run test.nf results/extract_gimme

Here is the directory structure after I run the nextflow pipeline.

results
├── extract_gimme
│   ├── gimme_scan_Ss1
│   ├── gimme_scan_Ss2
│   ├── gimme_scan_Ss3
│   ├── gimme_scan_Ss4
│   └── gimme_scan_Ss5
├── gimme_scan
├── motifs_nr_coord
└── results
    └── motifs_loc
        ├── Ss1
        ├── Ss2
        ├── Ss3
        ├── Ss4
        └── Ss5

I am assuming need to include bed files into input channel but don't know how as it involves dir and sub dir.

I could use some pointers on how to implement the bash script into my nextflow pipeline.

thanks for your help.


Solution

  • I think a simpler solution is to just pass in the gimme_scan BED files and coordinate BED files and create a channel for each. Note that we can use the fromFilePairs factory method to try to extract the sample/scan ID from the parent directory. I've just used a shell block here to embed the script, but you could replace this for a script block that runs a shell script like what you had originally:

    params.gimme_scan_bed_files = './results/extract_gimme/*/*.bed'
    params.coord_bed_files = './results/motifs_nr_coord/*.bed'
    
    params.outdir = './outdir'
    
    process INTERSECT {
    
        tag { scan_id }
    
        publishDir "${params.outdir}/intersect", mode: 'copy'
    
        input:
        tuple val(scan_id), path('bed_dir/*')
        path 'coord_dir/*'
    
        output:
        tuple val(scan_id), path("${scan_id}/*")
    
        shell:
        '''
        mkdir "!{scan_id}"
        for bed in bed_dir/*; do
    
            bedtools intersect \\
                -a "${bed}" \\
                -b "coord_dir/$(basename "${bed}")" \\
                -wa |
            sort \\
                -u \\
                > "!{scan_id}/$(basename "${bed}" '.bed')_intersected.bed"
        done
        '''
    }
    
    workflow {
    
        Channel
            .fromFilePairs( params.gimme_scan_bed_files, size: -1) {
                it.parent.name.substring(it.parent.name.lastIndexOf('_') + 1)
            }
            .set { gimme_scan_bed_files }
    
        Channel
            .fromPath( params.coord_bed_files )
            .collect()
            .set { coord_bed_files }
    
         INTERSECT( gimme_scan_bed_files, coord_bed_files )
    
         INTERSECT.out.view()
    }