How to call bash script in nextflow and manage multiple inputs

I am trying to integrate a bash script into my nextflow pipeline. The bash script reads bed files from two directories and performs intersect on them. My directory structure before running the pipeline is as follows:

results
├── extract_gimme
│   ├── gimme_scan_Ss1
│   ├── gimme_scan_Ss2
│   ├── gimme_scan_Ss3
│   ├── gimme_scan_Ss4
│   └── gimme_scan_Ss5
├── gimme_scan
├── motifs_nr_coord

My bash script is as follows:

# get list of extract_gimme/ & write to txt
ls $1 > dir_list.txt

echo "intersect starts here!"
echo "------------------------"
echo " "

#now run loop for each dir inside dir_list.txt
while read DIR;
do

# declare the two directories for intersect
bed_dir="$1$DIR"
ls $bed_dir > bed_files.txt
coord_dir="results/motifs_nr_coord"

#create directory to save outputs for each DIR
DIR_A="results/motifs_loc"
DIR_B=$(echo $DIR| cut -d'_' -f 3) #this will  get Ss[12345]
motif_dir="$DIR_A"/"$DIR_B"
mkdir -p $motif_dir # to output results

# process each sub dir using loop & run intersect
  while read BED;
  do
   bedtools intersect -a $bed_dir"/"$BED -b $coord_dir"/"$BED -wa | sort | uniq > $motif_dir"/"$BED"_intersected.bed"
  done < bed_files.txt

rm bed_files.txt
echo "sample processing for $DIR_B done!!"
echo " "
echo "**********************"

done  < dir_list.txt
rm dir_list.txt

below is my very dry nextflow pipeline:

 #!/usr/bin/env nextflow
    nextflow.enable.dsl=2
    
    params.gimme_scan_dir = "results/extract_gimme"
    
    process INTERSECT {
        publishDir "results", mode: 'copy', overwrite: false
    
        input:
        path scan_dir
        
        output:
        path '*'
    
        script:
        """
        $baseDir/intersect_scan_coord.sh $scan_dir
        """
    }
    
    workflow {
        scan_dir = Channel.fromPath(params.gimme_scan_dir, type: 'dir')
        INTERSECT(scan_dir)
    
    }

I use this command to run the bash on my terminal:

. myscript.sh results/extract_gimme

The script runs fine on terminal and outputs intersect of bed files from each of the sub dir in extract_gimme (i.e. Ss1, Ss2...) with motifs_nr_coord.

But when I use the same script inside the nextflow pipeline, it doesn't output the intersected files, and rather creates the empty but expected directories.

nextflow run test.nf results/extract_gimme

Here is the directory structure after I run the nextflow pipeline.

results
├── extract_gimme
│   ├── gimme_scan_Ss1
│   ├── gimme_scan_Ss2
│   ├── gimme_scan_Ss3
│   ├── gimme_scan_Ss4
│   └── gimme_scan_Ss5
├── gimme_scan
├── motifs_nr_coord
└── results
    └── motifs_loc
        ├── Ss1
        ├── Ss2
        ├── Ss3
        ├── Ss4
        └── Ss5

I am assuming need to include bed files into input channel but don't know how as it involves dir and sub dir.

I could use some pointers on how to implement the bash script into my nextflow pipeline.

thanks for your help.

Solution

I think a simpler solution is to just pass in the gimme_scan BED files and coordinate BED files and create a channel for each. Note that we can use the fromFilePairs factory method to try to extract the sample/scan ID from the parent directory. I've just used a shell block here to embed the script, but you could replace this for a script block that runs a shell script like what you had originally:

params.gimme_scan_bed_files = './results/extract_gimme/*/*.bed'
params.coord_bed_files = './results/motifs_nr_coord/*.bed'

params.outdir = './outdir'

process INTERSECT {

    tag { scan_id }

    publishDir "${params.outdir}/intersect", mode: 'copy'

    input:
    tuple val(scan_id), path('bed_dir/*')
    path 'coord_dir/*'

    output:
    tuple val(scan_id), path("${scan_id}/*")

    shell:
    '''
    mkdir "!{scan_id}"
    for bed in bed_dir/*; do

        bedtools intersect \\
            -a "${bed}" \\
            -b "coord_dir/$(basename "${bed}")" \\
            -wa |
        sort \\
            -u \\
            > "!{scan_id}/$(basename "${bed}" '.bed')_intersected.bed"
    done
    '''
}

workflow {

    Channel
        .fromFilePairs( params.gimme_scan_bed_files, size: -1) {
            it.parent.name.substring(it.parent.name.lastIndexOf('_') + 1)
        }
        .set { gimme_scan_bed_files }

    Channel
        .fromPath( params.coord_bed_files )
        .collect()
        .set { coord_bed_files }

     INTERSECT( gimme_scan_bed_files, coord_bed_files )

     INTERSECT.out.view()
}