
Renaming .fromFilePairs with regex capture group in closure


I'm new to Nextflow/Groovy/Java and I'm running into some difficulty with a simple regular expression task.

I'm trying to alter the labels of some file pairs. It is my understanding that fromFilePairs returns a data structure of the form:

[
    [common_prefix, [file1, file2]],
    [common_prefix, [file3, file4]]
]

I further thought that:

  • The .name method, when invoked on an item from this list, gives the name (what I have labelled above as common_prefix).
  • The value returned by a closure used with fromFilePairs sets the names of the file pairs.
  • The value of it in a closure used with fromFilePairs is a single item from the list of file pairs.

However, I have tried many variants on the following without success:

params.fastq = "$baseDir/data/fastqs/*_{1,2}_*.fq.gz"

Channel
    .fromFilePairs(params.fastq, checkIfExists:true) {
        file -> 
            // println file.name // returned the common file prefix as I expected
            mt = file.name =~ /(common)_(prefix)/
            // println mt 
            // # java.util.regex.Matcher[pattern=(common)_(prefix) region=0,47 lastmatch=]
            // match objects appear empty despite testing with regexes I know to work correctly, including simple stuff like (.*) to rule out issues with my regex
            // println mt.group(0) // #No match found
            mt.group(0) // or a composition like mt.group(0) + "-" + mt.group(1)
    }
    .view()

I've also tried some variant on this using the replaceAll method.

I've consulted the documentation for Nextflow, Groovy and Java and I still can't figure out what I'm missing. I expect it's some stupid syntactic thing or a misunderstanding of the data structure, but I'm tired of banging my head against it when it's probably obvious to someone who knows the language better. I'd appreciate it if anyone could enlighten me on how this works.


Solution

  • A closure can be provided to the fromFilePairs channel factory to implement a custom file pair grouping strategy. The closure is called once per file (not once per pair) and should return the grouping key for that file. The example in the docs just groups the files by their file extension:

    Channel
        .fromFilePairs('/some/data/*', size: -1) { file -> file.extension }
        .view { ext, files -> "Files with the extension $ext are $files" }
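
    As for the "No match found" error in the question: in Groovy, =~ only creates a java.util.regex.Matcher. No match is attempted until find() or matches() is called, so calling group() straight away fails. Below is a minimal plain-Groovy sketch of a grouping-key function that runs the match first. The helper name groupingKey, the pattern and the sampleA_1_L001.fq.gz file naming are assumptions made for illustration, not something taken from the question's data.

    ```groovy
    // Hypothetical helper: derives a grouping key from a single file name.
    // The pattern and the example file names are assumptions.
    def groupingKey(String fileName) {
        // =~ only builds the Matcher; nothing has been matched yet, so
        // calling mt.group(0) at this point would throw "No match found".
        def mt = fileName =~ /(.+)_[12]_(.+)\.fq\.gz/

        // matches() (or find()) actually runs the regex and populates the groups.
        if (mt.matches()) {
            return mt.group(1) + '-' + mt.group(2)
        }
        return fileName   // fall back to the unchanged name when nothing matches
    }

    println groupingKey('sampleA_1_L001.fq.gz')   // sampleA-L001
    println groupingKey('sampleA_2_L001.fq.gz')   // sampleA-L001
    ```

    Both mates of a pair have to map to the same key, which is why the read number (1 or 2) is matched but left out of the result. Inside the fromFilePairs closure this would presumably be used as { file -> groupingKey(file.name) }.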
    

    A custom grouping closure isn't necessary if all you want to do is alter the labels of some file pairs; you can use the map operator for that. fromFilePairs emits tuples in which the first element is the grouping key of the matching pair and the second element is the list of files (sorted lexicographically):

    Channel
        .fromFilePairs(params.fastq, checkIfExists:true)
        .map { group_key, files ->
            tuple( group_key.replaceAll(/common_prefix/, ""), files )
        }
        .view()
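
    If the new label should be built from capture groups, as in the question, replaceAll accepts group references such as $1 and $2 in the replacement string. The following is a plain-Groovy sketch rather than a Nextflow script: the pairs list is a made-up stand-in for what fromFilePairs emits, and collect plays the role of the map operator.

    ```groovy
    // Stand-in for the tuples emitted by fromFilePairs: [grouping_key, [file1, file2]].
    // The key and file names are invented for illustration.
    def pairs = [
        ['common_prefix', ['common_prefix_1_x.fq.gz', 'common_prefix_2_x.fq.gz']],
    ]

    // collect is used here in place of the channel's map operator.
    def renamed = pairs.collect { groupKey, files ->
        // '$2' and '$1' refer to the capture groups; the single quotes stop
        // Groovy from treating '$' as GString interpolation.
        [groupKey.replaceAll(/(common)_(prefix)/, '$2-$1'), files]
    }

    println renamed
    ```

    In a real pipeline the same replaceAll call should be able to go inside the .map closure shown above, returning tuple(newKey, files) instead of a plain list.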