Tags: r, data-science, bioinformatics, mlops

TaskRun failed to finish due to an error for Coretex BioInformatics workflow


After starting a bioinformatics workflow in Coretex, I get the following error even though the data seems to be in order: "Failed to determine which column contains sampleIDs/names...", followed by the list of available names. The column name I am using is one of the names in that list.

I am trying to run a microbiome sequencing task in Coretex with standard microbiome sequencing data in .fastq.gz format. The run should succeed, but it fails every time.

This is the R code I've been working with for loading the metadata:

loadMetadata <- function(metadataSample) {
    metadata_csv_path <- builtins$str(
        metadataSample$joinPath("metadata.csv")
    )

    if (file.exists(metadata_csv_path)) {
        # Default SampleSheet.csv format
        metadata <- read.table(
            metadata_csv_path,
            sep = ",",
            header = TRUE,
            check.names = TRUE
        )
    } else {
        # Format accepted by qiime2
        metadata_tsv_path <- builtins$str(
            metadataSample$joinPath("metadata.tsv")
        )

        if (!file.exists(metadata_tsv_path)) {
            stop("Metadata file not found")
        }

        metadata <- read.table(
            metadata_tsv_path,
            sep = "\t",
            header = TRUE,
            check.names = TRUE
        )

        # qiime has 1 extra row after header which contains types
        metadata <- metadata[-1, ]
    }

    # Remove leading and trailing whitespace from the column names
    colnames(metadata) <- trimws(colnames(metadata))

    # Trim string columns; list-style indexing also works with a single string column
    stringColumns <- names(metadata)[vapply(metadata, is.character, logical(1))]
    metadata[stringColumns] <- lapply(metadata[stringColumns], trimws)

    sampleIdColumn <- getSampleIdColumnName(metadata)
    print(paste("Matched metadata sample ID/name column to", sampleIdColumn))

    print("Renaming metadata sample ID/name column to \"sampleId\"")
    names(metadata)[names(metadata) == sampleIdColumn] <- "sampleId"

    print("Metadata")
    print(colnames(metadata))
    print(head(metadata))

    print(metadata$sampleId)

    # Assign the names of samples (01Sat1...) to metadata rows instead of 1,2,3...
    row.names(metadata) <- metadata$sampleId
    metadata$sampleId <- as.factor(metadata$sampleId)

    return(metadata)
}


Solution

  • Judging by the logs of your Coretex Workflow, your Dataset contains a metadata.csv file that uses ; as the separator, but the Coretex Task that loads BioInformatics data reads it with , as the separator. With the wrong separator the header line is most likely parsed as a single column, so none of the resulting column names can match a sample ID/name column. This was changed in the latest version of the Task and you can see the full changelog here.

    Instead of always forcing the separator to be , (old version):

        # Default SampleSheet.csv format
        metadata <- read.table(
            metadata_csv_path,
            sep = ",",
            header = TRUE,
            check.names = TRUE
        )
    

    It now tries to detect the separator automatically using the fread function from the data.table package (new version):

        metadata <- fread(metadata_csv_path, data.table=FALSE)
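
To see the difference between the two versions, here is a minimal sketch that reproduces it on a throwaway file. The file contents and the variable names old_metadata and new_metadata are made up for illustration and are not taken from the Task or from your Dataset. The fread part only runs if the data.table package is installed.

```r
# Hypothetical metadata.csv that uses ";" as the separator (made-up sample data)
metadata_csv_path <- tempfile(fileext = ".csv")
writeLines(
    c("sampleId;group;ph", "01Sat1;control;7.1", "02Sat2;treated;6.8"),
    metadata_csv_path
)

# Old version: forcing sep = "," leaves the whole header in a single column,
# so there is no column named "sampleId" to match
old_metadata <- read.table(
    metadata_csv_path,
    sep = ",",
    header = TRUE,
    check.names = TRUE
)
print(ncol(old_metadata))
print(colnames(old_metadata))

# New version: fread() detects the separator on its own
if (requireNamespace("data.table", quietly = TRUE)) {
    new_metadata <- data.table::fread(metadata_csv_path, data.table = FALSE)
    print(colnames(new_metadata))
}
```

If updating the Task is not an option, re-saving metadata.csv with , as the separator should have the same effect, since the old version would then split the header into the expected columns.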