scala, apache-spark, apache-spark-sql, bioinformatics, fastq

Read FASTQ file into a Spark dataframe


I'm trying to read FASTQ files into Spark dataframes. I'm having difficulty because FASTQ is a multi-line format: each record spans four lines.

Example:

@seq1
AGTCAGTCGAC
+
?@@FFBFFDDH
@seq2
CCAGCGTCTCG
+
?88ADA?BDF8

Is there a way to get this data into a Spark dataframe like the following?

+-------------+-------------+------------+
| identifier  | sequence    | quality    |
+-------------+-------------+------------+
|seq1         |AGTCAGTCGAC  |?@@FFBFFDDH |
|seq2         |CCAGCGTCTCG  |?88ADA?BDF8 |
+-------------+-------------+------------+

Thanks for your time


Solution

  • I'd use sliding from MLlib's RDDFunctions to group every four consecutive lines into one record:

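    import org.apache.spark.mllib.rdd.RDDFunctions._
    import spark.implicits._  // encoders for createDataset; already in scope in spark-shell

    // sliding(4, 4): window of 4 lines, step of 4, so each window is one FASTQ record
    spark.createDataset(sc.textFile(path).sliding(4, 4).map {
      // skip the "+" separator line; id.stripPrefix("@") would drop the leading '@'
      case Array(id, seq, _, qual) => (id, seq, qual)
    }).toDF("identifier", "sequence", "quality")
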
    // +----------+-----------+-----------+
    // |identifier|   sequence|    quality|
    // +----------+-----------+-----------+
    // |     @seq1|AGTCAGTCGAC|?@@FFBFFDDH|
    // |     @seq2|CCAGCGTCTCG|?88ADA?BDF8|
    // +----------+-----------+-----------+
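
The output above keeps the leading "@" in the identifier, whereas the table in the question shows it without. As a minimal sketch of how each 4-line window could be validated and cleaned up, here is a plain-Scala helper with no Spark dependency. The names FastqRecord, FastqWindow, toRecord and parse are hypothetical, not part of Spark or any FASTQ library, and it assumes the simple layout from the question, where every record is exactly four lines.

```scala
// Hypothetical record type; the field names mirror the columns in the question.
final case class FastqRecord(identifier: String, sequence: String, quality: String)

object FastqWindow {
  // Turns one 4-line window into a record, or None if the window is malformed.
  // Assumption: one record is exactly header, sequence, "+" separator, quality.
  def toRecord(window: Seq[String]): Option[FastqRecord] = window match {
    case Seq(header, sequence, separator, quality)
        if header.startsWith("@") &&
          separator.startsWith("+") &&
          sequence.length == quality.length =>
      // Drop the leading '@' so the identifier matches the table in the question.
      Some(FastqRecord(header.stripPrefix("@"), sequence, quality))
    case _ => None
  }

  // Splits the lines into non-overlapping groups of four (analogous to
  // sliding(4, 4) on the RDD) and keeps only the well-formed records.
  def parse(lines: Iterator[String]): List[FastqRecord] =
    lines.grouped(4).flatMap(w => toRecord(w).toList).toList
}
```

In the Spark version, the same check could probably replace the pattern match by calling toRecord on each window and discarding the None results. Note that this, like the sliding approach itself, does not handle FASTQ files whose sequences wrap across several lines.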