I'm trying to read FASTQ files into Spark DataFrames. I'm having difficulty because FASTQ is a multi-line format: each record spans four lines.
Example:
@seq1
AGTCAGTCGAC
+
?@@FFBFFDDH
@seq2
CCAGCGTCTCG
+
?88ADA?BDF8
Is there a way to get this data into a Spark DataFrame like this?
+------------+-------------+-------------+
| identifier | sequence    | quality     |
+------------+-------------+-------------+
| seq1       | AGTCAGTCGAC | ?@@FFBFFDDH |
| seq2       | CCAGCGTCTCG | ?88ADA?BDF8 |
+------------+-------------+-------------+
Thanks for your time
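I'd use sliding from MLlib's RDDFunctions to group the file into consecutive windows of four lines, one window per record: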
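import org.apache.spark.mllib.rdd.RDDFunctions._
import spark.implicits._ // already in scope in spark-shell

// Window size 4, step 4: each window holds one four-line FASTQ record.
spark.createDataset(sc.textFile(path).sliding(4, 4).map {
  case Array(id, seq, _, qual) => (id, seq, qual) // drop the "+" separator line
}).toDF("identifier", "sequence", "quality")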
// +----------+-----------+-----------+
// |identifier| sequence| quality|
// +----------+-----------+-----------+
// | @seq1|AGTCAGTCGAC|?@@FFBFFDDH|
// | @seq2|CCAGCGTCTCG|?88ADA?BDF8|
// +----------+-----------+-----------+
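Note that the identifier column above still carries the leading "@", while the question asks for seq1 and seq2. A minimal sketch of one way to strip it and skip malformed windows follows. It assumes every record is exactly four lines (no wrapped sequences, no blank lines) and that the file is uncompressed; parseRecord and FastqToDataFrame are hypothetical names introduced here for illustration, not part of Spark.

```scala
import org.apache.spark.mllib.rdd.RDDFunctions._
import org.apache.spark.sql.SparkSession

object FastqToDataFrame {
  // Hypothetical helper: turns one four-line window into a row, or None
  // if the window does not look like a FASTQ record. Fields are matched
  // by position, because a quality line may itself start with "@".
  def parseRecord(lines: Array[String]): Option[(String, String, String)] =
    lines match {
      case Array(id, seq, sep, qual)
          if id.startsWith("@") && sep.startsWith("+") && seq.length == qual.length =>
        Some((id.stripPrefix("@"), seq, qual))
      case _ => None
    }

  def main(args: Array[String]): Unit = {
    // Assumption: local mode, for trying the sketch out on a small file.
    val spark = SparkSession.builder().appName("fastq").master("local[*]").getOrCreate()
    import spark.implicits._

    val path = args(0) // assumed: path to a FASTQ file, passed as the first argument
    val df = spark.sparkContext
      .textFile(path)
      .sliding(4, 4)                              // one window per four-line record
      .flatMap(lines => parseRecord(lines).toSeq) // keep only well-formed records
      .toDF("identifier", "sequence", "quality")

    df.show()
    spark.stop()
  }
}
```

With the two records from the question, the identifier column should then hold seq1 and seq2 without the "@". Silently dropping windows that fail the check is a design choice; if a single stray line would shift every following record out of frame, failing loudly may be the better option.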