I am new to c++ and I need help when reading and changing lines from an input file to use the next lines and saving it into another output file.
I have an example of a single DNA sequence stored in .fastq format, with the following structure.
@Read_1
AGACUUUACGCT
+
++//187-,/02
So each DNA sequence has four lines worth of information.
My goal is to split up the DNA string (line 2, length 12) into different fragments of random length and save each fragment as a separate new sequence. But to keep the .fastq structure I need to keep the information from lines 3 and 4! So the ideal output would be:
@Read_1_1
AGAC
+
++//
@Read_1_2
UU
+
18
@Read_1_3
UACGCT
+
7-,/02
In this ideal output, line 4 from the input has been split to match each DNA fragment (but i can do this with substr, so that's not a problem). My problem is when i am splitting the DNA sequences (line 2) and saving them as a new read, i need the information from line 3 and 4.
I am coding in C++ and i have made some functions which works and made some different attempts which fails:
When I'm opening the file I have made a function (DNA_fragmentation) which randomly splits the DNA (line2) into some fragments, like so:
AGAC
UU
UACGCT
So when I am using this function I am reading line 2, then saving these fragments into a std::vectorstd::string and using a for loop to save these fragments and their read (from line 1) into a new file, giving me the input:
@Read_1_1
AGAC
@Read_1_2
UU
@Read_1_3
UACGCT
My issue is I don't know how to add line 3 and line 4 for each of the new fragments, since they are created when im opening and reading line 2 from the original file. How do i extract information from the next lines?
To read the file and seperate the functions im using this:
#include <iostream>
#include <fstream>
#include <string>
#include <vector>
std::string fafq_seq(std::string in_name, std::string out_name) {
std::ifstream myfile(in_name);
std::ofstream out_file(out_name);
if (myfile.is_open() && out_file.is_open())
{
std::string line;
while( std::getline(myfile,line) )
{
int ID = 1;
std::string read_ID;
// This is line 1, which always match with @
if (line.rfind("@",0) == 0) {
continue;
} // Then reading line 2 the DNA sequence
else if (line.rfind("A", 0) == 0 ||
line.rfind("T", 0) == 0 ||
line.rfind("G", 0) == 0 ||
line.rfind("C", 0) == 0){
std::string Seq = line;
// creating a vector with each of the DNA pieces using my DNA_fragmentation function
std::vector<std::string> Damage = DNA_fragmentation(Seq,2,8);
// For each fragment im adding a new read and saving the output
for (int i=0; i<Damage.size();i++){
// adding what corresponds to line 1 starting with @
out_file << "@Read_" << ID << "_" << i+1 << std::endl;
// adding the DNA pieces
out_file << Damage[i] << std::endl;
}
ID += 1;
}
else {
// iterating through line 3 and 4, which is where im not sure how to handle my problem
out_file << line << std::endl;
}
}
out_file.close();
myfile.close();
}
}
int main() {
std::string File = "TestSeq.fastq";
fafq_seq(File,"Test_out.fastq");
return 0;
}
I know its a long question and a bit difficult for me to explain further, but i hope this problem makes sense. But just any comments or help would be appreciated. Thanks.
I think you could make your task much easier overall by reading in the full fastq fragment first, then split it into multiple fragments, and then finally output it all again.
If you make a struct for the fragment and add input and output operators (operator>>
and operator<<
) for it, then you can do the reading and writing in a very simple way:
#include <fstream>
#include <iostream>
#include <string>
#include <vector>
struct FastqFragment
{
std::string ID;
std::string sequence;
std::string delim;
std::string quality_value;
};
std::istream& operator>>(std::istream& in, FastqFragment& frag)
{
std::getline(in, frag.ID);
if (frag.ID.size() == 0 || frag.ID[0] != '@') {
in.setstate(std::ios_base::failbit);
return in;
}
std::getline(in, frag.sequence);
std::getline(in, frag.delim);
if (frag.delim.size() == 0 || frag.delim[0] != '+') {
in.setstate(std::ios_base::failbit);
return in;
}
std::getline(in, frag.quality_value);
return in;
}
std::ostream& operator<<(std::ostream& out, const FastqFragment& frag)
{
out << frag.ID << '\n';
out << frag.sequence << '\n';
out << frag.delim << '\n';
out << frag.quality_value << '\n';
return out;
}
I tried to add some very basic validation to the read operator, as you can see. Now you can use it something like this:
int main()
{
std::ifstream in("sequence.txt");
std::vector<FastqFragment> frags;
for (FastqFragment tmp; in >> tmp;) {
frags.push_back(tmp);
}
// Insert code for mutating the fragments
for (const auto& f : frags)
std::cout << f;
// or
std::ofstream out("output.txt");
for (const auto& f : frags)
out << f;
}
Now your DNA_fragmentation
code can take a FastqFragment struct as argument and split all the strings that needs to be split at the same time.