My data looks something like this:
TGCCACAGGTTCCACACAACGGGACTTGGTTGAAATATTGAGATCCTTGGGGGTCTGTTAATCGGAGACAGTATCTCAACCGCAATAAACCC
GTTCACGGGCCTCACGCAACGGGGCCTGGCCTAGATATTGAGGCACCCAACAGCTCTTGGCCTGAGAGTGTTGTCTCGATCACGACGCCAGT
TGCCACAGGTTCCACACAACGGGACTTGGTTGAAATATTGAGATCCTTGGGGGTCTGTTAATCGAAGACAGTATCTCAACCGCAATAAACCT
TGCCACAGGTTCCACACAACGGGACTTGGTTGAAATATTGAGATCCTTGGGGGTCTGTTAATCGAAGACAGTATCTCAACCGCAATAAACCT
Each line contains one sequence. I want to build (key, value) pairs in which the key is a sequence and the value is 1, then use reduce_by_key to count how many times each sequence occurs.
But I found that a thrust::host_vector can only store one sequence: if I push_back a second sequence, the program crashes.
Here is my code:
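#include <fstream>
#include <iostream>
#include <string>
#include <thrust/host_vector.h>

using namespace std;

int main()
{
    ifstream input_subset("subset.txt");
    thrust::host_vector<string> h_output_subset;
    string s;
    // Read one sequence per line and append it to the host vector.
    while (getline(input_subset, s)) {
        h_output_subset.push_back(s);
    }
    cout << h_output_subset.size() << endl;
    return 0;
}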
Is it possible to store all of the data in a host_vector or a device_vector? Or is there another way to solve this problem?
The host_vector segfault was confirmed as a bug in thrust::uninitialized_copy, and a patch has been applied to fix it.
The problem with doing this in a device_vector is a genuine limitation of CUDA (there is no std::string support in device code) and can't be avoided. One alternative is to use a struct with a fixed-length char[] array as a data member as the element type of the device_vector. Another is to use a single large device_vector to hold all the string data, with a second device_vector holding the starting index of each sub-string within the character array.
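As a rough illustration of the second layout, here is a minimal host-side sketch in plain C++ using std::vector. The names PackedSequences, pack and sequence_at are hypothetical, and the extra sentinel offset at the end is my own addition so that each length can be computed as offsets[i + 1] - offsets[i]. With Thrust, the two vectors would presumably be copied into a device_vector of char and a device_vector of indices (for example through the iterator-range constructor); that step is left out here so the sketch compiles without CUDA.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical container: every sequence packed into one character buffer,
// plus the starting index of each sequence within that buffer.
struct PackedSequences {
    std::vector<char> chars;           // all sequences, stored back to back
    std::vector<std::size_t> offsets;  // offsets[i] = start of sequence i;
                                       // the final entry equals chars.size()
};

// Pack a list of strings into the flat layout described above.
PackedSequences pack(const std::vector<std::string>& lines) {
    PackedSequences p;
    p.offsets.reserve(lines.size() + 1);
    for (const std::string& s : lines) {
        p.offsets.push_back(p.chars.size());                // where this sequence starts
        p.chars.insert(p.chars.end(), s.begin(), s.end());  // append its characters
    }
    p.offsets.push_back(p.chars.size());  // sentinel (assumption, not in the answer)
    return p;
}

// Rebuild sequence i on the host, e.g. to check the packing.
std::string sequence_at(const PackedSequences& p, std::size_t i) {
    assert(i + 1 < p.offsets.size());
    return std::string(p.chars.begin() + p.offsets[i],
                       p.chars.begin() + p.offsets[i + 1]);
}
```

Only plain char and integer data ends up in the two buffers, which is what makes the layout usable on the device. If every sequence has the same length, as in the sample data, the offsets are simply multiples of that length and the fixed-length char[] struct may be the simpler choice.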