
How to store data of a file using thrust::host_vector or device_vector?


The format of data is something like this:

TGCCACAGGTTCCACACAACGGGACTTGGTTGAAATATTGAGATCCTTGGGGGTCTGTTAATCGGAGACAGTATCTCAACCGCAATAAACCC
GTTCACGGGCCTCACGCAACGGGGCCTGGCCTAGATATTGAGGCACCCAACAGCTCTTGGCCTGAGAGTGTTGTCTCGATCACGACGCCAGT
TGCCACAGGTTCCACACAACGGGACTTGGTTGAAATATTGAGATCCTTGGGGGTCTGTTAATCGAAGACAGTATCTCAACCGCAATAAACCT
TGCCACAGGTTCCACACAACGGGACTTGGTTGAAATATTGAGATCCTTGGGGGTCTGTTAATCGAAGACAGTATCTCAACCGCAATAAACCT

Each line contains one sequence. I want to build (key, value) pairs where the key is a sequence and the value is 1, then use reduce_by_key to count how many times each sequence occurs.

But I found that thrust::host_vector can only store one sequence: when I push_back the second sequence, the program crashes. Here is my code:

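#include <fstream>
#include <iostream>
#include <string>
#include <thrust/host_vector.h>

using namespace std;

int main()
{
    // Read one sequence per line from the input file.
    ifstream input_subset("subset.txt");
    thrust::host_vector<string> h_output_subset;

    string s;
    while (getline(input_subset, s)) {
        h_output_subset.push_back(s);  // crashes on the second push_back
    }
    cout << h_output_subset.size() << endl;
    return 0;
}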

Is it possible to store all of the data in a host_vector or a device_vector? If not, is there another way to solve this problem?


Solution

  • The host_vector segfault was confirmed as a bug in thrust::uninitialized_copy, and a patch has been applied to fix it.

    The problem with doing this in a device_vector is a genuine limitation of CUDA (there is no std::string support in device code) and cannot be avoided. One alternative is to store each sequence in a fixed-length char[] array, for example as a data member of the element type held in a device_vector. Another is to use a single large device_vector to hold all of the string data, with a second device_vector holding the starting index of each substring within the character array.
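    The second alternative (one flat character buffer plus a vector of starting indices) can be sketched on the host side as follows. This is only an illustration under stated assumptions, not code from the original answer: the names PackedSequences, pack_lines, sequence_count and sequence_at are hypothetical, blank lines are skipped, and one extra end-of-data entry is appended to the offsets so that each length can be computed as offsets[i + 1] - offsets[i]. The sketch uses only the C++ standard library so that it compiles without CUDA.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical container: every sequence stored back to back in one buffer,
// plus the starting index of each sequence inside that buffer.
struct PackedSequences {
    std::vector<char> chars;           // concatenated sequence characters
    std::vector<std::size_t> offsets;  // offsets[i] = start of sequence i;
                                       // final entry = chars.size() (assumed sentinel)
};

// Read one sequence per line and pack the lines into the two flat buffers.
PackedSequences pack_lines(std::istream& in) {
    PackedSequences packed;
    std::string line;
    while (std::getline(in, line)) {
        if (line.empty()) continue;    // assumption: blank lines carry no data
        packed.offsets.push_back(packed.chars.size());
        packed.chars.insert(packed.chars.end(), line.begin(), line.end());
    }
    packed.offsets.push_back(packed.chars.size());  // end-of-data sentinel
    return packed;
}

// Number of sequences stored (the sentinel is not a sequence).
std::size_t sequence_count(const PackedSequences& packed) {
    return packed.offsets.size() - 1;
}

// Rebuild sequence i as a std::string, e.g. for checking results on the host.
std::string sequence_at(const PackedSequences& packed, std::size_t i) {
    assert(i + 1 < packed.offsets.size());
    const char* base = packed.chars.data();
    return std::string(base + packed.offsets[i], base + packed.offsets[i + 1]);
}
```

    Because both buffers hold plain char and integer elements, they should be copyable to the GPU through the range constructor of thrust::device_vector, for example thrust::device_vector<char> d_chars(packed.chars.begin(), packed.chars.end()), and likewise for the offsets. Counting duplicates on the device would still require a comparison that reads the characters through these offsets; that part is not shown here.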