Tags: c++, arrays, cuda, gpu, thrust

CUDA array sorting with Thrust, not enough memory


I'm trying to sort an array using Thrust, but it doesn't work if the array is too big. (I have a GTX 460 with 1 GB of memory.)

I'm using CUDA with C++ integration on VS2012. Here is my code:

my .cpp

extern "C" void thrust_sort(uint32_t *data, int n);

int main(int argc, char **argv){
    int n = 2<<26;
    uint32_t * v = new uint32_t[n];
    srand(time(NULL));
    for (int i = 0; i < n; ++i) {
        v[i] = rand()%n;
    }

    thrust_sort(v, n);

    delete [] v;
    return 0;
}

my .cu

extern "C"
void thrust_sort(uint32_t *data, int n){
    thrust::device_vector<uint32_t> d_data(data, data + n);
    thrust::stable_sort(d_data.begin(), d_data.end());
    thrust::copy(d_data.begin(), d_data.end(), data);
}

The program stops working at the start of stable_sort().


  1. How much more memory does stable_sort() need?
  2. Is there a way to fix this? (Even if it makes it a bit slower.)
  3. Is there another sorting algorithm that doesn't require more memory than the original array?

Thanks for your help :)


Solution

  • There are techniques in the literature for sorting data that is too big to fit in RAM (external sorting), such as writing partially sorted runs to files and merging them later. An example: Sorting a million 32-bit integers in 2MB of RAM using Python

    Your problem is simpler, since your input fits in RAM but is too large for your GPU. You can solve it with the Parallel Sorting by Regular Sampling (PSRS) strategy. You can see here an example of this technique applied to quicksort.

    Long story short, you divide the array into smaller sub-arrays that fit in the GPU's memory. You sort each sub-array, and at the end you merge the results following the Regular Sampling approach.

    You can use a hybrid approach, sorting some of the sub-arrays on the CPU by assigning each one to a different core (multi-threading) while sending other sub-arrays to the GPU. You can even distribute the work across several machines using a message passing interface such as MPI. Or you can simply sort each sub-array one-by-one on the GPU and do the final merge step on the CPU, taking advantage (or not) of its multiple cores; a sketch of this last variant follows below.
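
The following is a minimal sketch of that last variant, not PSRS proper: it sorts the array in fixed-size chunks, sending one chunk at a time to the GPU with Thrust, and then merges the sorted chunks on the CPU with std::inplace_merge. The helper name chunked_thrust_sort and the default chunk size chunk_elems are my own choices; tune the chunk size to what actually fits in your card's free memory (Thrust's sort typically needs roughly another chunk-sized temporary allocation for 32-bit keys).

#include <algorithm>
#include <cstdint>
#include <vector>
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/copy.h>

// Sort data[0..n) on the GPU one chunk at a time, then merge on the CPU.
// chunk_elems is an assumed value: 1 << 25 elements is 128 MB of uint32_t,
// which should leave room on a 1 GB card for Thrust's temporary storage.
void chunked_thrust_sort(uint32_t *data, size_t n, size_t chunk_elems = 1 << 25)
{
    std::vector<size_t> bounds;                    // chunk start offsets
    for (size_t begin = 0; begin < n; begin += chunk_elems) {
        size_t end = std::min(begin + chunk_elems, n);
        bounds.push_back(begin);

        // Sort this chunk on the device and write it back in place.
        thrust::device_vector<uint32_t> d_chunk(data + begin, data + end);
        thrust::sort(d_chunk.begin(), d_chunk.end());
        thrust::copy(d_chunk.begin(), d_chunk.end(), data + begin);
    }
    bounds.push_back(n);

    // Merge the sorted chunks one after another on the host.
    // After step i, the range [0, bounds[i]) is fully sorted.
    for (size_t i = 2; i < bounds.size(); ++i) {
        std::inplace_merge(data, data + bounds[i - 1], data + bounds[i]);
    }
}

This would live in the .cu file (so nvcc compiles the Thrust calls) and be called from main() in place of thrust_sort. The repeated inplace_merge passes cost roughly O(n × number_of_chunks) overall, so with many chunks a proper multiway merge (or the PSRS merge step) would be faster, but this keeps the sketch short.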