Search code examples
c++sortingbucket-sortcounting-sort

Finding the beginning index for counting sort


int schoolToIndex(string school) {
    if (school == "UCB")  return 0;
    if (school == "UCD")  return 1;
    if (school == "UCI")  return 2;
    if (school == "UCLA") return 3;
    if (school == "UCM")  return 4;
    if (school == "UCSD") return 5;
    if (school == "UCSF") return 6;

    cerr << "Unknown school " << school << endl;
    return -1;
}



void sortByGroupById2(Student students[], int len) {
    int numberofschools = 7;
    int counters[numberofschools];

    for (int i = 0; i < numberofschools; i++) {
        counters[i] = 0;
    }

    for (int i = 0; i < numberofschools; i++) {
        counters[schoolToIndex(students[i].getSchool())]++;
    }

    Student *sortedArray = new Student[len];

    for (int i = 0; i < len; i++) {
    sortedArray[counters[schoolToIndex(students[i].getSchool())]] = students[i];
    counters[schoolToIndex(students[i].getSchool())]++;
    }

    for (int i = 0; i < len; i++) {
        students[i] = sortedArray[i];
    }

}

int main() {
    const int LEN = 350000;

    // Rough timing
    Student* uc2 = readStudentsFromFile("uc_students_sorted_by_id.txt", LEN);
    time(&start);
    sortByGroupById2(uc2, LEN);
    time(&end);
    cout << "Using counting sort it took " << difftime(end, start) << " seconds." << endl;

    writeStudentsToFile(uc1, LEN, "uc_by_school_by_id1.txt");
    writeStudentsToFile(uc2, LEN, "uc_by_school_by_id2.txt");
    return 0;
}

The specific problem I have in question is in the code

 sortedArray[counters[schoolToIndex(students[i].getSchool())]] = students[i],

I have the beginning index of sortedArray be the number of students of the school. What I am unsure of how to do is to have the beginning index be the cumulative number of students of the schools before.

For example if I want the beginning index of the UCLA, I would need to add the number of students of UCB and UCD and UCI in order to get the beginning index of this bucket.

So my plan of action would be to have the counters array to store the combined values of the number of students. For example if my counters array had [5, 10, 15, 20] as the number of students, I would want it to store [5, 15, 30, 50] to be the array of beginning indices for my sortedArray.

Is there any method I can use for this? Do i use recursion?


Solution

  • For an array of starting indexes, what you probably want to end up with is [0,5,15,30] (note the last count of 20 is not used). You could make counters 1 element larger to do this, or you could use two count variables. The counts need to scan for all students, which is len, not just number of schools.

    using two temp variables, sum and cnt:

        for (int i = 0; i < len; i++) {
            counters[schoolToIndex(students[i].getSchool())]++;
        }
    
        sum = 0;
        for (int i = 0; i < numberofschools; i++) {
            cnt = counters[schoolToIndex(students[i].getSchool())];
            counters[schoolToIndex(students[i].getSchool())] = sum;
            sum += cnt;
        }
    

    If you make counters one larger:

        int counters[numberofschools+1];
        // ...
        for (int i = 0; i <= numberofschools; i++) {
            counters[i] = 0;
        }
        for (int i = 0; i < len; i++) {
            // note the [1 + ...] only used here, not later in the actual sort
            counters[1+schoolToIndex(students[i].getSchool())]++;
        }
        for (int i = 2; i <= numberofschools; i++) {
            counters[schoolToIndex(students[i  ].getSchool())] += 
            counters[schoolToIndex(students[i-1].getSchool())];
        }
    

    In either case, the last count / index is not used, since that's the index to the end of the data, and the array is to be used as an array of starting indexes.

    The sort will be stable starting at the first element and ending with last element. I see another answer with the alternative method of starting with the last element traversing backwards to first element which is also stable, but not as cache friendly as starting with first element.