Search code examples
c++chigh-load

How much every character occurs in the given string


I need to calculate how many times every character occurs in the given string. I need to do it on C or C++, I can use any library. The problem is that I am not a C/C++ developer, so I am not sure that my code is optimal. I want to get the best performance algorithm, it is the main reason for this question.

I am using the following code at the moment:

using namespace std;
...

char* text;        // some text, may be very long
int text_length;   // I know this value, if it can help

map<char,int> table;
map<char,int>::iterator it;

for(int i = 0; c = text[i]; i++) {
    it = table.find(c);
    if (it2 == table.end()) {
        table[c] = 1;
    } else {
        table[c]++;
    }
}

I may use any other structure except std::map, but I do not know which structure is better.

Thanks for your help!


Solution

  • You are doing it right using bucket sort. There cannot be a faster (non-parallel) algorithm for counting elements in a finite universe (such as characters).

    If you only use ASCII characters, you can use a simple array int table[256] to avoid the overhead of C++ containers.

    Using Duff's device (which is actually slower on some CPUs nowadays):

    int table[256];
    memset(table, 0, sizeof(table));
    int iterations = (text_length+7) / 8;
    switch(count % 8){
        case 0:      do {    table[ *(text++) ]++;
        case 7:              table[ *(text++) ]++;
        case 6:              table[ *(text++) ]++;
        case 5:              table[ *(text++) ]++;
        case 4:              table[ *(text++) ]++;
        case 3:              table[ *(text++) ]++;
        case 2:              table[ *(text++) ]++;
        case 1:              table[ *(text++) ]++;
                     } while(--iterations > 0);
    }
    

    Update: As MRAB remarked, processing text chunks in parallel might give you a perfomance boost. But be aware that creating a thread is quite expensive, so you should measure, what the lowest amount of characters is, which justifies the thread creation time.