Search code examples
c++sortingduplicatestime-complexitybucket-sort

Can the Duplicate Characters in a string be Identified and Quantified in O(n)?


This comment suggests that there is a O(n) alternative to my O(n log n) solution to this problem:

Given string str("helloWorld") the expected output is:

l = 3
o = 2

My solution was to do this:

sort(begin(str), end(str));

for(auto start = adjacent_find(cbegin(str), cend(str)), finish = upper_bound(start, cend(str), *start); start != cend(str); start = adjacent_find(finish, cend(str)), finish = upper_bound(start, cend(str), *start)) {
   cout << *start << " = " << distance(start, finish) << endl;
}

Which is obviously limited by the sorting of str. I think this would require a bucket sort solution? Is there anything more clever that I'm missing?


Solution

  • Here's one way, which is O(N) at the expense of maintaining storage for every possible char value.

    #include <string>
    #include <limits.h> // for CHAR_MIN and CHAR_MAX. Old habits die hard.
    
    int main()
    {
        std::string s("Hello World");        
        int storage[CHAR_MAX - CHAR_MIN + 1] = {};
        for (auto c : s){
            ++storage[c - CHAR_MIN];
        }
    
        for (int c = CHAR_MIN; c <= CHAR_MAX; ++c){
            if (storage[c - CHAR_MIN] > 1){
                std::cout << (char)c << " " << storage[c - CHAR_MIN] << "\n";
            }
        }    
    }
    

    This portable solution is complicated by the fact that char can be signed or unsigned.