
Defeating the time complexity of Python Dictionary


I have a Python dictionary whose keys are strings consisting of lowercase English letters and whose values are ints. There are exactly 5e6 unique keys, each a string of length exactly 10. Surprisingly, the lookup isn't taking much time: I was expecting the execution to take around 4 s or more, but it is not exceeding 2.5 s.
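To see roughly what the pure dict workload costs on its own, here is a minimal sketch (my own, not from the post; the names N, keys and freq are illustrative). It is scaled down to 10**6 keys so it runs quickly, and uses itertools.product to enumerate length-10 lowercase strings in lexicographic order:

```python
from itertools import islice, product
from string import ascii_lowercase
from time import perf_counter

N = 10**6  # scaled down from 5 * 10**6 so the sketch finishes quickly
# first N length-10 lowercase strings, in lexicographic order
keys = [''.join(t) for t in islice(product(ascii_lowercase, repeat=10), N)]

freq = dict.fromkeys(keys, 0)
start = perf_counter()
for k in keys:
    freq[k] += 1  # one hash + probe + store per key
print(f"{N} dict updates took {perf_counter() - start:.3f}s")
```

On a typical machine the update loop alone takes well under a second per million keys, which suggests that a large share of the full runs below is spent on I/O rather than on hashing.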

I converted the Python code to C++, the analogue of the dictionary being a map. I tried map, unordered_map and gp_hash_table, and all of them took more than 2 s in C++.

Here's the generator I used to produce the unique strings.

from sys import stdout

def increment(l):
    # Treat l as a base-26 odometer over 'a'..'z': carry while the digit
    # is 'z', then bump the first non-'z' digit (assumes no overflow past 'zz...z').
    n = len(l)
    i = n - 1
    while i >= 0 and l[i] == 'z':
        l[i] = 'a'
        i -= 1
    l[i] = chr(ord(l[i]) + 1)

print(5 * 10**6)

string = ['a' for i in range(10)]


for i in range(5 * 10**6):
    stdout.write(''.join(string) + '\n')
    increment(string)

string = ['a' for i in range(10)]

print(10**6)
# this writes 5 * 10**6 lines, but the solvers below read only the
# first 10**6 of them as queries; the rest of the file is ignored
for i in range(5 * 10**6):
    stdout.write(''.join(string) + '\n')
    increment(string)

The output of this program is stored in a file named Strings.txt using the command python3 Test.py > Strings.txt, after which the file Strings.txt looks like this.

5000000
aaaaaaaaaa
aaaaaaaaab
aaaaaaaaac
aaaaaaaaad
aaaaaaaaae
aaaaaaaaaf
...
...
...
aaaaakymlr
1000000
aaaaaaaaaa
aaaaaaaaab
aaaaaaaaac
...
...
...
aaaaakymlq
aaaaakymlr
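As a quick sanity check of the carry logic, the generator's increment function behaves like a base-26 odometer (the function is copied here so the snippet is self-contained; the sample inputs are my own):

```python
def increment(l):
    # base-26 odometer over 'a'..'z' (copied from the generator above)
    n = len(l)
    i = n - 1
    while i >= 0 and l[i] == 'z':
        l[i] = 'a'
        i -= 1
    l[i] = chr(ord(l[i]) + 1)

s = list("aazz")
increment(s)
print(''.join(s))  # prints: abaa
```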

The following is the C++ code I was referring to in the above context.

#include <bits/stdc++.h>
#include <ext/pb_ds/assoc_container.hpp>
using namespace std;
using namespace __gnu_pbds;

int main() {
    ios_base::sync_with_stdio(false);
    cin.tie(NULL);
    int N = 0;
    cin >> N;
    map<string, int> freq;
    // unordered_map<string, int> freq;
    // gp_hash_table<string, int> freq;
    for(int i = 0; i < N; i++) {
        string s;
        cin >> s;
        freq[s]++;
    }
    int Q = 0;
    cin >> Q;
    for(int i = 0; i < Q; i++) {
        string s;
        cin >> s;
        cout << freq[s] << '\n';
    }
    return 0;
}

And the following is the Python 3 code I used.

from collections import defaultdict
from sys import stdin, stdout

# readline keeps the trailing '\n', so every key stored and queried
# includes it; both sides are consistent, so the counts are unaffected
input = stdin.readline

freq = defaultdict(int)
for i in range(int(input())):
    freq[input()] += 1

for i in range(int(input())):
    stdout.write(str(freq[input()]) + '\n')

Here are the results when the above codes are executed.

suman@Skynet:~/Documents/String_Pairs$ python3 Test.py > Strings.txt

suman@Skynet:~/Documents/String_Pairs$ time python3 Dict.py < Strings.txt > P_out.txt

real    0m3.145s
user    0m2.662s
sys     0m0.164s
suman@Skynet:~/Documents/String_Pairs$ time python3 Dict.py < Strings.txt > P_out.txt

real    0m2.772s
user    0m2.568s
sys     0m0.204s


suman@Skynet:~/Documents/String_Pairs$ g++ -o exe Map.cpp -O2 -std=c++17
suman@Skynet:~/Documents/String_Pairs$ time ./exe < Strings.txt > Cpp_out.txt

real    0m2.346s
user    0m2.265s
sys     0m0.080s
suman@Skynet:~/Documents/String_Pairs$ time ./exe < Strings.txt > Cpp_out.txt

real    0m2.513s
user    0m2.417s
sys     0m0.096s


suman@Skynet:~/Documents/String_Pairs$ g++ -o exe Unordered_Map.cpp -O2 -std=c++17
suman@Skynet:~/Documents/String_Pairs$ time ./exe < Strings.txt > Cpp_out.txt

real    0m2.769s
user    0m2.660s
sys     0m0.108s
suman@Skynet:~/Documents/String_Pairs$ time ./exe < Strings.txt > Cpp_out.txt

real    0m2.806s
user    0m2.690s
sys     0m0.116s


suman@Skynet:~/Documents/String_Pairs$ g++ -o exe gp_hash_table.cpp -O2 -std=c++17
suman@Skynet:~/Documents/String_Pairs$ time ./exe < Strings.txt > Cpp_out.txt

real    0m2.099s
user    0m1.686s
sys     0m0.412s
suman@Skynet:~/Documents/String_Pairs$ time ./exe < Strings.txt > Cpp_out.txt

real    0m2.009s
user    0m1.605s
sys     0m0.404s
suman@Skynet:~/Documents/String_Pairs$

Now, the only thing I am concerned about is the fact that Python 3 is normally around 5x slower than C++, yet here it is competing with C++ when working with hash tables.

Is there any way I can beat the time complexity of the Python hash table?

Any help is greatly appreciated.
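One constant-factor trick on the Python side (a suggestion of mine worth benchmarking, not something from the post): passing the whole iterable to collections.Counter moves the counting loop into C, which is usually faster than incrementing a defaultdict in a Python-level for-loop:

```python
from collections import Counter

def count_lines(lines):
    # Counter's constructor counts an entire iterable at C speed,
    # skipping the per-iteration bytecode dispatch of a manual loop
    return Counter(lines)

freq = count_lines(["aaa", "bbb", "aaa"])
print(freq["aaa"], freq["bbb"], freq["ccc"])  # prints: 2 1 0 (missing keys count as 0)
```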

UPDATE: Here's an update that excludes the time taken for reading the strings.

#include <bits/stdc++.h>
#include <ext/pb_ds/assoc_container.hpp>
using namespace std;
using namespace __gnu_pbds;
using namespace chrono;

int main() {
    ios_base::sync_with_stdio(false);
    cin.tie(NULL);
    int N = 0;
    cin >> N;

    vector<string> words(N);

    for(int i = 0; i < N; i++) {
        cin >> words[i];
    }

    // map<string, int> freq;
    // unordered_map<string, int> freq;
    gp_hash_table<string, int> freq;

    auto start = high_resolution_clock::now();
    for(string word : words) {
        freq[word]++;
    }

    auto end = high_resolution_clock::now();
    auto duration = duration_cast<microseconds>(end - start);
    cout << duration.count() / 1e6 << '\n';



    int Q = 0;
    cin >> Q;
    vector<string> queries(Q);
    for(int i = 0; i < Q; i++) {
        cin >> queries[i];
    }

    vector<int> results(Q);
    start = high_resolution_clock::now();
    for(int i = 0; i < Q; i++) {
        results[i] = freq[queries[i]];
    }
    end = high_resolution_clock::now();
    duration = duration_cast<microseconds>(end - start);
    cout << duration.count() / 1e6 << '\n';

    for(int i = 0; i < Q; i++) {
        cout << results[i] << '\n';
    }

    return 0;
}

The Python Code

from collections import defaultdict
from time import time
from sys import stdin, stdout

input = stdin.readline

freq = defaultdict(int)

strings = []

for i in range(int(input())):
    strings.append(input())

start = time()
for string in strings:
    freq[string] += 1
end = time()
print("%.4f" %(end - start))

queries = []
output = []

for i in range(int(input())):
    queries.append(input())

start = time()

for query in queries:
    output.append(freq[query])

end = time()

print("%.4f" %(end - start))

stdout.write('\n'.join(map(str, output)))

Even now, Python is running faster than C++. The results:

Cpp_out.txt (the insertion times for map, unordered_map and gp_hash_table all exceed 2 s):

2.28297
0.109844
1
1
...
...
...

P_out.txt

1.7818
0.1977
1
1
...
...

UPDATE 2: I have modified the code so that it includes neither reading nor writing time, and I now use references everywhere. With that, it is acceptable that C++ beats Python 3 at hashing. Here are the benchmarks.

// CPP Code
#include <bits/stdc++.h>
#include <ext/pb_ds/assoc_container.hpp>
using namespace std;
using namespace __gnu_pbds;
using namespace chrono;

struct djb2 {
    // djb2 string hash; unused by default below. To try it, pass it as the
    // third template argument: gp_hash_table<string, int, djb2>
    unsigned long operator()(const string& str) const {
        unsigned long hash = 5381;
        for (auto c : str)
            hash = ((hash << 5) + hash) + c;
        return hash;
    }
};


int main() {
    ios_base::sync_with_stdio(false);
    cin.tie(NULL);
    int N = 0;
    cin >> N;

    vector<string> words(N);

    for(int i = 0; i < N; i++) {
        cin >> words[i];
    }

    // map<string, int> freq;
    // unordered_map<string, int> freq;
    gp_hash_table<string, int> freq;

    auto start = high_resolution_clock::now();
    for(const string &word : words) {
        freq[word]++;
    }

    auto end = high_resolution_clock::now();
    auto duration = duration_cast<microseconds>(end - start);
    cout << duration.count() / 1e6 << '\n';



    int Q = 0;
    cin >> Q;
    vector<string> queries(Q);
    for(int i = 0; i < Q; i++) {
        cin >> queries[i];
    }

    vector<int> results(Q);
    start = high_resolution_clock::now();
    for(int i = 0; i < Q; i++) {
        results[i] = freq[queries[i]];
    }
    end = high_resolution_clock::now();
    duration = duration_cast<microseconds>(end - start);
    cout << duration.count() / 1e6 << '\n';

    for(int i = 0; i < Q; i++) {
        cout << results[i] << '\n';
    }

    return 0;
}
# Python3 Code (identical to the listing in the previous update)

Cpp_out.txt

1.60026
0.071471

P_out.txt

1.7849
0.1987

So, it is clear that C++'s gp_hash_table beats Python 3's hash table.

I've gone through Python 3's implementation of its hash table. It uses something called SipHash. I want to generate strings such that the number of collisions when the strings are hashed is maximized. It is something like a collision attack, but I want at least 5000 unique strings that produce the same hash.

Can anyone provide any kind of resource for this? (Note that I need at least 5000 unique strings that produce the same hash.)
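For background on why such a collision set is hard to construct offline: CPython keys SipHash with a per-process secret derived from PYTHONHASHSEED, so str hashes are not even stable between runs. A small sketch demonstrating this (the hash_of helper is my own, not part of the post):

```python
import os
import subprocess
import sys

def hash_of(s, seed):
    # spawn a fresh interpreter so the per-process SipHash key is re-derived
    env = {**os.environ, "PYTHONHASHSEED": seed}
    out = subprocess.run([sys.executable, "-c", f"print(hash({s!r}))"],
                         env=env, capture_output=True, text=True)
    return int(out.stdout)

print(hash_of("aaaaaaaaaa", "0"))       # fixed seed: reproducible across runs
print(hash_of("aaaaaaaaaa", "random"))  # randomized: changes on each invocation
```

Any precomputed colliding set would only collide under one particular key, so it cannot be prepared ahead of time against a normally configured interpreter.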


Solution

  • I found the answer to this post myself. Python uses a randomized hashing algorithm (SipHash with a per-process key), which makes it practically impossible to generate two strings (consisting entirely of lowercase or entirely of uppercase characters) that reliably produce the same hash value.

    Now, coming to the performance issue of C++ vs Python 3: I was also counting the time taken for reading and writing, which is why C++ appeared slower than Python 3.

    When I rectified the code and made sure there was no additional overhead in reading/writing or copying (in the C++ case, I used const string& to avoid copying each string in the loop), I found that C++ runs faster than Python 3 when gp_hash_table is used.

    Thanks to everyone who spent time/efforts working on my question.