I have a Python dictionary whose keys are strings of lowercase English letters and whose values are ints. There are exactly 5e6 unique keys, each a string of length exactly 10. Surprisingly, the lookups aren't taking much time: I was expecting execution to take around 4 s or more, but it doesn't exceed 2.5 s.
I converted the Python code to C++, with the dictionary's analogue being a map. I tried map, unordered_map, and gp_hash_table, and all of them took more than 2 s in C++.
Here's the generator I used to produce the unique strings.
from sys import stdout

def increment(l):
    n = len(l)
    i = n - 1
    while i >= 0 and l[i] == 'z':
        l[i] = 'a'
        i -= 1
    l[i] = chr(ord(l[i]) + 1)

print(5 * 10**6)
string = ['a' for i in range(10)]
for i in range(5 * 10**6):
    stdout.write(''.join(string) + '\n')
    increment(string)

string = ['a' for i in range(10)]
print(10**6)
for i in range(5 * 10**6):
    stdout.write(''.join(string) + '\n')
    increment(string)
The output of this program is stored in a file named Strings.txt using the command python3 Test.py > Strings.txt, after which the file Strings.txt looks like this.
5000000
aaaaaaaaaa
aaaaaaaaab
aaaaaaaaac
aaaaaaaaad
aaaaaaaaae
aaaaaaaaaf
...
...
...
aaaaakymlr
1000000
aaaaaaaaaa
aaaaaaaaab
aaaaaaaaac
...
...
...
aaaaakymlq
aaaaakymlr
The following is the C++ code I was referring to above.
#include <bits/stdc++.h>
#include <ext/pb_ds/assoc_container.hpp>
using namespace std;
using namespace __gnu_pbds;

int main() {
    ios_base::sync_with_stdio(false);
    cin.tie(NULL);
    int N = 0;
    cin >> N;
    map<string, int> freq;
    // unordered_map<string, int> freq;
    // gp_hash_table<string, int> freq;
    for (int i = 0; i < N; i++) {
        string s;
        cin >> s;
        freq[s]++;
    }
    int Q = 0;
    cin >> Q;
    for (int i = 0; i < Q; i++) {
        string s;
        cin >> s;
        cout << freq[s] << '\n';
    }
    return 0;
}
And the following is the Python3 code I used.
from collections import defaultdict
from sys import stdin, stdout

# Note: stdin.readline keeps the trailing '\n', so every key includes it;
# since both loops use the same reader, the counts are unaffected.
input = stdin.readline
freq = defaultdict(int)
for i in range(int(input())):
    freq[input()] += 1
for i in range(int(input())):
    stdout.write(str(freq[input()]) + '\n')
Here are the results when the above codes are executed.
suman@Skynet:~/Documents/String_Pairs$ python3 Test.py > Strings.txt
suman@Skynet:~/Documents/String_Pairs$ time python3 Dict.py < Strings.txt > P_out.txt
real 0m3.145s
user 0m2.662s
sys 0m0.164s
suman@Skynet:~/Documents/String_Pairs$ time python3 Dict.py < Strings.txt > P_out.txt
real 0m2.772s
user 0m2.568s
sys 0m0.204s
suman@Skynet:~/Documents/String_Pairs$ g++ -o exe Map.cpp -O2 -std=c++17
suman@Skynet:~/Documents/String_Pairs$ time ./exe < Strings.txt > Cpp_out.txt
real 0m2.346s
user 0m2.265s
sys 0m0.080s
suman@Skynet:~/Documents/String_Pairs$ time ./exe < Strings.txt > Cpp_out.txt
real 0m2.513s
user 0m2.417s
sys 0m0.096s
suman@Skynet:~/Documents/String_Pairs$ g++ -o exe Unordered_Map.cpp -O2 -std=c++17
suman@Skynet:~/Documents/String_Pairs$ time ./exe < Strings.txt > Cpp_out.txt
real 0m2.769s
user 0m2.660s
sys 0m0.108s
suman@Skynet:~/Documents/String_Pairs$ time ./exe < Strings.txt > Cpp_out.txt
real 0m2.806s
user 0m2.690s
sys 0m0.116s
suman@Skynet:~/Documents/String_Pairs$ g++ -o exe gp_hash_table.cpp -O2 -std=c++17
suman@Skynet:~/Documents/String_Pairs$ time ./exe < Strings.txt > Cpp_out.txt
real 0m2.099s
user 0m1.686s
sys 0m0.412s
suman@Skynet:~/Documents/String_Pairs$ time ./exe < Strings.txt > Cpp_out.txt
real 0m2.009s
user 0m1.605s
sys 0m0.404s
suman@Skynet:~/Documents/String_Pairs$
Now, the only thing I am concerned about is that although Python3 is typically around 5x slower than C++, it is competing with C++ here when working with hash tables.
Is there any way I can beat the running time of the Python hash table?
Any help is greatly appreciated.
UPDATE: Here's an update that excludes the time taken for reading the strings.
#include <bits/stdc++.h>
#include <ext/pb_ds/assoc_container.hpp>
using namespace std;
using namespace __gnu_pbds;
using namespace chrono;

int main() {
    ios_base::sync_with_stdio(false);
    cin.tie(NULL);
    int N = 0;
    cin >> N;
    vector<string> words(N);
    for (int i = 0; i < N; i++) {
        cin >> words[i];
    }
    // map<string, int> freq;
    // unordered_map<string, int> freq;
    gp_hash_table<string, int> freq;
    auto start = high_resolution_clock::now();
    for (string word : words) {
        freq[word]++;
    }
    auto end = high_resolution_clock::now();
    auto duration = duration_cast<microseconds>(end - start);
    cout << duration.count() / 1e6 << '\n';
    int Q = 0;
    cin >> Q;
    vector<string> queries(Q);
    for (int i = 0; i < Q; i++) {
        cin >> queries[i];
    }
    vector<int> results(Q);
    start = high_resolution_clock::now();
    for (int i = 0; i < Q; i++) {
        results[i] = freq[queries[i]];
    }
    end = high_resolution_clock::now();
    duration = duration_cast<microseconds>(end - start);
    cout << duration.count() / 1e6 << '\n';
    for (int i = 0; i < Q; i++) {
        cout << results[i] << '\n';
    }
    return 0;
}
The Python Code
from collections import defaultdict
from time import time
from sys import stdin, stdout

input = stdin.readline
freq = defaultdict(int)
strings = []
for i in range(int(input())):
    strings.append(input())
start = time()
for string in strings:
    freq[string] += 1
end = time()
print("%.4f" % (end - start))
queries = []
output = []
for i in range(int(input())):
    queries.append(input())
start = time()
for query in queries:
    output.append(freq[query])
end = time()
print("%.4f" % (end - start))
stdout.write('\n'.join(map(str, output)))
Even now, Python is running faster than C++. The results:
Cpp_out.txt (the times for map, unordered_map, and gp_hash_table are all greater than 2 s).
2.28297
0.109844
1
1
...
...
...
P_out.txt
1.7818
0.1977
1
1
...
...
UPDATE 2: I have modified the code so that I don't include the time taken for reading or writing, and I now use references everywhere. With that, it is acceptable that C++ beats Python3 at hashing. Here are the benchmarks.
// C++ code
#include <bits/stdc++.h>
#include <ext/pb_ds/assoc_container.hpp>
using namespace std;
using namespace __gnu_pbds;
using namespace chrono;

struct djb2 {
    unsigned long operator()(const string& str) const {
        unsigned long hash = 5381;
        for (auto c : str)
            hash = ((hash << 5) + hash) + c;
        return hash;
    }
};

int main() {
    ios_base::sync_with_stdio(false);
    cin.tie(NULL);
    int N = 0;
    cin >> N;
    vector<string> words(N);
    for (int i = 0; i < N; i++) {
        cin >> words[i];
    }
    // map<string, int> freq;
    // unordered_map<string, int> freq;
    gp_hash_table<string, int> freq;
    auto start = high_resolution_clock::now();
    for (const string &word : words) {
        freq[word]++;
    }
    auto end = high_resolution_clock::now();
    auto duration = duration_cast<microseconds>(end - start);
    cout << duration.count() / 1e6 << '\n';
    int Q = 0;
    cin >> Q;
    vector<string> queries(Q);
    for (int i = 0; i < Q; i++) {
        cin >> queries[i];
    }
    vector<int> results(Q);
    start = high_resolution_clock::now();
    for (int i = 0; i < Q; i++) {
        results[i] = freq[queries[i]];
    }
    end = high_resolution_clock::now();
    duration = duration_cast<microseconds>(end - start);
    cout << duration.count() / 1e6 << '\n';
    for (int i = 0; i < Q; i++) {
        cout << results[i] << '\n';
    }
    return 0;
}
# Python3 code
from collections import defaultdict
from time import time
from sys import stdin, stdout

input = stdin.readline
freq = defaultdict(int)
strings = []
for i in range(int(input())):
    strings.append(input())
start = time()
for string in strings:
    freq[string] += 1
end = time()
print("%.4f" % (end - start))
queries = []
output = []
for i in range(int(input())):
    queries.append(input())
start = time()
for query in queries:
    output.append(freq[query])
end = time()
print("%.4f" % (end - start))
stdout.write('\n'.join(map(str, output)))
Cpp_out.txt
1.60026
0.071471
P_out.txt
1.7849
0.1987
So, it is clear that C++'s gp_hash_table beats Python3's hash table.
I've gone through Python3's implementation of its hash table. It uses something called SipHash. I want to generate strings such that the number of collisions when the strings are hashed is maximized. It is something like a collision attack, but I want at least 5000 unique strings to produce the same hash.
Can anyone provide any kind of resource for this (note that I need at least 5000 unique strings that produce the same hash)?
I found the answer to this post myself. Python uses a randomized hashing algorithm (SipHash keyed with a per-process random seed), which makes it practically impossible to deliberately construct even two strings (consisting of entirely lowercase or entirely uppercase characters) that produce the same hash value without knowing the seed.
Now, coming to the performance issues of C++ vs Python3: I was also counting the time taken to read and write, which made C++ appear slower than Python3.
When I rectified the code and made sure there were no additional overheads from read/write or copying (in the case of C++, I used const string& to iterate over the stored strings), I found that C++ runs faster than Python3 when gp_hash_table is used.
Thanks to everyone who spent time/efforts working on my question.