In Python 2.7, what is the preferred way to retrieve the set of unique strings from a redundant list of strings (~10 million strings of length ~20)?
a) sort the list and delete repeating strings

    l.sort()
    unique(l)  # some linear-time function

b) just put them all in a set

    set(l)
Note that I do not care about the order of the strings.
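For option (a), unique(l) is only a placeholder; a minimal sketch of such a linear-time pass, assuming the input list is already sorted (the name unique_sorted is hypothetical), could look like:

    def unique_sorted(sorted_list):
        # Single linear pass: keep an element only when it differs
        # from its predecessor -- this requires the input to be sorted,
        # so that equal elements are adjacent.
        result = []
        prev = object()  # sentinel that compares unequal to any string
        for item in sorted_list:
            if item != prev:
                result.append(item)
                prev = item
        return result

    l = sorted(["b", "a", "b", "c", "a"])
    print(unique_sorted(l))  # ['a', 'b', 'c']
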
I made a simple test to check the running time of both solutions. The first test creates a set and the second sorts the list (for simplicity it does not remove the duplicates afterwards). As expected, creating a set is much faster than sorting, since building a set is O(n) on average while sorting is O(n log n).
    import random
    import string
    import time

    def random_str():
        size = random.randint(10, 20)
        chars = string.ascii_letters + string.digits
        return ''.join(random.choice(chars) for _ in range(size))

    l = [random_str() for _ in xrange(1000000)]

    t1 = time.clock()
    for i in range(10):
        set(l)
    t2 = time.clock()
    print(round(t2 - t1, 3))

    t1 = time.clock()
    for i in range(10):
        sorted(l)
    t2 = time.clock()
    print(round(t2 - t1, 3))
The output I got:
2.77
11.83
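Since the order of the strings does not matter here, both approaches yield the same unique elements. A small sanity check sketching both (using itertools.groupby to collapse runs of equal items in the sorted variant; Python 3-compatible syntax assumed):

    from itertools import groupby

    data = ["b", "a", "b", "c", "a"]

    # Approach (b): hashing -- average O(n)
    via_set = set(data)

    # Approach (a): sort, then collapse runs of equal items -- O(n log n)
    via_sort = [key for key, _ in groupby(sorted(data))]

    print(sorted(via_set) == via_sort)  # True
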