Search code examples
pythonpython-2.7sortingsetunique

python - Sort & Unique vs Set


In python 2.7, in order to retrieve the set of unique strings from a redundant list of strings, what is preferred (~10 million strings of length ~20):

a) sort the list and delete repeating strings

sort(l)
unique(l) #some linear time function

b) just put them all in a set

set(l)

Note that I do not care about the order of the strings.


Solution

  • I made a simple test to check the running time of both solutions. The first test creates a set and the second test sorts the list (it doesn't remove duplicates for the sake of simplicity).

    As expected creating a set is much faster than sorting, since its complexity is O(n) while sorting is O(nlogn).

    import random
    import string
    import time
    
    
    def random_str():
        size = random.randint(10, 20)
        chars = string.ascii_letters + string.digits
        return ''.join(random.choice(chars) for _ in range(size))
    
    
    l = [random_str() for _ in xrange(1000000)]
    
    t1 = time.clock()
    for i in range(10):
        set(l)
    t2 = time.clock()
    print(round(t2-t1, 3))
    
    t1 = time.clock()
    for i in range(10):
        sorted(l)
    t2 = time.clock()
    print(round(t2-t1, 3))
    

    The output I got:

    2.77
    11.83