Search code examples
pythonsethashtable

The difference between set definitions in Python


Let us consider the following four ways to define a set in python, which will contain the elements of an iterable l, the elements of which are l_1, l_2, l_3, ..., l_n:

Method 1:

def defset(l):
    x = set()
    x.update(l)
    return x

Method 2:

set(l)

Method 3:

{i for i in l}

Method 4:

{l_1, l_2, l_3,..., l_n}

The first 3 methods give identical results, but the fourth one doesn't, for some reason.

For example, if I do this:

s1 = defset([1,2,3,11]
s2 = set([1,2,3,11])
s3 = {i for i in [1,2,3,11]}
s4 = {1,2,3,11}

print(s1, s2, s3, s4, sep = "\n")

The output is:

{11,1,2,3}
{11,1,2,3}
{11,1,2,3}
{3,1,2,11}

So why is it that the fourth method gives a set that is different from the sets obtained by the other methods? I know that in the context of sets, all four sets are equal to each other, but a difference arises in the implementation, which changes the order in which the elements are stored in the internal hash table, right?

Why does this difference arise?

Edit

This question does not answer my question. I've already watched The Mighty Dictionary and the follow-up talk The Dictionary Even Mightier regarding how dictionaries store data in the form of a hash table. I'm aware that sets do something similar, and that based the 'order' in the hash table, they iterate over/display their elements.

I mean to ask why defining a set through these seemingly identical and equivalent methods results in the data being stored differently (internally, in the corresponding hash table).


Solution

  • Because the order isn't guaranteed, the answer is likely to differ for different Python implementations, different versions of the same implementation, different computers, different invocations on the same computer, and so on.

    Given that, dis can still shed some light on your particular version. Here is what it does for me:

    Python 3.12.0 (tags/v3.12.0:0fb18b0, Oct  2 2023, 13:03:39) [MSC v.1935 64 bit (AMD64)] on win32
    Type "help", "copyright", "credits" or "license" for more information.
    >>> dis.dis('''s1 = defset([1,2,3,11])
    ... s2 = set([1,2,3,11])
    ... s3 = {i for i in [1,2,3,11]}
    ... s4 = {1,2,3,11}
    ... ''')
      0           0 RESUME                   0
    
      1           2 PUSH_NULL
                  4 LOAD_NAME                0 (defset)
                  6 BUILD_LIST               0
                  8 LOAD_CONST               0 ((1, 2, 3, 11))
                 10 LIST_EXTEND              1
                 12 CALL                     1
                 20 STORE_NAME               1 (s1)
    
      2          22 PUSH_NULL
                 24 LOAD_NAME                2 (set)
                 26 BUILD_LIST               0
                 28 LOAD_CONST               0 ((1, 2, 3, 11))
                 30 LIST_EXTEND              1
                 32 CALL                     1
                 40 STORE_NAME               3 (s2)
    
      3          42 LOAD_CONST               0 ((1, 2, 3, 11))
                 44 GET_ITER
                 46 LOAD_FAST_AND_CLEAR      0 (i)
                 48 SWAP                     2
                 50 BUILD_SET                0
                 52 SWAP                     2
            >>   54 FOR_ITER                 4 (to 66)
                 58 STORE_FAST               0 (i)
                 60 LOAD_FAST                0 (i)
                 62 SET_ADD                  2
                 64 JUMP_BACKWARD            6 (to 54)
            >>   66 END_FOR
                 68 SWAP                     2
                 70 STORE_FAST               0 (i)
                 72 STORE_NAME               4 (s3)
    
      4          74 BUILD_SET                0
                 76 LOAD_CONST               1 (frozenset({3, 1, 2, 11}))
                 78 SET_UPDATE               1
                 80 STORE_NAME               5 (s4)
                 82 RETURN_CONST             2 (None)
            >>   84 SWAP                     2
                 86 POP_TOP
    

    The first three all start out by loading a constant value built into the compiled code: (1, 2, 3, 11)

    The last one starts out different: it loads a different constant value: frozenset({3, 1, 2, 11}).

    What then happens is fairly similar for all four methods: it makes a new set and adds all those values -- either directly or indirectly.

    So the difference is that for s4, at compile time, Python creates a constant frozenset. Presumably, when sets are created from frozensets at runtime, they just keep the same order (in this version at least).

    See what happens if you make a set using the set display that can't be constructed from a constant:

    >>> dis.dis('s = 11; s1 = {1, 2, 3, s}')
      0           0 RESUME                   0
    
      1           2 LOAD_CONST               0 (11)
                  4 STORE_NAME               0 (s)
                  6 LOAD_CONST               1 (1)
                  8 LOAD_CONST               2 (2)
                 10 LOAD_CONST               3 (3)
                 12 LOAD_NAME                0 (s)
                 14 BUILD_SET                4
                 16 STORE_NAME               1 (s1)
                 18 RETURN_CONST             4 (None)
    >>> s = 11
    >>> {1, 2, 3, s}
    {11, 1, 2, 3}
    

    This set too has to be constructed at run time, even though you're using the {...} syntax.

    TL;DR: Probably the difference between sets constructed at compile time versus sets constructed during runtime.