I have a function that rearranges an input list in a certain way and returns the output list. I am confused about what the time and space complexity of the function will be. Below is the code:
def rearrange_list(inp_list):
    d = {}
    output_list = []
    for num in inp_list:
        if num in d:
            d[num] += 1
        else:
            d[num] = 0
            output_list.append(num)
    for k, v in d.items():
        if v > 0:
            for i in range(v):
                output_list.append(k)
    return output_list
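For reference, here is what it does on a small example (the first occurrence of each distinct value stays in order, and the remaining duplicates get moved to the back):

print(rearrange_list([1, 1, 1, 2, 2, 1]))  # [1, 2, 1, 1, 1, 2]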
This is my complexity analysis: I concluded the time complexity is O(n + m^2), where n is the length of the input list and m is the number of distinct values in it.
The main confusion I have is: should I consider iterating through the dictionary to be O(n) too, since in the worst case we will have n items in the list, or should I represent it by m like I did in my analysis, since it can be anything from 0 to n?
Thank you in advance for your help!
Your time and space complexity are both Theta(n). While it can sometimes be useful for clarity to include terms in the time or space complexity that don't change the asymptotic value (a classic example being string-searching algorithms), it doesn't make as much sense here.
While your claim of O(n + m^2) time complexity is technically correct, as Big-O notation is an upper bound, you can show that O(n) is also an upper bound, since the dictionary has size at most n and we iterate over each key exactly once: there are n items read from input, at most n loop iterations for the dictionary, and n items appended to the output list.
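To see both extremes of "the dictionary has size at most n", consider an all-distinct input (m = n, so the second loop appends nothing) and an all-equal input (m = 1, so the second loop appends n - 1 items):

print(rearrange_list(list(range(5))))   # [0, 1, 2, 3, 4]  -- m = n, no duplicate appends
print(rearrange_list([7, 7, 7, 7, 7]))  # [7, 7, 7, 7, 7]  -- m = 1, second loop appends n - 1 items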
If you wanted, you could calculate the 'auxiliary' space required, which is the amount of space needed not counting the input or output arrays. Here, that would be Theta(m). You should note, however, that such analyses are fairly uncommon: by convention, unless specified otherwise, space complexity analysis includes the size of the output.
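To make the accounting concrete, for an input of length n with m distinct values:

auxiliary space: d           -> m entries -> Theta(m)
output space:    output_list -> n entries -> Theta(n)
total space:     Theta(n + m) = Theta(n), since m <= n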
To address a common confusion about why the second loop is still linear time with many duplicate values, let's look at an example.
The lines in question are:
for k, v in d.items():
    if v > 0:
        for i in range(v):
            output_list.append(k)
Suppose our input list was [1, 1, 1, 1, 1, 1, 1, 2, 2, 2] (ten elements total: seven '1's and three '2's).
Then d.items() (which holds the count of each element, minus one) would look like: [(key: 1, value: 6), (key: 2, value: 2)] (it's not really stored as a Python list of tuples internally, but these are the full contents of the items view object).
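You can verify this by running just the first loop on that input:

d = {}
for num in [1, 1, 1, 1, 1, 1, 1, 2, 2, 2]:
    if num in d:
        d[num] += 1
    else:
        d[num] = 0
print(list(d.items()))  # [(1, 6), (2, 2)]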
Let's walk through that second loop's operation, line by line:
for k, v in [(key: 1, value: 6), (key: 2, value: 2)]:
    # On our first iteration, k = 1, v = 6.
    if 6 > 0:                      # 1 operation
        for i in range(6):         # 6 operations
            output_list.append(1)  # 6 operations
    # For the (k, v) pair of (1, 6), the inner loop has run 6 times.
    # Every time the inner loop runs, the length of output_list
    # increases by 1.

# Second iteration of the outer loop:
for k, v in [(key: 1, value: 6), (key: 2, value: 2)]:
    # On our second iteration, k = 2, v = 2.
    if 2 > 0:                      # 1 operation
        for i in range(2):         # 2 operations
            output_list.append(2)  # 2 operations
    # For the (k, v) pair of (2, 2), the inner loop has run 2 times.
    # In total: 8 inner-loop iterations, and output_list has gained
    # 8 elements (plus the 2 appended by the first loop, for a final
    # length of 10 = n).
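If you'd rather verify that count than trust the walkthrough, here is the question's function with a (purely illustrative) iteration counter bolted on:

def rearrange_list_counted(inp_list):
    d = {}
    output_list = []
    for num in inp_list:
        if num in d:
            d[num] += 1
        else:
            d[num] = 0
            output_list.append(num)
    inner_iterations = 0  # added for instrumentation only
    for k, v in d.items():
        if v > 0:
            for i in range(v):
                output_list.append(k)
                inner_iterations += 1
    return output_list, inner_iterations

out, iters = rearrange_list_counted([1, 1, 1, 1, 1, 1, 1, 2, 2, 2])
print(iters)     # 8  (= n - m = 10 - 2)
print(len(out))  # 10 (= n)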
In informal complexity analysis, a 'standard' rule of thumb is that the run-time of a double-nested loop is often quadratic. This is because we're counting the total number of inner-loop iterations, for example
for i in range(n):
    for j in range(n):
as
(n inner-loops per outer-loop) * (n outer-loops) = n^2 inner-loops
This assumption shouldn't be applied when the number of inner-loop iterations varies dramatically based on the state of the outer loop. In our example, the number of inner-loop iterations is v, which depends on the outer loop.
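Counting the inner-loop iterations directly for our example:

total inner-loop iterations = (v for key 1) + (v for key 2)
                            = 6 + 2
                            = 8

which happens to equal 10 - 2 = n - m; the next paragraph shows why that's no coincidence.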
To find the runtime here, we need a different way to count how many inner-loop iterations occur. Luckily, we can do that: in each inner-loop iteration, we append an element to output_list.
Since the final length of output_list is n, we know that the inner loop has executed at most n times (technically, it has executed exactly n - m times, since output_list already has size m after the earlier dictionary-initializing loop has terminated). Instead of incorrectly multiplying this by m, the number of outer-loop iterations, we should instead add the inner and outer loop iterations, for a runtime of Theta(n + m), which is Theta(n).
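If you want an empirical sanity check, here's a rough timing sketch (numbers are machine-dependent, and this assumes rearrange_list from the question is defined in scope): doubling n should roughly double the runtime.

import random
import time

for n in [100_000, 200_000, 400_000, 800_000]:
    data = [random.randrange(n // 10) for _ in range(n)]
    start = time.perf_counter()
    rearrange_list(data)
    print(f"n={n:>7}: {time.perf_counter() - start:.4f}s")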
Addendum: Comments have correctly pointed out that Python dictionaries don't have an O(1) amortized worst-case lookup/insert time guarantee, so the first loop is Omega(m*n) in the worst case. While Python uses pseudo-random probing on an open-addressing table, this only ensures good 'average' performance. Thanks to Kelly Bundy for the highly informative discussion and corrections.
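You can provoke that worst case yourself with a key type whose hashes always collide (a deliberately pathological toy, of course): every insert then has to probe past all existing entries, so filling the dict takes quadratic time overall.

import time

class BadKey:
    # Every instance hashes to the same bucket, forcing collisions.
    def __init__(self, x):
        self.x = x
    def __hash__(self):
        return 42
    def __eq__(self, other):
        return isinstance(other, BadKey) and self.x == other.x

for n in [500, 1_000, 2_000]:
    keys = [BadKey(i) for i in range(n)]
    start = time.perf_counter()
    d = {}
    for key in keys:
        d[key] = 0  # each insert probes O(len(d)) slots
    print(f"n={n}: {time.perf_counter() - start:.3f}s")  # roughly 4x per doubling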
Unfortunately, while O(1) amortized worst-case lookup/insert hashing is possible, for example with cuckoo hashing, in practice it is significantly slower on average than what's currently used in most standard libraries, and that is unlikely to change in the near future.
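For the curious, here is a toy sketch of the cuckoo idea (the CuckooSet class and its structure are mine, not any standard library's API): every key lives in one of exactly two possible slots, so a lookup inspects two slots and is O(1) even in the worst case, while inserts evict and relocate existing keys, rehashing if evictions chase each other in a cycle.

import random

class CuckooSet:
    def __init__(self, capacity=8):
        self.capacity = capacity
        self.tables = [[None] * capacity, [None] * capacity]
        self.seeds = [random.random(), random.random()]

    def _slot(self, which, key):
        # Two independent hash functions, one per table.
        return hash((self.seeds[which], key)) % self.capacity

    def __contains__(self, key):
        # Worst-case O(1): a key can only ever be in these two slots.
        return (self.tables[0][self._slot(0, key)] == key
                or self.tables[1][self._slot(1, key)] == key)

    def add(self, key):  # key must not be None (None marks empty slots)
        if key in self:
            return
        for _ in range(self.capacity):  # bound the eviction chain
            for which in (0, 1):
                i = self._slot(which, key)
                # Place key; pick up whatever was evicted (or None).
                key, self.tables[which][i] = self.tables[which][i], key
                if key is None:
                    return
        # Evictions cycled: grow, rehash everything, and retry.
        self._rehash()
        self.add(key)

    def _rehash(self):
        old = [k for t in self.tables for k in t if k is not None]
        self.capacity *= 2
        self.tables = [[None] * self.capacity, [None] * self.capacity]
        self.seeds = [random.random(), random.random()]
        for k in old:
            self.add(k)

s = CuckooSet()
for x in range(100):
    s.add(x)
print(50 in s, 500 in s)  # True False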