python, pandas, dictionary, vectorization

Is there a vectorized way to iterate over only the keys of an input dict x instead of the full range 1...n?


Given a fixed size n and a dict x of input key-value pairs, the goal is to iterate through 1...n (1-based index), fetching the value from x if the index exists as a key in x and inserting the value -1 otherwise.

I've tried the following and it works as expected:

n = 10  
# Valid keys ranges from [1,10], any positive integer is valid in values.
x = {1:231, 2:341, 5:123} 
y = {i+1:x[i+1] if i+1 in x else -1 for i in range(n)}
y

[out]:

{1: 231, 2: 341, 3: -1, 4: -1, 5: 123, 6: -1, 7: -1, 8: -1, 9: -1, 10: -1}

But this seems like a very common pandas or encoding / embedding operation.

Is there a different/simpler way that takes the sparse key-value pairs from x and directly creates y, given that we know n, in O(len(x)) instead of iterating through all n indices in O(n)?

The rationale: if I have billions of x dicts and n is large, e.g. in the thousands, then the full O(n) operation is really expensive.


Solution

  • The pandas equivalent would be to reindex:

    import pandas as pd
    y = pd.Series(x).reindex(range(1, n+1), fill_value=-1)
    

    However, I believe that trying to assign all values up front is probably not the right approach. No matter whether it runs at Python or C speed, this is still O(n) work per dict, which will be expensive if n is large.

    Rather, use a defaultdict or setdefault to take advantage of "on-demand" creation of key/value pairs in your dictionary:

    from collections import defaultdict
    y = defaultdict(lambda : -1, x)
    y[1]
    # 231
    y[4]
    # -1
    

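    Or maybe, with an explicit bounds check (valid keys are in [1, n]):

    n = 100
    i = 50
    
    if 1 <= i <= n:
        # Returns x[i] if present; otherwise stores -1 under i and returns it.
        value = x.setdefault(i, -1)
    else:
        raise ValueError(f'key {i} should be in [1, {n}]')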
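
One thing to keep in mind with both options above: a defaultdict lookup and setdefault each store the missing key, so the dict grows as you probe it. If that matters, a possible variation is a small dict subclass that defines `__missing__`, which returns the default without storing anything and can also enforce the [1, n] range. The sketch below is only an illustration: `SparseRow`, `fill`, and `to_list` are hypothetical names, not part of any library, and it assumes all keys in x are integers in [1, n].

```python
class SparseRow(dict):
    """Hypothetical helper: a dict that reports a fill value for absent keys in [1, n]."""

    def __init__(self, data, n, fill=-1):
        super().__init__(data)
        self.n = n
        self.fill = fill

    def __missing__(self, key):
        # Called by dict.__getitem__ only when the key is absent.
        # Nothing is stored, so the dict keeps its original len(x) entries.
        if isinstance(key, int) and 1 <= key <= self.n:
            return self.fill
        raise KeyError(key)

    def to_list(self):
        # Dense form, assuming every stored key is in [1, n].
        # Allocating the list is still O(n), but the Python-level loop
        # only visits the len(x) stored entries.
        dense = [self.fill] * self.n
        for key, value in self.items():
            dense[key - 1] = value
        return dense


x = {1: 231, 2: 341, 5: 123}
y = SparseRow(x, n=10)

print(y[1])         # 231
print(y[4])         # -1
print(len(y))       # 3
print(y.to_list())  # [231, 341, -1, -1, 123, -1, -1, -1, -1, -1]
```

Note that `__missing__` is only consulted by `y[key]`; `y.get(key)` and `key in y` behave as on a plain dict. If a dense output is eventually required, some O(n) allocation is unavoidable, so the saving is in skipping the per-index membership test rather than in the output size.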