python list dictionary parsing keyword-argument

Efficient and/or readable ways to flatten nested **kwargs while preserving key-value pairings?

Description

I am trying to create a helper function which invokes another function multiple times. For the helper function, I want variables to be passed in as **kwargs so as to allow the main function to determine the default values of each parameter.

The arguments passed in can be variable-length iterables and should be regrouped into a list of dictionaries, one per invocation of the main function. Here is an example of the input and the parsed form:



{'param1': ['arg1'], 'param2': ['arg1', 'arg2', 'arg3'], 'param3': ['arg1', 'arg2']}

#=>

[{'param1': 'arg1', 'param2': 'arg1', 'param3': 'arg1'}, {'param2': 'arg2', 'param3': 'arg2'}, {'param2': 'arg3'}]

Is there a built-in function in Python that flattens a dictionary in this way? I want to preserve the key-value pairings because they will be used as keyword arguments when invoking the main function.

What I tried:

First, I tried to avoid passing **kwargs into the main function by converting the arguments into lists and then passing them into itertools.zip_longest().


for data, param1, param2, param3 in itertools.zip_longest(external_data, argv1, argv2, argv3):
  foo(data, param1, param2, param3) # Invoke main function

However, this forces the use of None or some other fill value, which is then passed explicitly and overrides the defaults defined by the main function.
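A minimal sketch of that problem, assuming a hypothetical foo with two defaulted parameters (the names and default values here are illustrative, not from my real code):

```python
import itertools


def foo(data, param1='default1', param2='default2'):
  # Hypothetical stand-in for the main function; it just returns what it received
  return (data, param1, param2)


external_data = ['target1', 'target2']
argv1 = ['arg1']            # shorter than external_data
argv2 = ['arg1', 'arg2']

calls = [foo(data, p1, p2)
         for data, p1, p2 in itertools.zip_longest(external_data, argv1, argv2)]

# The second call receives param1=None explicitly, so 'default1' is never used.
print(calls)
```

Because the fill value is passed positionally, the function cannot tell it apart from an argument the caller really meant to supply.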

Second, I used a nested list comprehension to parse **kwargs and create a list of dictionaries similar to what I described above.



foo = [{k: v[idx]
       for k, v in kwargs.items() if idx < len(v) and v[idx] is not escape}
       for idx in range(len(longest_argument_list))]

However, this forced me to iterate over all the kwargs.values() to get the length of the longest argument list before parsing **kwargs.
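For reference, the extra pass can at least be written compactly. This is only a sketch: longest_argument_count and flatten_by_index are hypothetical names, and the pass over kwargs.values() is still there. Using max with default=0 also avoids the ValueError that max raises when no keyword arguments are given.

```python
def longest_argument_count(**kwargs):
  # Hypothetical helper: length of the longest argument list, 0 if none are given
  return max(map(len, kwargs.values()), default=0)


def flatten_by_index(escape=None, **kwargs):
  # Same nested comprehension as above, with the length computed by the helper
  return [{k: v[idx]
           for k, v in kwargs.items() if idx < len(v) and v[idx] is not escape}
          for idx in range(longest_argument_count(**kwargs))]


print(flatten_by_index(param1=['arg1'],
                       param2=['arg1', 'arg2', 'arg3'],
                       param3=['arg1', 'arg2']))
```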

What I am looking for

Ideally, there is a simpler way to flatten **kwargs into multiple dictionaries using a built-in function. If not, there may be a built-in that has better performance than the nested list comprehension method.

It would be nice, but not necessary, to allow some form of sentinel value to signal the need to skip over a specific function invocation's argument (e.g. passing in param1=['arg1', None, 'arg3'] to allow the second invocation of main to use the default value for param1).

Script to showcase intended behaviour:

import collections
import inspect


def invoked_function(param, param1=None, param2='', param3='.'):
  """This function only prints its own call, but it would
   perform some actions using param and **kwargs"""
  variables = inspect.currentframe().f_locals
  function = inspect.currentframe().f_code.co_name
  output = f'{function}(**{variables})'
  print(output)


def helper_function(**kwargs):
  external_data = ['target1', 'target2', 'target3', 'target4']
  longest_argument_list = max(kwargs.values(), key=len)
  escape = None
  foo = [{k: v[idx]
         for k, v in kwargs.items() if idx < len(v) and v[idx] is not escape}
         for idx in range(len(longest_argument_list))]
  foo = collections.deque(foo)
  for target in external_data:
    kwargs = foo.popleft() if foo else {}
    invoked_function(target, **kwargs)


if __name__ == '__main__':
  helper_function(param1=['arg1'],
                  param2=['arg1', 'arg2', 'arg3'],
                  param3=['arg1', 'arg2'])

The above script works as is.

Solution

After thinking over the problem more, I realised that the original program structure could be improved by separating the flattening function from the helper function.

Using Andrej Kesely's answer as inspiration for how to use zip_longest, I came up with this solution:

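import itertools


def generate_flattened_kwargs(**kwargs):
  # Pair the parameter names with each row of positionally aligned arguments.
  # zip_longest pads the shorter argument lists with None.
  keyword_argument_mappings = map(zip,
                                  itertools.cycle([kwargs]),
                                  itertools.zip_longest(*kwargs.values()))
  # Iterate lazily; wrapping the map in list() would build every row up front.
  for keyword_arguments in keyword_argument_mappings:
    yield dict(keyword_arguments)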

The most notable drawback is that entries padded with the itertools.zip_longest fillvalue (None by default) are not filtered out, so they still appear in the generated dictionaries.

In exchange, this implementation is faster than building the dictionaries with nested comprehensions (see the benchmark below).
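To illustrate the drawback, here is the function again in self-contained form together with a small call. The input values are only an example:

```python
import itertools


def generate_flattened_kwargs(**kwargs):
  # Pair the parameter names with each row produced by zip_longest
  keyword_argument_mappings = map(zip,
                                  itertools.cycle([kwargs]),
                                  itertools.zip_longest(*kwargs.values()))
  for keyword_arguments in keyword_argument_mappings:
    yield dict(keyword_arguments)


result = list(generate_flattened_kwargs(param1=['arg1'],
                                        param2=['arg1', 'arg2']))
# The second dictionary still carries param1=None, the zip_longest fill value.
print(result)
```

Passing such a dictionary on with ** would hand None to the main function explicitly, which is the same default-overriding problem described in the question.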

Updated Function with sentinels

After doing some more thinking on this problem (and needing a function that supports sentinel values) I wrote this up:

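def generate_flattened_kwargs_with_sentinel(sentinel=None, **kwargs):
  # Pad shorter argument lists with the sentinel so every row has one value per key.
  arguments = itertools.zip_longest(*kwargs.values(), fillvalue=sentinel)
  # Duplicate the row iterator: one copy supplies the values, the other serves
  # as the selector sequence for itertools.compress.
  arguments, sentinel_filter = itertools.tee(arguments)
  keyword_argument_pairs = map(zip,
                               itertools.cycle([kwargs]),
                               arguments)
  # compress keeps a (key, value) pair only when its value is truthy.
  filtered_pairs = map(itertools.compress,
                       keyword_argument_pairs,
                       sentinel_filter)
  for flat_kwargs in filtered_pairs:
    yield dict(flat_kwargs)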

It is almost as fast as the function above that has no sentinel support. However, it gives up some readability and code flow in exchange. Additionally, because itertools.compress selects by truthiness, the sentinel must be falsy (e.g. 0, None, [], {}, ''), and every other falsy argument value is dropped as well, not only the sentinel itself.
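If falsy arguments such as 0 or '' need to survive, one option is to filter by identity instead of truthiness. The following is a sketch only: flatten_kwargs_by_identity and _MISSING are hypothetical names, it was not part of the benchmark, and it goes back to a dict comprehension, so I make no claim about its speed.

```python
import itertools

# Hypothetical private marker, used only as the zip_longest fill value
_MISSING = object()


def flatten_kwargs_by_identity(sentinel=None, **kwargs):
  rows = itertools.zip_longest(*kwargs.values(), fillvalue=_MISSING)
  for row in rows:
    # Drop padding and explicit sentinels; keep every other value, falsy or not
    yield {key: value for key, value in zip(kwargs, row)
           if value is not _MISSING and value is not sentinel}


flattened = list(flatten_kwargs_by_identity(param1=[0, None, 'x'],
                                            param2=['', 'b']))
print(flattened)
```

With the default sentinel of None, the 0 and '' in the first row are kept, the None in the second row is skipped, and the padding in the third row never reaches the output.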

Benchmark

Here are some numbers I got on my system using the timeit module:

Running tests for many_parameters_few_arguments():
It took 1.3088s to complete andrew_sentinel_function.
It took 1.2698s to complete andrew_nosentinel_function.
It took 2.1734s to complete showcase_function.
It took 1.5139s to complete Andrej_Kesely_function.

Running tests for few_parameters_many_arguments():
It took 0.6311s to complete andrew_sentinel_function.
It took 0.6316s to complete andrew_nosentinel_function.
It took 1.0176s to complete showcase_function.
It took 0.7964s to complete Andrej_Kesely_function.

Unfortunately, the system I was using for testing didn't have enough RAM for many_parameters_many_arguments().

Script

import timeit
import random as r
import itertools


def andrew_sentinel_function(sentinel=None, **kwargs):
  arguments = itertools.zip_longest(*kwargs.values(), fillvalue=sentinel)
  arguments, sentinel_filter = itertools.tee(arguments)
  keyword_argument_pairs = map(zip,
                               itertools.cycle([kwargs]),
                               arguments)
  filtered_pairs = map(itertools.compress,
                       keyword_argument_pairs,
                       sentinel_filter)
  return [dict(kwargs) for kwargs in filtered_pairs]


def andrew_nosentinel_function(**kwargs):
  keyword_argument_mappings = map(zip,
                                  itertools.cycle([kwargs]),
                                  itertools.zip_longest(*kwargs.values()))
  return [dict(keyword_arguments)
          for keyword_arguments in list(keyword_argument_mappings)]


def showcase_function(**kwargs):
  longest_argument_list = max(kwargs.values(), key=len)
  escape = None
  return [{k: v[idx]
          for k, v in kwargs.items() if idx < len(v) and v[idx] is not escape}
          for idx in range(len(longest_argument_list))]

def Andrej_Kesely_function(**kwargs):
  return [{param: value for param, value in zip(kwargs, t1) if value is not None}
          for t1 in itertools.zip_longest(*kwargs.values(), fillvalue=None)]


def few_parameters_many_arguments():
  r.seed(42)
  n_parameters = 100
  n_arguments = 100000
  parameters = [str(i) for i in range(n_parameters)]
  arguments = [[r.randrange(100) for _ in range(n_arguments)]
               for _ in range(n_parameters)]
  return dict(zip(parameters, arguments))


def many_parameters_few_arguments():
  r.seed(42)
  n_parameters = 100000
  n_arguments = 100
  parameters = [str(i) for i in range(n_parameters)]
  arguments = [[r.randrange(100) for _ in range(n_arguments)] for _ in range(n_parameters)]
  return dict(zip(parameters, arguments))


def many_parameters_many_arguments():
  r.seed(42)
  n_parameters = 100000
  n_arguments = 100000
  parameters = [str(i) for i in range(n_parameters)]
  arguments = [[r.randrange(100) for _ in range(n_arguments)] for _ in
               range(n_parameters)]
  return dict(zip(parameters, arguments))


if __name__ == '__main__':
  functions = [
    andrew_sentinel_function,
    andrew_nosentinel_function,
    showcase_function,
    Andrej_Kesely_function
  ]
  setups = [
    'kwargs = many_parameters_few_arguments()',
    'kwargs = few_parameters_many_arguments()',
    'kwargs = many_parameters_many_arguments()'
  ]
  for setup in setups:
    print('')
    print(f"Running tests for {setup.split(' = ')[1]}:")
    for function in functions:
      # `function` and the data generators are resolved through globals()
      elapsed = timeit.timeit('function(**kwargs)', setup, globals=globals(), number=1)
      print(f'It took {elapsed:.4f}s to complete {function.__name__}.')

The difference most likely comes from list() and dict() being implemented in C, whereas list and dict comprehensions execute as Python bytecode.

In practice, the difference is fairly small and will not have much impact unless you are processing large amounts of data. I ended up using the generator implementation because it led to better code reusability in my project.
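As a rough sketch of how the separated generator can plug back into the helper from the question: the filtering generator below follows the None-filtering comprehension from the benchmark script, and the simplified invoked_function, the external_data parameter and the returned list are my own assumptions for illustration.

```python
import itertools


def generate_filtered_kwargs(**kwargs):
  # One dictionary per row; None (fill value or explicit) is skipped
  for row in itertools.zip_longest(*kwargs.values(), fillvalue=None):
    yield {param: value for param, value in zip(kwargs, row)
           if value is not None}


def invoked_function(param, param1=None, param2='', param3='.'):
  # Simplified stand-in for the main function; returns what it received
  return (param, param1, param2, param3)


def helper_function(external_data, **kwargs):
  # Once the generator runs dry, fall back to empty kwargs so defaults apply
  flattened = itertools.chain(generate_filtered_kwargs(**kwargs),
                              itertools.repeat({}))
  return [invoked_function(target, **call_kwargs)
          for target, call_kwargs in zip(external_data, flattened)]


calls = helper_function(['target1', 'target2', 'target3', 'target4'],
                        param1=['arg1'],
                        param2=['arg1', 'arg2', 'arg3'],
                        param3=['arg1', 'arg2'])
print(calls)
```

Chaining with itertools.repeat({}) replaces the deque and the "popleft if not empty" check from the original helper, and zip stops as soon as external_data is exhausted.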