I am doing something similar to the dynamic mapping and collect example in the documentation. The example lists the files in a directory, maps each file to a solid that computes its size, and then collects the outputs into a summary of the overall size.
However, I would like to run multiple solids in parallel on each file. To continue with the example: I would list the files in a directory; then map so that for each file I compute the size, check the file permissions, and compute the md5sum, all in parallel; and finally collect the output.
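For reference, the pattern I'm starting from looks roughly like this. It's just a sketch using the newer op/job names (the solid-based API has the same shape), and list_files, compute_size, summarize, and the directory are placeholders of mine:

import os

from dagster import DynamicOut, DynamicOutput, job, op


@op(out=DynamicOut())
def list_files():
    # One dynamic output per file, so downstream ops can be mapped over them.
    files = [f for f in os.listdir(".") if os.path.isfile(f)]
    for i, name in enumerate(files):
        yield DynamicOutput(value=name, mapping_key=f"file_{i}")


@op
def compute_size(file):
    return os.path.getsize(file)


@op
def summarize(sizes):
    return sum(sizes)


@job
def directory_size():
    # map() fans compute_size out over each dynamic output; collect() gathers the results.
    summarize(list_files().map(compute_size).collect())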
I can run these in sequence on each file with something like:
file_results = (
    list_files()
    .map(compute_size)
    .map(check_permissions)
    .map(compute_md5sum)
)
summarize(file_results.collect())
But if these aren't actually serial dependencies, it would be nice to parallelize the work on each file. Is there some syntax like this:
file_results = list_files().map(
    compute_md5sum(check_permissions(compute_size)))
summarize(file_results.collect())
If I understand correctly, something like this should accomplish what you are looking for:
def _process_file(file):
    size = compute_size(file)
    perms = check_permissions(file)
    md5 = compute_md5sum(file)
    return summarize_file(size, perms, md5)

file_results = list_files().map(_process_file)
summarize(file_results.collect())
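Spelled out as a complete job, that could look roughly like the following. It uses the op/job API rather than solids, and every op body here (compute_size, check_permissions, compute_md5sum, summarize_file, summarize) is just a placeholder implementation to make the sketch runnable:

import hashlib
import os

from dagster import DynamicOut, DynamicOutput, job, op


@op(out=DynamicOut())
def list_files():
    files = [f for f in os.listdir(".") if os.path.isfile(f)]
    for i, name in enumerate(files):
        yield DynamicOutput(value=name, mapping_key=f"file_{i}")


@op
def compute_size(file):
    return os.path.getsize(file)


@op
def check_permissions(file):
    return oct(os.stat(file).st_mode & 0o777)


@op
def compute_md5sum(file):
    with open(file, "rb") as f:
        return hashlib.md5(f.read()).hexdigest()


@op
def summarize_file(size, perms, md5):
    return {"size": size, "perms": perms, "md5": md5}


@op
def summarize(per_file_results):
    return {"file_count": len(per_file_results),
            "total_size": sum(r["size"] for r in per_file_results)}


def _process_file(file):
    # These three ops take the same upstream input and don't depend on each
    # other, so the executor is free to run them concurrently for every file.
    size = compute_size(file)
    perms = check_permissions(file)
    md5 = compute_md5sum(file)
    return summarize_file(size, perms, md5)


@job
def audit_directory():
    summarize(list_files().map(_process_file).collect())

Because the three ops invoked inside _process_file only depend on the mapped file and not on each other, an executor that supports concurrency (for example the multiprocess executor) can run them in parallel for each file before summarize_file joins their results.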