Search code examples
dagster

Can I use the output of one DynamicResource after a map() for multiple solids?


I am doing something similar to the dynamic mapping and collect example in the documentation. The example lists files in a directory, maps each to a solid which computes the file size, and then collects the output for a summary of the overall size.

However, I would like to run multiple solids in parallel on each solid. So to continue with the example: I would list files in a directory; then map so that for each file I would compute the size, check the file permissions, and compute the md5sum all in parallel; and finally collect the output.

I can run these in sequence on each file with something like:

file_results = list_files()
.map(compute_size)
.map(check_permissions)
.map(compute_md5sum)
summarize(file_results.collect())

But if these aren't actually serial dependencies, it would be nice to parallelize the work on each file. Is there some syntax like this:

file_results = list_files().map(
compute_md5sum(check_permissions(compute_size)))
summarize(file_results.collect())

Solution

  • If I understand correctly, something like this should accomplish what you are looking for:

    def _process_file(file):
        size = compute_size(file)
        perms = check_permissions(file)
        hash = compute_md5sum(file)
        return summarize_file(size, perms, hash)
    
    
    file_results = list_files().map(_process_file)
    summarize(file_results.collect)