python pipeline snakemake

Snakemake: list of paths in input


I am sorry for the low-level question; I am a junior developer. I am trying to learn Snakemake along with click. Please help me understand, for this example, how I can pass a list of paths as input to a rule, and how to receive this list in a Python script.

Snakemake:

path_1 = 'data/raw/data2process/'
path_2 = 'data/raw/table.xlsx'
    rule:
        input:
             list_of_pathes = "list of all pathes to .xlsx/.csv/.xls files from path_1"
             other_table = path_2
        output:
             {some .xlsx file}
        shell:
             "script_1.py {input.list_of_pathes} {output}"
             "script_2.py {input.other_table} {output}"

script_1.py:

@click.command()
@click.argument(input_list_of_pathes, type=*??*)
@click.argument("out_path",  type=click.Path())
def foo(input_list_of_pathes: list, out_path: str):
    df = pd.DataFrame()
    for path in input_list_of_pathes:
        table = pd.read_excel(path)
        **do smthng**
        df = pd.concat([df, table])
    df.to_excel(out_path)

script_2.py:

@click.command()
@click.argument("input_path", type=click.Path(exists=True))
@click.argument("output_path",  type=click.Path())
def foo_1(input_path: str, output_path: str):
    table = pd.read_excel(input_path)
    **do smthng**
    table.to_excel(output_path)

Solution

  • Using pathlib, and the glob method of a Path object, you could proceed as follows:

    from itertools import chain
    from pathlib import Path
    path_1 = Path('data/raw/data2process/')
    exts = ["xlsx", "csv", "xls"]
    path_1_path_lists = [
        list(path_1.glob(f"*.{ext}"))
        for ext in exts]
    path_1_all_paths = list(chain.from_iterable(path_1_path_lists))
    

    chain.from_iterable "flattens" the list of lists into a single list, but I'm not sure Snakemake even needs this for the input of its rules.
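To see what the flattening step does, here is a minimal sketch run against a throwaway directory. The helper name `collect_paths` and the demo file names are assumptions made up for illustration; only `Path.glob` and `chain.from_iterable` come from the answer above.

```python
import tempfile
from itertools import chain
from pathlib import Path


def collect_paths(folder, exts):
    """Return a sorted, flat list of files in `folder` matching any extension in `exts`."""
    # One glob result per extension, e.g. *.xlsx, *.csv, *.xls
    per_ext = [folder.glob(f"*.{ext}") for ext in exts]
    # chain.from_iterable merges the per-extension results into one sequence
    return sorted(chain.from_iterable(per_ext))


# Hypothetical demo directory, created only for illustration
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp)
    for name in ["a.xlsx", "b.csv", "c.xls", "notes.txt"]:
        (demo_dir / name).touch()
    found = [p.name for p in collect_paths(demo_dir, ["xlsx", "csv", "xls"])]

print(found)
```

Sorting is optional, but it keeps the order of the inputs stable between runs, since `glob` does not promise any particular order.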

    Then, in your rule:

    input:
        list_of_paths = path_1_all_paths,
        other_table = path_2
    

    I think that Path objects can be used directly. Otherwise, you need to turn them into strings with str:

    input:
        list_of_paths = [str(p) for p in path_1_all_paths],
        other_table = path_2
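For the second half of the question (receiving the list in the script), one possible approach is click's variadic arguments. When a named input holding several files is used in a `shell` command, Snakemake expands it to the paths separated by spaces, so the script sees them as several positional arguments. The sketch below assumes that behaviour and uses `nargs=-1` to collect them; the argument name `input_paths` and the placeholder body are assumptions for illustration, not the original script.

```python
import click


@click.command()
# nargs=-1 collects every positional argument except the last one into a tuple
@click.argument("input_paths", nargs=-1, type=click.Path(exists=True))
@click.argument("out_path", type=click.Path())
def foo(input_paths, out_path):
    """Receive any number of input paths followed by one output path."""
    # Placeholder body: the pandas logic from the question would go here.
    # For illustration, it only records which paths were received.
    with open(out_path, "w") as handle:
        for path in input_paths:
            handle.write(f"{path}\n")
```

With the usual `if __name__ == "__main__": foo()` guard added at the bottom of the script, the rule's command could then look like `python script_1.py {input.list_of_paths} {output}`.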