I have 3 files:
output/file_2.txt
output/file_4.txt
output/file_6.txt
Each of these files has sample names inside. For example, output/file_2.txt
looks like this; the other files follow the same logic:
sample1
sample2
sample3
Now I have a function that reads through all the files and stores the sample names in a list:
def extract_ind():
    l = []
    for n in [2,4,6]:
        with open(f'output/file_{n}.txt') as f:
            l.extend([line.strip() for line in f])
    return l
Then I have the default rule that collects everything, and a rule that extracts samples and creates new files with the sample names:
rule all:
    input:
        expand('output/file_{n}_{sample}.vcf.gz',n=[2,4,6],sample=extract_ind())

rule extract_samples:
    input:
        vcf = 'muscle/file_{n}.vcf.gz'
    output:
        vcf_dir = 'output/file_{n}_{sample}.vcf.gz'
    shell:
        'bcftools query -l {input.vcf} | xargs -I {{}} bcftools view -O z -s {{}} -o {output.vcf_dir} {input}'
The way this is designed currently, the expand in rule all will use itertools.product to generate all combinations of sample names and files. For example, file_2.txt will have results like this:
output/file_2_sample1.vcf.gz
output/file_2_sample2.vcf.gz
output/file_2_sample3.vcf.gz
output/file_2_sample4.vcf.gz -----> oops, I don't want that: `sample4` comes from `file_4`
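To make the problem concrete, here is a minimal pure-Python sketch of the cross product next to the pairing that is actually wanted. It does not call Snakemake at all; the `samples_by_n` dictionary is a hypothetical stand-in for the contents of the sample files, using only the file_2 and file_4 samples that appear in this question.

```python
from itertools import product

# Hypothetical stand-in for the sample files (file_6 is left out
# because its contents are not shown in the question).
samples_by_n = {
    2: ["sample1", "sample2", "sample3"],
    4: ["sample4", "sample6"],
}

# Flat list of every sample, like the one extract_ind() returns.
all_samples = [s for samples in samples_by_n.values() for s in samples]

# Cross product: every n combined with every sample, wanted or not.
crossed = [
    f"output/file_{n}_{s}.vcf.gz"
    for n, s in product(samples_by_n, all_samples)
]

# Desired pairing: each n only with the samples listed in its own file.
paired = [
    f"output/file_{n}_{s}.vcf.gz"
    for n, samples in samples_by_n.items()
    for s in samples
]

print(len(crossed), len(paired))
```

With these assumed lists, `crossed` contains output/file_2_sample4.vcf.gz while `paired` does not.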
To overcome this issue I naively created an input function, but I guess this is not the way to go, since input functions are used for inputs and not for target outputs, right?
def use_ind(wildcards):
    l = []
    if wildcards.n == 2:
        with open(f'output/file_2.txt') as f:
            l.extend([line.strip() for line in f])
        return l
    elif wildcards.n == 4:
        with open(f'output/file_4.txt') as f:
            l.extend([line.strip() for line in f])
        return l
    elif wildcards.n == 6:
        with open(f'output/file_6.txt') as f:
            l.extend([line.strip() for line in f])
        return l
Ultimately, I would like some logic that produces a result like this:
output/file_2_sample1.vcf.gz
output/file_2_sample2.vcf.gz
output/file_2_sample3.vcf.gz
output/file_4_sample4.vcf.gz
output/file_4_sample6.vcf.gz
etc....
That is, every file should only contain samples that are pertinent to the file itself. I hope this is clear; if not, let me know and I will try to rephrase the question.
For the following, I will assume your output/file_{n}.txt files are inputs; otherwise you will need to use checkpoints for the rule that creates them.
Don't be overwhelmed by rules and expands. In the end, the input of rule all is just a list of the files you want, and you can build that list with any valid Python. Let's say you don't know the values for n ahead of time, but you have a directory of files that have samples in them.
def use_ind(n):
    return [line.strip() for line in open(f'output/file_{n}.txt')]

n_values = glob_wildcards('output/file_{n}.txt').n
target_files = [f'output/file_{n}_{sample}.vcf.gz'
                for n in n_values
                for sample in use_ind(n)]

rule all:
    input: target_files
You should also try to have the outputs of different rules go into different directories so wildcard matching is more likely to work how you expect.
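If you want to check the list-building logic outside Snakemake, here is a rough standalone sketch. It swaps `glob_wildcards` for `pathlib` plus a regular expression so it runs as plain Python; `read_samples` and `build_targets` are hypothetical helper names, the pattern assumes n is purely numeric, and the demo files are made-up data written to a temporary directory.

```python
import re
import tempfile
from pathlib import Path


def read_samples(path):
    """Return the stripped, non-empty lines of one sample list file."""
    with open(path) as handle:
        return [line.strip() for line in handle if line.strip()]


def build_targets(sample_dir):
    """Pair each file_{n}.txt with only the samples listed inside it."""
    # Assumption: n consists of digits only.
    pattern = re.compile(r"file_(\d+)\.txt")
    targets = []
    for path in sorted(Path(sample_dir).glob("file_*.txt")):
        match = pattern.fullmatch(path.name)
        if match is None:
            continue  # skip names such as file_2_extra.txt
        n = match.group(1)
        targets.extend(
            f"output/file_{n}_{sample}.vcf.gz"
            for sample in read_samples(path)
        )
    return targets


# Demo on a throwaway directory with made-up sample lists.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "file_2.txt").write_text("sample1\nsample2\nsample3\n")
    Path(tmp, "file_4.txt").write_text("sample4\nsample6\n")
    targets = build_targets(tmp)

print(targets)
```

In a Snakefile, the resulting list would play the same role as `target_files` in the input of rule all.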