Wildcards
One of the more powerful features of Snakemake is its ability to use use and interpret wildcards. Wildcards are placeholders for data set names and come in very handy when analyzing viruses that exist in multiple lineages or have multiple segments, for example seasonal influenza viruses:
rule align:
input:
sequences="results/{virus}_{segment}_sequences.fasta",
reference="config/{virus}_{segment}_reference.gb"
output:
alignment = "results/alignment_{virus}_{segment}.fasta"
shell:
"""
augur align \
--sequences {input.sequences} \
--reference-sequence {input.reference} \
--output {output.alignment}
"""
This defines a generic rule to align sequences of a particular segment belonging to a particular viral lineage. Such generic rules can be combined with data set specific data and functionality using python syntax. If we for example wanted to specify a data set specific evolutionary rate, we could do this as follows:
def clock_rate(wildcards):
rate = {
('h3n2', 'ha'): 0.0043, ('h3n2', 'na'):0.0029,
('h1n1pdm', 'ha'): 0.0040, ('h1n1pdm', 'na'):0.0032,
('vic', 'ha'): 0.0024, ('vic', 'na'):0.0015,
('yam', 'ha'): 0.0019, ('yam', 'na'):0.0013
}
return rate.get((wildcards.virus, wildcards.segment), 0.001)
rule refine:
input:
tree = rules.tree.output.tree,
alignment = rules.align.output,
metadata = "data/{virus}_{segment}_metadata.tsv"
output:
tree = "results/tree_{virus}_{segment}.nwk",
node_data = "results/branchlengths_{virus}_{segment}.json"
params:
clock_rate = clock_rate
shell:
"""
augur refine \
--tree {input.tree} \
--alignment {input.alignment} \
--metadata {input.metadata} \
--timetree \
--clock-rate {params.clock_rate} \
--output-tree {output.tree} \
--output-node-data {output.node_data}
"""
If we now asked snakemake to produce the file results/tree_h3n2_ha.nwk
, it would determine the wildcards virus=h3n2
and segment=ha
from the filename and then use the right input files and parameters.
As clock rate input parameter takes the form of a function in this case (defined above the rule) that takes the wildcards as parameters.
Default rules
The first rule in a Snakefile serves are as the default rule that is executed when no targets are specified.
segments = ['pb2', 'pb1', 'pa', 'ha', 'np', 'mp', 'na', 'ns']
viruses = ['vic', 'yam', 'h1n1pdm', 'h3n2']
rule all_auspice_files:
input:
expand("auspice/{virus}_{segment}_tree.json", segment=segments, virus=viruses)
[...]
This rule all_auspice_files
only specifies input files. These input files are the files we want our pipeline to spit out and by making this rule the first rule in the Snakefile, it will be the default rule.
Calling snakemake
bare without specifying a target will then run all rules necessary to produce the input files of the default rule.