生信流程构建有很多方法,其中snakemake是诸多方法中比较小清新的一款流程搭建管理系统,和之前介绍的WDL《WDL(Workflow Description Language)搭建pipeline利器》有诸多相似之处,也有很多优势:

  • 轻便,基于python构建,可以和python无缝衔接
  • 支持任务投递系统,可以直接通过snakemake提交到任务系统
  • 支持断点运行
  • 支持多线程(这个功能基本的workflow管理工具都支持)

安装snakemake:


# 推荐用conda管理环境
conda create -n snakemake snakemake
# 创建好环境后,进入环境
conda activate snakemake

snakemake语法:


snakemake  = statement | rule | include | workdir
rule       = "rule" (identifier | "") ":" ruleparams
include    = "include:" stringliteral
workdir    = "workdir:" stringliteral
ni         = NEWLINE INDENT
ruleparams = [ni input] [ni output] [ni params] [ni message] [ni threads] [ni (run | shell)] NEWLINE snakemake
input      = "input" ":" parameter_list
output     = "output" ":" parameter_list
params     = "params" ":" parameter_list
log        = "log" ":" parameter_list
benchmark  = "benchmark" ":" statement
cache      = "cache" ":" bool
message    = "message" ":" stringliteral
threads    = "threads" ":" integer
resources  = "resources" ":" parameter_list
version    = "version" ":" statement
run        = "run" ":" ni statement
shell      = "shell" ":" stringliteral

snakemake常用参数


--snakefile, -s        指定Snakefile,否则是当前目录下的Snakefile
--dryrun, -n           不真正执行,一般用来检查Snakefile是否有错
--printshellcmds, -p   输出要执行的shell命令
--reason, -r           输出每条rule执行的原因,默认FALSE
--cores, --jobs, -j    指定运行的核数,若不指定,则使用最大的核数
--force, -f            重新运行第一条rule或指定的rule
--forceall, -F         重新运行所有的rule,不管是否已经有输出结果
--forcerun, -R         重新执行Snakefile,当更新了rule时候使用此命令

下面我们用一个简单的示例介绍其用法:

1)创建生信分析环境environment.yaml(推荐用conda创建和管理环境)


channels:
  - conda-forge
  - bioconda
dependencies:
  - python =3.6
  - jinja2 =2.10
  - networkx =2.1
  - matplotlib =2.2.3
  - graphviz =2.38.0
  - bcftools =1.9
  - samtools =1.9
  - bwa =0.7.17
  - pysam =0.15.0

2)构建配置文件config.yaml(配置文件格式可以是JOSN、YAML)


samples:
    A: data/samples/A.fastq
    B: data/samples/B.fastq

3)定义规则(workflow)


configfile: "config.yaml"


rule all:
    input:
        "plots/quals.svg"


rule bwa_map:  # rule关键字,定义每个pipe
    input:     # 定义输入
        "data/genome.fa",
        lambda wildcards: config["samples"][wildcards.sample]
    output:    # 定义输出
        temp("mapped_reads/{sample}.bam")
    conda:     # 选择环境
        "environment.yaml"
    params:    # 参数设置
        rg=r"@RG\tID:{sample}\tSM:{sample}"
    log:       # 日志
        "logs/bwa_mem/{sample}.log"
    threads: 8 # 线程数
    shell:     # shell命令,也可以用script来执行脚本,或者直接用run来执行python代码
        "(bwa mem -R '{params.rg}' -t {threads} {input} | "
        "samtools view -Sb - > {output}) 2> {log}"


rule samtools_sort:
    input:
        "mapped_reads/{sample}.bam"
    output:
        protected("sorted_reads/{sample}.bam")
    conda:
        "environment.yaml"
    shell:
        "samtools sort -T sorted_reads/{wildcards.sample} "
        "-O bam {input} > {output}"


rule samtools_index:
    input:
        "sorted_reads/{sample}.bam"
    output:
        "sorted_reads/{sample}.bam.bai"
    conda:
        "environment.yaml"
    shell:
        "samtools index {input}"


rule bcftools_call:
    input:
        fa="data/genome.fa",
        bam=expand("sorted_reads/{sample}.bam", sample=config["samples"]),
        bai=expand("sorted_reads/{sample}.bam.bai", sample=config["samples"])
    output:
        "calls/all.vcf"
    conda:
        "environment.yaml"
    shell:
        "samtools mpileup -g -f {input.fa} {input.bam} | "
        "bcftools call -mv - > {output}"


rule plot_quals:
    input:
        "calls/all.vcf"
    output:
        "plots/quals.svg"
    conda:
        "environment.yaml"
    script:
        "scripts/plot-quals.py"

4)执行任务

 snakemake --use-conda --snakefile snakefile

参考资料:

1.https://snakemake.readthedocs.io/