snakemake-workflow管理工具

生信流程构建有很多方法，其中snakemake是诸多方法中比较小清新的一款流程搭建管理系统，和之前介绍的WDL《WDL(Workflow Description Language)搭建pipeline利器》有诸多相似之处，也有很多优势：

轻便，基于python构建，可以和python无缝衔接
支持任务投递系统，可以直接通过snakemake提交到任务系统
支持断点运行
支持多线程（这个功能基本的workflow管理工具都支持）

安装snakemake：


# 推荐用conda管理环境
conda create -n snakemake snakemake
# 创建好环境后，进入环境
conda activate snakemake

snakemake语法：


snakemake  = statement | rule | include | workdir
rule       = "rule" (identifier | "") ":" ruleparams
include    = "include:" stringliteral
workdir    = "workdir:" stringliteral
ni         = NEWLINE INDENT
ruleparams = [ni input] [ni output] [ni params] [ni message] [ni threads] [ni (run | shell)] NEWLINE snakemake
input      = "input" ":" parameter_list
output     = "output" ":" parameter_list
params     = "params" ":" parameter_list
log        = "log" ":" parameter_list
benchmark  = "benchmark" ":" statement
cache      = "cache" ":" bool
message    = "message" ":" stringliteral
threads    = "threads" ":" integer
resources  = "resources" ":" parameter_list
version    = "version" ":" statement
run        = "run" ":" ni statement
shell      = "shell" ":" stringliteral

snakemake常用参数


--snakefile, -s        指定Snakefile，否则是当前目录下的Snakefile
--dryrun, -n           不真正执行，一般用来检查Snakefile是否有错
--printshellcmds, -p   输出要执行的shell命令
--reason, -r           输出每条rule执行的原因,默认FALSE
--cores, --jobs, -j    指定运行的核数，若不指定，则使用最大的核数
--force, -f            重新运行第一条rule或指定的rule
--forceall, -F         重新运行所有的rule，不管是否已经有输出结果
--forcerun, -R         重新执行Snakefile，当更新了rule时候使用此命令

下面我们用一个简单的示例介绍其用法：

1）创建生信分析环境environment.yaml（推荐用conda创建和管理环境）


channels:
  - conda-forge
  - bioconda
dependencies:
  - python =3.6
  - jinja2 =2.10
  - networkx =2.1
  - matplotlib =2.2.3
  - graphviz =2.38.0
  - bcftools =1.9
  - samtools =1.9
  - bwa =0.7.17
  - pysam =0.15.0

2）构建配置文件config.yaml（配置文件格式可以是JOSN、YAML）


samples:
    A: data/samples/A.fastq
    B: data/samples/B.fastq

3）定义规则（workflow）


configfile: "config.yaml"


rule all:
    input:
        "plots/quals.svg"


rule bwa_map:  # rule关键字，定义每个pipe
    input:     # 定义输入
        "data/genome.fa",
        lambda wildcards: config["samples"][wildcards.sample]
    output:    # 定义输出
        temp("mapped_reads/{sample}.bam")
    conda:     # 选择环境
        "environment.yaml"
    params:    # 参数设置
        rg=r"@RG\tID:{sample}\tSM:{sample}"
    log:       # 日志
        "logs/bwa_mem/{sample}.log"
    threads: 8 # 线程数
    shell:     # shell命令，也可以用script来执行脚本，或者直接用run来执行python代码
        "(bwa mem -R '{params.rg}' -t {threads} {input} | "
        "samtools view -Sb - > {output}) 2> {log}"


rule samtools_sort:
    input:
        "mapped_reads/{sample}.bam"
    output:
        protected("sorted_reads/{sample}.bam")
    conda:
        "environment.yaml"
    shell:
        "samtools sort -T sorted_reads/{wildcards.sample} "
        "-O bam {input} > {output}"


rule samtools_index:
    input:
        "sorted_reads/{sample}.bam"
    output:
        "sorted_reads/{sample}.bam.bai"
    conda:
        "environment.yaml"
    shell:
        "samtools index {input}"


rule bcftools_call:
    input:
        fa="data/genome.fa",
        bam=expand("sorted_reads/{sample}.bam", sample=config["samples"]),
        bai=expand("sorted_reads/{sample}.bam.bai", sample=config["samples"])
    output:
        "calls/all.vcf"
    conda:
        "environment.yaml"
    shell:
        "samtools mpileup -g -f {input.fa} {input.bam} | "
        "bcftools call -mv - > {output}"


rule plot_quals:
    input:
        "calls/all.vcf"
    output:
        "plots/quals.svg"
    conda:
        "environment.yaml"
    script:
        "scripts/plot-quals.py"

4）执行任务

 snakemake --use-conda --snakefile snakefile

参考资料：

1.https://snakemake.readthedocs.io/

阅读: 888

Omics - Hunter

R KNN（K-Nearest-Neighbors）算法

R 图片拼接之cowplot

发表回复取消回复

Omics - Hunter

snakemake-workflow管理工具

R KNN（K-Nearest-Neighbors）算法

R 图片拼接之cowplot

发表回复 取消回复

发表回复取消回复