rMATS是一款对RNA-Seq数据进行差异可变剪切分析的软件。rMATS可识别的可变剪切事件有5种:

  • skipped exon (SE)外显子跳跃
  • alternative 5′ splice site (A5SS)第一个外显子可变剪切
  • alternative 3′ splice site (A3SS)最后一个外显子可变剪切
  • mutually exclusive exons (MXE)外显子选择性跳跃
  • retained intron (RI)内含子滞留

目前最新版是rMATS 4.1.1,安装方式如下:

一些依赖环境:

  • Python (3.6.12 or 2.7.15)
    • Cython (0.29.21 or 0.29.15 for Python 2)
    • numpy (1.16.6 or 1.16.5 for Python 2)
  • BLAS, LAPACK
  • GNU Scientific Library (GSL 2.5)
  • GCC (>=5.4.0)
  • gfortran (Fortran 77)
  • CMake (3.15.4)
  • PAIRADISE (可选)
  • Samtools (建议安装,如果需要从原始fq文件开始分析)
  • STAR (建议安装,如果需要从原始fq文件开始分析)

下载源代码(https://github.com/Xinglab/rmats-turbo/releases),然后进行编译:

./build_rmats

编译完成后可以用python rmats.py {arguments}

python rmats.py -h

usage: rmats.py [options]

optional arguments:
  -h, --help            show this help message and exit
  --version             show program's version number and exit
  --gtf GTF             An annotation of genes and transcripts in GTF format
  --b1 B1               A text file containing a comma separated list of the
                        BAM files for sample_1. (Only if using BAM)
  --b2 B2               A text file containing a comma separated list of the
                        BAM files for sample_2. (Only if using BAM)
  --s1 S1               A text file containing a comma separated list of the
                        FASTQ files for sample_1. If using paired reads the
                        format is ":" to separate pairs and "," to separate
                        replicates. (Only if using fastq)
  --s2 S2               A text file containing a comma separated list of the
                        FASTQ files for sample_2. If using paired reads the
                        format is ":" to separate pairs and "," to separate
                        replicates. (Only if using fastq)
  --od OD               The directory for final output from the post step
  --tmp TMP             The directory for intermediate output such as ".rmats"
                        files from the prep step
  -t {paired,single}    Type of read used in the analysis: either "paired" for
                        paired-end data or "single" for single-end data.
                        Default: paired
  --libType {fr-unstranded,fr-firststrand,fr-secondstrand}
                        Library type. Use fr-firststrand or fr-secondstrand
                        for strand-specific data. Default: fr-unstranded
  --readLength READLENGTH
                        The length of each read
  --variable-read-length
                        Allow reads with lengths that differ from --readLength
                        to be processed. --readLength will still be used to
                        determine IncFormLen and SkipFormLen
  --anchorLength ANCHORLENGTH
                        The anchor length. Default is 1
  --tophatAnchor TOPHATANCHOR
                        The "anchor length" or "overhang length" used in the
                        aligner. At least "anchor length" NT must be mapped to
                        each end of a given junction. The default is 6. (Only
                        if using fastq)
  --bi BINDEX           The directory name of the STAR binary indices (name of
                        the directory that contains the SA file). (Only if
                        using fastq)
  --nthread NTHREAD     The number of threads. The optimal number of threads
                        should be equal to the number of CPU cores. Default: 1
  --tstat TSTAT         The number of threads for the statistical model. If
                        not set then the value of --nthread is used
  --cstat CSTAT         The cutoff splicing difference. The cutoff used in the
                        null hypothesis test for differential splicing. The
                        default is 0.0001 for 0.01% difference. Valid: 0 <=
                        cutoff < 1. Does not apply to the paired stats model
  --task {prep,post,both,inte,stat}
                        Specify which step(s) of rMATS to run. Default: both.
                        prep: preprocess BAMs and generate a .rmats file.
                        post: load .rmats file(s) into memory, detect and
                        count alternative splicing events, and calculate P
                        value (if not --statoff). both: prep + post. inte
                        (integrity): check that the BAM filenames recorded by
                        the prep task(s) match the BAM filenames for the
                        current command line. stat: run statistical test on
                        existing output files
  --statoff             Skip the statistical analysis
  --paired-stats        Use the paired stats model
  --novelSS             Enable detection of novel splice sites (unannotated
                        splice sites). Default is no detection of novel splice
                        sites
  --mil MIL             Minimum Intron Length. Only impacts --novelSS
                        behavior. Default: 50
  --mel MEL             Maximum Exon Length. Only impacts --novelSS behavior.
                        Default: 500
  --allow-clipping      Allow alignments with soft or hard clipping to be used
  --fixed-event-set FIXED_EVENT_SET
                        A directory containing fromGTF.[AS].txt files to be
                        used instead of detecting a new set of events

另外也可以用docker安装该应用:docker pull xinglab/rmats:v4.1.1

1)从fastq文件开始分析:

假设有如下数据:

  • group 1 FASTQ文件
    • /path/to/1_1.R1.fastq
    • /path/to/1_1.R2.fastq
    • /path/to/1_2.R1.fastq
    • /path/to/1_2.R2.fastq
  • group 2 FASTQ文件
    • /path/to/2_1.R1.fastq
    • /path/to/2_1.R2.fastq
    • /path/to/2_2.R1.fastq
    • /path/to/2_2.R2.fastq

将上述文件列表整理到文本中,用: 来分割双端测序文件, 用, 来分开生物/技术重复:

如:

  • /path/to/s1.txt内容如下:

/path/to/1_1.R1.fastq:/path/to/1_1.R2.fastq,/path/to/1_2.R1.fastq:/path/to/1_2.R2.fastq
  • /path/to/s2.txt内容如下:

/path/to/2_1.R1.fastq:/path/to/2_1.R2.fastq,/path/to/2_2.R1.fastq:/path/to/2_2.R2.fastq

然后执行:


# 执行命令
python rmats.py 
    --s1 /path/to/s1.txt            # 输入sample1的txt格式的文件
    --s2 /path/to/s2.txt 
    --gtf /path/to/the.gtf          # gtf文件
    --bi /path/to/STAR_binary_index # STAR索引文件
    -t paired                       # 双端测序则paired或单端测序single
    --readLength 50                 # 测序读长
    --nthread 4                     # 分析线程数
    --od /path/to/output            # 输出文件的路径
    --tmp /path/to/tmp_output       # 缓存文件路径

2)从bam文件开始分析(推荐用STAR比对后的bam文件):

假设有如下数据:

  • group 1 BAM文件
    • /path/to/1_1.bam
    • /path/to/1_2.bam
  • group 2 BAM文件
    • /path/to/2_1.bam
    • /path/to/2_2.bam

将上述文件整理到文本中,用,分割:

  • /path/to/b1.txt
/path/to/1_1.bam,/path/to/1_2.bam
  • /path/to/b2.txt
/path/to/2_1.bam,/path/to/2_2.bam

然后执行命令:


# 执行命令
python rmats.py 
    --s1 /path/to/s1.txt            # 输入sample1的txt格式的文件
    --s2 /path/to/s2.txt 
    --gtf /path/to/the.gtf          # gtf文件
    -t paired                       # 双端测序则paired或单端测序single
    --readLength 50                 # 测序读长
    --nthread 4                     # 分析线程数
    --od /path/to/output            # 输出文件的路径
    --tmp /path/to/tmp_output       # 缓存文件路径

rMATS的结果文件是以各个可变剪切事件分别输出的,主要由[AS_Event].MATS.JC.txt,[AS_Event].MATS.JCEC.txt,fromGTF.[AS_Event].txt,JC.raw.input.[AS_Event].txt,JCEC.raw.input.[AS_Event].txt这几类,详细介绍以及文件内容介绍见:https://github.com/Xinglab/rmats-turbo/blob/v4.1.1/README.md#output

分析完成后,我们可以用rmats2sashimiplot(https://github.com/Xinglab/rmats2sashimiplot)进行数据的可视化,如我们对上述的结果文件进行可视化:


python rmats2sashimiplot.py 
--s1 ./rmats2sashimiplot_test_data/sample_1_replicate_1.sam,./rmats2sashimiplot_test_data/sample_1_replicate_2.sam,./rmats2sashimiplot_test_data/sample_1_replicate_3.sam 
--s2 ./rmats2sashimiplot_test_data/sample_2_replicate_1.sam,./rmats2sashimiplot_test_data/sample_2_replicate_2.sam,./rmats2sashimiplot_test_data/sample_2_replicate_3.sam 
-t SE 
-e ./rmats2sashimiplot_test_data/SE.MATS.JC.txt 
--l1 SampleOne 
--l2 SampleTwo 
--exon_s 1 
--intron_s 5 
-o test_events_output

参考资料:

1.http://rnaseq-mats.sourceforge.net/

2.https://github.com/Xinglab/rmats-turbo/blob/v4.1.1/README.md