RNA-seq序列比对工具-HISAT2

hisat2是一款短序列比对的工具，主要用于RNA-seq数据的比对，hisat2使用改进的BWT算法，实现了更快的速度和更少的资源占用。下面简单介绍如何使用hisat2进行转录组的数据比对。

1）下载和安装

Source	https://cloud.biohpc.swmed.edu/index.php/s/fE9QCsX3NH4QwBi/download
OSX_x86_64	https://cloud.biohpc.swmed.edu/index.php/s/zMgEtnF6LjnjFrr/download
Linux_x86_64	https://cloud.biohpc.swmed.edu/index.php/s/oTtGWbWjaxsQ2Ho/download

2）建立索引

和其他的比对软件一样，hisat2也需要建立自己的索引，我们可以下载官方已经建立好的索引（如H. sapiens索引）：

GRCh38

genome	https://genome-idx.s3.amazonaws.com/hisat/grch38_genome.tar.gz
genome_snp	https://genome-idx.s3.amazonaws.com/hisat/grch38_snp.tar.gz
genome_tran	https://genome-idx.s3.amazonaws.com/hisat/grch38_tran.tar.gz
genome_snp_tran	https://genome-idx.s3.amazonaws.com/hisat/grch38_snptran.tar.gz

UCSC hg38

genome	https://genome-idx.s3.amazonaws.com/hisat/hg38_genome.tar.gz
genome_tran	https://genome-idx.s3.amazonaws.com/hisat/hg38_tran.tar.gz

GRCh37

genome	https://genome-idx.s3.amazonaws.com/hisat/grch37_genome.tar.gz
genome_snp	https://genome-idx.s3.amazonaws.com/hisat/grch37_snp.tar.gz
genome_tran	https://genome-idx.s3.amazonaws.com/hisat/grch37_tran.tar.gz
genome_snp_tran	https://genome-idx.s3.amazonaws.com/hisat/grch37_snptran.tar.gz

UCSC hg19

genome

https://genome-idx.s3.amazonaws.com/hisat/hg19_genome.tar.gz

更多物种索引下载：https://daehwankimlab.github.io/hisat2/download/#index

也可以自己手动构建索引，以构建人的索引为例：


# 下载基因组序列
wget ftp://ftp.ensembl.org/pub/release-84/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
gzip -d Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
mv Homo_sapiens.GRCh38.dna.primary_assembly.fa genome.fa

# 下载注释文件，HFM索引可跳过
wget ftp://ftp.ensembl.org/pub/release-84/gtf/homo_sapiens/Homo_sapiens.GRCh38.84.gtf.gz  
gzip -d Homo_sapiens.GRCh38.84.gtf.gz
mv Homo_sapiens.GRCh38.84.gtf genome.gtf
hisat2_extract_splice_sites.py genome.gtf > genome.ss
hisat2_extract_exons.py genome.gtf > genome.exon

# 下载SNP数据，HFM索引可跳过
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg38/database/snp144Common.txt.gz
gzip -d snp144Common.txt.gz
# 转换UCSC Database 到 Ensembl的注释
awk 'BEGIN{OFS="\t"} {if($2 ~ /^chr/) {$2 = substr($2, 4)}; if($2 == "M") {$2 = "MT"} print}' snp144Common.txt > snp144Common.txt.ensembl
# 生成SNP和单倍型文件
hisat2_extract_snps_haplotypes_UCSC.py genome.fa snp144Common.txt.ensembl genome

# 建立建立HFM索引（至少需要6GB）
hisat2-build -p 16 genome.fa genome

# 建立含SNP的HGFM索引
hisat2-build -p 16 --snp genome.snp --haplotype genome.haplotype genome.fa genome_snp

# 建立含转录本的HGFM索引（至少需要160G内存）
hisat2-build -p 16 --exon genome.exon --ss genome.ss genome.fa genome_tran

# 建立含转录本和SNPs的HGFM索引（至少需要160G内存）
hisat2-build -p 16 --snp genome.snp --haplotype genome.haplotype --exon genome.exon --ss genome.ss genome.fa genome_snp_tran

3）序列比对

索引建立完成后，运行hisat2进行序列比对：


hisat2 [options]* -x <hisat2-idx> {-1 <m1> -2 <m2> | -U <r> | --sra-acc <SRA accession number>} [-S <hit>]

主要参数：
-x <hisat2-idx>
参考基因组索引文件的前缀。
-1 <m1>
双端测序结果的第一个文件。若有多组数据，使用逗号将文件分隔。Reads的长度可以不一致。
-2 <m2>
双端测序结果的第二个文件。若有多组数据，使用逗号将文件分隔，并且文件顺序要和-1参数对应。Reads的长度可以不一致。
-U <r>
单端数据文件。若有多组数据，使用逗号将文件分隔。可以和-1、-2参数同时使用。Reads的长度可以不一致。
–sra-acc <SRA accession number>
输入SRA登录号，多组数据之间使用逗号分隔。HISAT将自动下载并识别数据类型，进行比对。
-S <hit>
指定输出的SAM文件。
输入选项：
-q
输入文件为FASTQ格式。FASTQ格式为默认参数。
-qseq
输入文件为QSEQ格式。
-f
输入文件为FASTA格式。
-r
输入文件中，每一行代表一条序列，没有序列名和测序质量等。选择此项时，–ignore-quals参数也会被选择。
-c
此参数后是直接比对的序列，而不是包含序列的文件名。序列间用逗号隔开。选择此项时，–ignore-quals参数也会被选择。
-s/–skip <int>
跳过输入文件中前条序列进行比对。
-u/–qupto <int>
只使用输入文件中前条序列进行比对，默认是没有限制。
-5/–trim5 <int>
比对前去除每条序列5’端个碱基
-3/–trim3 <int>
比对前去除每条序列3’端个碱基
–phred33
输入的FASTQ文件碱基质量值编码标准为phred33，phred33为默认参数。
–phred64
输入的FASTQ文件碱基质量值编码标准为phred64。

hisat2 -p 16 -x ./grch38_tran/genome_tran -1 sample_1.fastq -2 sample_2.fastq –S sample.sam

运行后结果显示内容如下所示，展示本次比对的情况：


10000 reads; of these:
  10000 (100.00%) were paired; of these:
    650 (6.50%) aligned concordantly 0 times
    8823 (88.23%) aligned concordantly exactly 1 time
    527 (5.27%) aligned concordantly >1 times
    ----
    650 pairs aligned concordantly 0 times; of these:
      34 (5.23%) aligned discordantly 1 time
    ----
    616 pairs aligned 0 times concordantly or discordantly; of these:
      1232 mates make up the pairs; of these:
        660 (53.57%) aligned 0 times
        571 (46.35%) aligned exactly 1 time
        1 (0.08%) aligned >1 times
96.70% overall alignment rate

4）分析组合

目前主流的分析组合如下：

Hisat2 + stringtie + ballgown/edgeR /DESeq2

Hisat2 + HTSeq + edgeR /DESeq2

参考资料：

1.http://daehwankimlab.github.io/hisat2/manual/

2.Kim, D., Paggi, J.M., Park, C. et al. Graph-based genome alignment and genotyping with HISAT2 and HISAT-genotype. Nat Biotechnol 37, 907–915 (2019). https://doi.org/10.1038/s41587-019-0201-4

阅读: 4,229

Omics - Hunter

RNA-seq转录本组装和定量软件-stringtie

RNA-seq序列比对工具-STAR

发表回复取消回复

Omics - Hunter

RNA-seq序列比对工具-HISAT2

RNA-seq转录本组装和定量软件-stringtie

RNA-seq序列比对工具-STAR

发表回复 取消回复

发表回复取消回复