Parallel transcriptome data analysis with htseq-count

htseq-count used to be widely used in transcriptome data analysis. After tools such as featureCounts appeared, its popularity gradually declined, and the main reason is simple: it is slow.

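Earlier versions of htseq-count could not work in parallel, which made processing SAM files and counting reads much slower. Many outdated htseq-count tutorials are still circulating online, but the latest release supports parallel processing. This post is a brief summary of the new htseq-count.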

New features in the latest version (0.12.3):

Negative indices for StepVector (thanks to shouldsee for the original PR).
htseq-count-barcodes counts features in barcoded SAM/BAM files, e.g. 10X Genomics single cell outputs. It supports cell barcodes, which result in different columns of the output count table, and unique molecular identifiers.
htseq-count has new option -n for multicore parallel processing.
htseq-count has new option -d for separating output columns by an arbitrary character (default TAB; "," is also common).
htseq-count has new option -c for output into a file instead of stdout.
htseq-count has new option --append-output for output into a file by appending to any existing text (e.g. a header with the feature attribute names and sample names).
htseq-count has two new values for option --nonunique, namely fraction, which will count an N-multimapper as 1/N for each feature, and random, which will assign the alignment to a random one of its N-multimapped features. This feature was added by ewallace (thank you!).
htseq-qa got refactored and now accepts an option --primary-only which ignores non-primary alignments in SAM/BAM files. This means that the final number of alignments scored is equal to the number of reads even when multimapped reads are present.
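To make the --nonunique values concrete, here is a minimal Python sketch of how one read that is compatible with several features could be turned into count increments. It is an illustration only, not HTSeq's actual code: the function name assign_counts and the gene names are hypothetical, and the special counters that htseq-count also reports (such as the ambiguous-read tally) are left out.

```python
import random


def assign_counts(features, mode="none", rng=None):
    """Return {feature: increment} for ONE read compatible with `features`.

    Hypothetical helper for illustration; it only mimics the behaviour
    described in the release notes for --nonunique.
    """
    features = sorted(set(features))
    if len(features) <= 1:
        # Zero or one feature: every mode behaves the same way.
        return {f: 1.0 for f in features}
    if mode == "none":
        # The read is not credited to any feature.
        return {}
    if mode == "all":
        # The read is credited in full to every feature.
        return {f: 1.0 for f in features}
    if mode == "fraction":
        # An N-multimapper counts 1/N for each of its N features.
        return {f: 1.0 / len(features) for f in features}
    if mode == "random":
        # The read goes to one feature chosen uniformly at random.
        rng = rng or random.Random()
        return {rng.choice(features): 1.0}
    raise ValueError(f"unknown --nonunique mode: {mode}")


if __name__ == "__main__":
    for mode in ("none", "all", "fraction", "random"):
        print(mode, assign_counts(["geneA", "geneB"], mode))
```

With fraction, the two genes in the example each receive 0.5; with random, exactly one of them receives 1.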

1) Quantification requires two files: a BAM/SAM alignment file (for paired-end data, a sorted BAM file is normally used) and a GFF/GTF annotation file.

2) htseq-count has three built-in counting modes: union (default), intersection-strict, and intersection-nonempty.
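The three modes are easier to compare in code. The sketch below follows the description in the htseq-count documentation: for every position i of a read, S(i) is the set of features overlapping that position, and the mode decides how the S(i) are combined. The function names and gene names are hypothetical, and the sketch assumes the default --nonunique none.

```python
def resolve_feature_set(per_position_sets, mode="union"):
    """Combine the per-position feature sets S(i) of one read into one set."""
    sets = [set(s) for s in per_position_sets]
    if mode == "union":
        # Every feature touched by any position of the read.
        return set().union(*sets)
    if mode == "intersection-strict":
        # Features that cover every position of the read.
        return set.intersection(*sets) if sets else set()
    if mode == "intersection-nonempty":
        # Like strict, but positions with no feature at all are skipped.
        nonempty = [s for s in sets if s]
        return set.intersection(*nonempty) if nonempty else set()
    raise ValueError(f"unknown mode: {mode}")


def classify(per_position_sets, mode="union"):
    """Return the feature a read is counted for, or a special category."""
    features = resolve_feature_set(per_position_sets, mode)
    if not features:
        return "__no_feature"
    if len(features) > 1:
        return "__ambiguous"
    return next(iter(features))


if __name__ == "__main__":
    # A read that hangs over the end of gene_A (last position has no feature).
    overhang = [{"gene_A"}, {"gene_A"}, set()]
    # A read inside gene_A that partly overlaps gene_B as well.
    overlap = [{"gene_A"}, {"gene_A", "gene_B"}]
    for mode in ("union", "intersection-strict", "intersection-nonempty"):
        print(mode, classify(overhang, mode), classify(overlap, mode))
```

The overhanging read is counted for gene_A under union and intersection-nonempty but becomes __no_feature under intersection-strict; the overlapping read is __ambiguous under union and counted for gene_A under both intersection modes.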

3) Parameters:

-f <format>, --format=<format>: format of the input data, sam or bam.

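-r <order>, --order=<order>: how the SAM/BAM file is sorted. The value is name or pos: name means sorted by read name, pos means sorted by alignment position on the reference genome. With paired-end data in a pos-sorted file, the two mates of a pair are usually not on adjacent lines, so the program keeps the first mate's alignment in memory until it reads the alignment of the other mate. Choosing pos can therefore use considerably more memory; it is also suitable for unsorted SAM/BAM files. With name, the program assumes that the two mates of a pair sit on adjacent lines; this setting also suits single-end data. Many other expression quantification tools require pos-sorted input, but HTSeq recommends name sorting, and most aligners also write their output grouped by read name by default.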

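--max-reads-in-buffer=<number>: default 30000000. The number of reads allowed to stay in memory until their mates are found (raising this number uses more memory). It has no effect for single-end data or for paired-end data sorted by name.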

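-s <yes/no/reverse>, --stranded=<yes/no/reverse>: default yes. Whether the data come from a strand-specific protocol. The value is yes, no, or reverse. no means the data are not strand-specific. For single-end data, yes means the read maps to the sense strand of the gene; for paired-end data, yes means read 1 maps to the sense strand of the gene and read 2 to the antisense strand. reverse means the opposite of yes. As I understand the documentation, reverse is usually the right value for paired-end strand-specific data (I have not tested this option myself).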

-a <minaqual>, --minaqual=<minaqual>: default 10. Alignments with a mapping quality below this value are skipped. Before version 0.5.4 the default was 0.

-t <feature type>, --type=<feature type>: default exon. Only features of this type (third column of the GTF/GFF file) are counted; every other feature type in the file is ignored.

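-i <id attribute>, --idattr=<id attribute>: default gene_id. Selects which attribute in column 9 of the GTF/GFF file is used as the feature ID. Lines that share the same feature ID are treated as parts of one feature, and their counts are summed and reported under that feature ID.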

--additional-attr=<id attributes>: default none. Additional attributes, printed as extra columns after the primary attribute column and before the count columns.

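-m <mode>, --mode=<mode>: default union. Sets the counting mode; the value is union, intersection-strict, or intersection-nonempty. For how the three modes differ, see the diagram in the htseq-count documentation (reference 1). Judging from that diagram, intersection-strict is recommended for prokaryotes and union for eukaryotes.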

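--nonunique=<nonunique mode>: default none. How to handle reads that align to or overlap more than one feature. The value is none or all, or one of the two values added in the new version, fraction and random.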

--secondary-alignments=<mode>: default score. How to handle secondary alignments (SAM flag 0x100); the value is score or ignore.

--supplementary-alignments=<mode>: default score. How to handle supplementary/chimeric alignments (SAM flag 0x800); the value is score or ignore.

-o <samout>, --samout=<samout>: write a SAM file in which each alignment record carries an extra XF tag indicating which feature the read was assigned to.

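-n <n>, --nprocesses=<n>: default 1. The number of parallel CPU processes. According to the htseq-count documentation, the parallelism is across input files: each file is still handled by a single CPU, so the option pays off when several SAM/BAM files are counted in one run.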

-p <samout_format>, --samout-format=<samout_format>: format of the file written by -o/--samout, sam or bam.

-q, --quiet: do not print progress reports and warnings.

-h, --help: print the help message.

--version: print the software version.
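Putting the options together, a run over several BAM files might be assembled as below. This is a sketch under assumptions: the file names (sample1.bam, sample2.bam, genes.gtf, counts.tsv) are hypothetical, -s no is only a placeholder that has to match your library protocol, and the helper build_htseq_command is not part of HTSeq. The command is only executed when htseq-count is installed and the input files exist.

```python
import os
import shlex
import shutil
import subprocess


def build_htseq_command(bam_files, gtf, out_file, nprocesses=2):
    """Assemble an htseq-count argument list (hypothetical helper)."""
    cmd = [
        "htseq-count",
        "-f", "bam",            # input format
        "-r", "name",           # input is sorted by read name
        "-s", "no",             # placeholder: set to match your library
        "-t", "exon",           # feature type to count
        "-i", "gene_id",        # attribute used as feature ID
        "-m", "union",          # counting mode
        "-n", str(nprocesses),  # parallel processes
        "-c", out_file,         # write counts to a file instead of stdout
    ]
    cmd.extend(bam_files)       # one or more alignment files
    cmd.append(gtf)             # the annotation file comes last
    return cmd


bams = ["sample1.bam", "sample2.bam"]
gtf = "genes.gtf"
cmd = build_htseq_command(bams, gtf, "counts.tsv", nprocesses=2)
print(shlex.join(cmd))

# Run only when the tool and all input files are actually available.
if shutil.which("htseq-count") and all(os.path.exists(p) for p in bams + [gtf]):
    subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems with unusual file names; shlex.join is used only to print a copy-pasteable command line.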

References:

1. https://htseq.readthedocs.io/en/master/count.html

2. https://github.com/htseq/htseq/releases


Omics - Hunter
