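The previous article, "Bioperl: an open-source life science toolkit", outlined the overall BioPerl landscape. Bio::SeqIO is one of its submodules: it extracts the required information according to the type of the input file, while the related module Bio::Seq stores that information in a structured form. Both come up routinely when handling common bioinformatics file formats.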
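To make the role of Bio::Seq more concrete, here is a minimal sketch that builds a sequence object by hand and queries it. The ID, description, and bases are made-up demo values, not data from this article; the accessors used (`display_id`, `desc`, `length`, `revcom`) are standard Bio::Seq methods.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Bio::Seq;

# Build a sequence object by hand.
# All values below are hypothetical demo data.
my $seq = Bio::Seq->new(
    -display_id => 'demo_seq1',
    -desc       => 'hypothetical example sequence',
    -alphabet   => 'dna',
    -seq        => 'ATGGCGTTAGC',
);

# Query the stored information through accessor methods
print "ID: ",      $seq->display_id,  "\n";
print "Desc: ",    $seq->desc,        "\n";
print "Length: ",  $seq->length,      "\n";
# revcom returns a new sequence object holding the reverse complement
print "RevComp: ", $seq->revcom->seq, "\n";
```

Objects returned by `Bio::SeqIO`'s `next_seq` expose the same accessors, so code written against a hand-built object should carry over to parsed records.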
Below we use Bio::SeqIO to quickly parse a FASTA file:
#!/usr/bin/perl -w
use warnings;
use strict;
use autodie;
# Load the key module; install with `cpan Bio::SeqIO`
use Bio::SeqIO;
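# Load the demo sequence file
# File modes accepted by -file:
# 'file'   open the file read-only
# '>file'  write to the file
# '>>file' append to the file
# '+<file' open the file for both reading and writing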
my $in = Bio::SeqIO->new(-file => "demo.fasta" ,
-format => 'Fasta');
# Read the sequences one record at a time
while ( my $seq = $in->next_seq() ) {
# Get the ID
print "ID: ", $seq->display_id, "\n";
# Accession number
print "Accession number: ", $seq->accession_number(), "\n";
# Check whether the sequence is DNA, RNA, or protein
print "Is DNA? ", ($seq->alphabet eq 'dna') ? "True":"False", "\n";
print "Is Protein? ", ($seq->alphabet eq 'protein') ? "True":"False", "\n";
# Get the sequence
print "Seq > \n", $seq->seq(), "\n";
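# Get species information (plain FASTA records usually carry none, so guard against undef)
my $species = $seq->species();
print "Species: ", ($species ? $species->binomial : "N/A"), "\n";
# Get the other sequence features (print their count rather than raw object references)
my @features = $seq->get_all_SeqFeatures;
print "SeqFeatures: ", scalar(@features), "\n";
# For demonstration only: exit the loop after the first record; remove `last` in real use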
last;
}

Other applications include file format conversion and sequence extraction. The currently supported formats are:

| Name | Description | File extensions |
|---|---|---|
| abi | ABI tracefile | ab[i1] |
| ace | Ace database | ace |
| agave | AGAVE XML | |
| alf | ALF tracefile | alf |
| asciitree | write-only, to visualize features | |
| bsml | BSML, using XML::DOM | bsm,bsml |
| bsml_sax | BSML, using XML::SAX | |
| chadoxml | CHADO sequence format | |
| chaos | CHAOS sequence format | |
| chaosxml | Chaos XML | |
| ctf | CTF tracefile | ctf |
| embl | EMBL database | embl,ebl,emb,dat |
| entrezgene | Entrez Gene ASN1 | |
| excel | Excel | |
| exp | Staden EXP format | exp |
| fasta | FASTA | fasta,fast,seq,fa,fsa,nt,aa |
| fastq | quality score data in FASTA-like format | fastq |
| flybase_chadoxml | variant of Chado XML | |
| game | GAME XML | |
| gcg | GCG | gcg |
| genbank | GenBank | gb |
| interpro | InterProScan XML | |
| kegg | KEGG | |
| largefasta | Large files, fasta format | |
| lasergene | Lasergene format | |
| locuslink | LocusLink | |
| metafasta | | |
| phd | Phred | phd,phred |
| pir | PIR database | pir |
| pln | PLN tracefile | pln |
| qual | Phred | |
| raw | plain text | txt |
| scf | Standard Chromatogram Format | scf |
| seqxml | SeqXML sequence format | xml |
| strider | DNA Strider format | |
| swiss | SwissProt | swiss,sp |
| tab | tab-delimited | |
| table | Table | |
| tigr | TIGR XML | |
| tigrxml | TIGR Coordset XML | |
| tinyseq | NCBI TinySeq XML | |
| ztr | ZTR tracefile | ztr |
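
The two applications mentioned above, format conversion and sequence extraction, can be sketched with a single helper. This is a minimal illustration, not a definitive implementation: `copy_seqs`, the file names, and the IDs `seq1`/`seq2` are hypothetical names introduced here. The script writes its own small demo input so that it runs on its own. It relies only on `next_seq`, `write_seq`, and the `'>file'` write mode shown in the script earlier.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Bio::SeqIO;

# Write a tiny demo FASTA file (made-up sequences) so the sketch is self-contained
my $demo = 'seqio_demo.fasta';
open my $fh, '>', $demo or die "Cannot write $demo: $!";
print {$fh} ">seq1 first demo record\nATGGCGTTAGC\n",
            ">seq2 second demo record\nTTGACCGTAA\n";
close $fh;

# Hypothetical helper: copy records from one file to another, optionally
# changing the format and keeping only the IDs listed in %$wanted
# (an undefined $wanted keeps every record). Returns the number written.
sub copy_seqs {
    my ($in_file, $in_format, $out_file, $out_format, $wanted) = @_;
    my $in  = Bio::SeqIO->new(-file => $in_file,     -format => $in_format);
    my $out = Bio::SeqIO->new(-file => ">$out_file", -format => $out_format);
    my $count = 0;
    while ( my $seq = $in->next_seq ) {
        next if $wanted && !$wanted->{ $seq->display_id };
        $out->write_seq($seq);
        $count++;
    }
    $out->close;
    return $count;
}

# Format conversion: FASTA -> GenBank
my $n = copy_seqs($demo, 'fasta', 'seqio_demo.gb', 'genbank');
print "Converted $n record(s)\n";

# Sequence extraction: keep only the record whose ID is 'seq1'
my $m = copy_seqs($demo, 'fasta', 'seqio_demo_subset.fasta', 'fasta', { seq1 => 1 });
print "Extracted $m record(s)\n";
```

Because reading and writing go through the same `Bio::Seq` objects, switching the target format should only require changing the `-format` value to another name from the table, provided that format supports writing.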
References:
1. https://metacpan.org/pod/Bio::SeqIO
2. https://bioperl.org/howtos/SeqIO_HOWTO.html
