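An earlier article, "Bioperl: An Open-Source Life Science Toolkit", gave an overview of BioPerl as a whole. Bio::SeqIO is one of its submodules: it parses an input file according to its format and extracts the information you need. The related module Bio::Seq holds the parsed data in a structured object. Both come up regularly when working with common bioinformatics file formats.
Below we use Bio::SeqIO to quickly parse a FASTA file. An example: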
#!/usr/bin/perl
use warnings;
use strict;
use autodie;
# Load the key package; install with `cpan Bio::SeqIO`
use Bio::SeqIO;
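# Load the demo sequence file
# File modes accepted by -file:
# 'file'   # open the file read-only
# '>file'  # write to the file
# '>>file' # append to the file
# '+<file' # open the file for both reading and writing
my $in = Bio::SeqIO->new(-file   => "demo.fasta",
                         -format => 'Fasta');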
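# Read the sequences one record at a time
while ( my $seq = $in->next_seq() ) {
    # Get the ID
    print "ID: ", $seq->display_id, "\n";
    # Accession number
    print "Accession number: ", $seq->accession_number(), "\n";
    # Check the molecule type (alphabet is 'dna', 'rna', or 'protein')
    print "Is DNA? ", ($seq->alphabet eq 'dna') ? "True":"False", "\n";
    print "Is Protein? ", ($seq->alphabet eq 'protein') ? "True":"False", "\n";
    # Get the sequence
    print "Seq > \n", $seq->seq(), "\n";
    # Get the species; this is a Bio::Species object and is usually
    # undefined for plain FASTA, so guard before printing
    my $species = $seq->species();
    print "Species: ", ($species ? $species->binomial : "N/A"), "\n";
    # Get other annotations; get_all_SeqFeatures returns a list of
    # feature objects, so print how many there are
    my @features = $seq->get_all_SeqFeatures;
    print "SeqFeatures: ", scalar(@features), "\n";
    # For demonstration only: stop after the first record.
    # Remove `last` when you need to process the whole file.
    last;
}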
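Going the other way, Bio::Seq and the `'>file'` write mode from the list above can be combined to produce a FASTA file. The following is a minimal sketch, not a definitive recipe: the ID, description, sequence, and the output name `demo_out.fasta` are all made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Bio::Seq;
use Bio::SeqIO;

# Hypothetical record: ID, description, and sequence are illustrative only.
my $seq = Bio::Seq->new(
    -display_id => 'demo_seq_1',
    -desc       => 'example record',
    -alphabet   => 'dna',
    -seq        => 'ATGGCGTACGTTAGC',
);

# A leading '>' in -file opens the file for writing.
my $out = Bio::SeqIO->new(-file => '>demo_out.fasta', -format => 'fasta');

# write_seq serializes the Bio::Seq object in the chosen format.
$out->write_seq($seq);
$out->close;
```

Use `'>>demo_out.fasta'` instead if the record should be appended to an existing file rather than overwrite it.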
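Other uses include file format conversion, sequence extraction, and so on. The currently supported formats are:
Name | Description | File extensions |
---|---|---|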
abi | ABI tracefile | ab[i1] |
ace | Ace database | ace |
agave | AGAVE XML | |
alf | ALF tracefile | alf |
asciitree | write-only, to visualize features | |
bsml | BSML, using XML::DOM | bsm,bsml |
bsml_sax | BSML, using XML::SAX | |
chadoxml | CHADO sequence format | |
chaos | CHAOS sequence format | |
chaosxml | Chaos XML | |
ctf | CTF tracefile | ctf |
embl | EMBL database | embl,ebl,emb,dat |
entrezgene | Entrez Gene ASN1 | |
excel | Excel | |
exp | Staden EXP format | exp |
fasta | FASTA | fasta,fast,seq,fa,fsa,nt,aa |
fastq | quality score data in FASTA-like format | fastq |
flybase_chadoxml | variant of Chado XML | |
game | GAME XML | |
gcg | GCG | gcg |
genbank | GenBank | gb |
interpro | InterProScan XML | |
kegg | KEGG | |
largefasta | Large files, fasta format | |
lasergene | Lasergene format | |
locuslink | LocusLink | |
metafasta | | |
phd | Phred | phd,phred |
pir | PIR database | pir |
pln | PLN tracefile | pln |
qual | Phred | |
raw | plain text | txt |
scf | Standard Chromatogram Format | scf |
seqxml | SeqXML sequence format | xml |
strider | DNA Strider format | |
swiss | SwissProt | swiss,sp |
tab | tab-delimited | |
table | Table | |
tigr | TIGR XML | |
tigrxml | TIGR Coordset XML | |
tinyseq | NCBI TinySeq XML | |
ztr | ZTR tracefile | ztr |
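
Format conversion and sequence extraction both follow the same pattern: read records with one Bio::SeqIO object and write them with another. Here is a minimal sketch, assuming a GenBank input named `demo.gb` and two record IDs (`seq_A`, `seq_B`); these names are hypothetical and only serve the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Bio::SeqIO;

# Hypothetical file names, chosen only for illustration.
my $in  = Bio::SeqIO->new(-file => 'demo.gb',          -format => 'genbank');
my $out = Bio::SeqIO->new(-file => '>converted.fasta', -format => 'fasta');

# Optional filter: keep only records whose display ID is listed here.
# Leave the list empty to convert every record.
my %wanted = map { $_ => 1 } qw(seq_A seq_B);

while ( my $seq = $in->next_seq ) {
    # Skip records that are not in the filter (when a filter is set).
    next if %wanted && !$wanted{ $seq->display_id };
    $out->write_seq($seq);
}
```

Changing the two `-format` values to other names from the table switches the conversion direction. Some formats are write-only or need extra modules, so check the documentation for the pair you plan to use.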
References:
1. https://metacpan.org/pod/Bio::SeqIO
2. https://bioperl.org/howtos/SeqIO_HOWTO.html