Installing Augustus and Its Usage Parameters

AUGUSTUS is a program that predicts genes in eukaryotic genomic sequences.

1. Installing Augustus

Download Augustus: http://bioinf.uni-greifswald.de/augustus/binaries/

$ wget http://bioinf.uni-greifswald.de/augustus/binaries/augustus.2.7.tar.gz
$ tar zxf augustus.2.7.tar.gz
$ cd augustus.2.7
$ cd src
$ make -j 8
$ export AUGUSTUS_CONFIG_PATH=$PWD/../config/   # can be added to ~/.bashrc (use an absolute path there)
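
To keep the setting across shell sessions, the export line can be appended to ~/.bashrc. The following is a minimal sketch and not part of Augustus: persist_config_path is a hypothetical helper name, and the path in the usage comment is only an example. It appends the line only when it is not already present, so running it twice does not duplicate the entry.

```shell
# Append an AUGUSTUS_CONFIG_PATH export to a shell startup file, at most once.
# persist_config_path is a hypothetical helper, not something Augustus ships.
persist_config_path() {
    local config_dir=$1
    local rc_file=${2:-$HOME/.bashrc}
    local line="export AUGUSTUS_CONFIG_PATH=$config_dir"
    # -x matches the whole line, -F treats the pattern as a fixed string.
    if ! grep -qxF "$line" "$rc_file" 2>/dev/null; then
        printf '%s\n' "$line" >> "$rc_file"
    fi
}

# Example call (the path is an assumption; use your own absolute path):
# persist_config_path "$HOME/augustus.2.7/config/"
```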

2. Using Augustus

2.1 Gene prediction examples

$ augustus --strand=both --genemodel=partial --singlestrand=false \
    --hintsfile=hints.gff --extrinsicCfgFile=extrinsic.cfg \
    --protein=on --introns=on --start=on --stop=on --cds=on --codingseq=on \
    --alternatives-from-evidence=true --gff3=on --UTR=on \
    --outfile=out.gff --species=human genome.fa
$ augustus --noprediction=true --species=SPECIES sequences.gb

2.2 Augustus parameters

Usage:

augustus [parameters] --species=SPECIES queryfilename

Important parameters:

--strand=both, --strand=forward or --strand=backward
    report predicted genes on both strands, just the forward or just the
    backward strand. Default is 'both'.

--genemodel=partial, --genemodel=intronless, --genemodel=complete, 
--genemodel=atleastone or --genemodel=exactlyone
    partial : allow prediction of incomplete genes at the sequence boundaries (default)
    intronless : only predict single-exon genes like in prokaryotes and some eukaryotes
    complete : only predict complete genes
    atleastone : predict at least one complete gene
    exactlyone : predict exactly one complete gene

--singlestrand=true
    predict genes independently on each strand, allow overlapping genes on
    opposite strands. This option is turned off by default.

--hintsfile=hintsfilename
    When this option is used the prediction considering hints (extrinsic
    information) is turned on. hintsfilename contains the hints in gff format.
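
As a rough illustration of what a hints file can look like, the sketch below writes two hint lines in the nine-column, tab-separated GFF layout and runs a basic sanity check on them. The sequence name, coordinates, second column and file name are invented for illustration, and check_hints is a hypothetical helper. The source=M and source=E attributes follow the convention described in the AUGUSTUS hints documentation, where the letter is expected to match a source declared in the extrinsic configuration file; treat the exact attribute set as an assumption to verify against your Augustus version.

```shell
# Write a tiny illustrative hints file (tab-separated, 9 columns per line).
# All names and coordinates below are made up for illustration.
hints_file=hints.example.gff
{
    printf 'chr1\tmanual\texonpart\t500\t506\t.\t-\t.\tsource=M\n'
    printf 'chr1\test\tintron\t1018\t1200\t.\t+\t.\tsource=E\n'
} > "$hints_file"

# check_hints (hypothetical helper): fail if a line does not have exactly
# 9 columns or if its end coordinate is smaller than its start coordinate.
check_hints() {
    awk -F '\t' '
        NF != 9            { printf "line %d: expected 9 columns, got %d\n", NR, NF; bad = 1 }
        NF == 9 && $5 < $4 { printf "line %d: end < start\n", NR; bad = 1 }
        END                { exit bad + 0 }
    ' "$1"
}

check_hints "$hints_file" && echo "hints file looks well-formed"
```

A file that passes this check could then be passed with --hintsfile=hints.example.gff; the check covers only the column layout, not whether the hints make biological sense.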

--extrinsicCfgFile=cfgfilename
    Optional. This file contains the list of used sources for the hints and
    their boni and mali. If not specified the file "extrinsic.cfg" in the
    config directory $AUGUSTUS_CONFIG_PATH is used.

--maxDNAPieceSize=n
    This value specifies the maximal length of the pieces that the sequence
    is cut into for the core algorithm (Viterbi) to be run.
    Default is --maxDNAPieceSize=200000.
    AUGUSTUS tries to place the boundaries of these pieces in the intergenic
    region, which is inferred by a preliminary prediction. GC-content
    dependent parameters are chosen for each piece of DNA if
    /Constant/decomp_num_steps > 1 for that species. This is why this value
    should not be set very large, even if you have plenty of memory.

--protein=on/off
--introns=on/off
--start=on/off
--stop=on/off
--cds=on/off
--codingseq=on/off
    Output options. Output predicted protein sequence, introns, start
    codons, stop codons. Or use 'cds' in addition to 'initial', 'internal',
    'terminal' and 'single' exon. The CDS excludes the stop codon (unless
    stopCodonExcludedFromCDS=false) whereas the terminal and single exon
    include the stop codon.

--AUGUSTUS_CONFIG_PATH=path
    path to config directory (if not specified as environment variable)

--alternatives-from-evidence=true/false
    report alternative transcripts when they are suggested by hints

--alternatives-from-sampling=true/false
    report alternative transcripts generated through probabilistic sampling

--sample=n
--minexonintronprob=p
--minmeanexonintronprob=p
--maxtracks=n

--proteinprofile=filename
    Read a protein profile from file filename. See section 7 of the
    AUGUSTUS README.

--predictionStart=A, --predictionEnd=B
    A and B define the range of the sequence for which predictions should
    be found. Quicker if you need predictions only for a small part.
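
As a small sketch of how these two options might be combined in practice, the helper below builds (and only prints) an augustus command restricted to a window around a position of interest. window_cmd and its argument names are hypothetical, and clamping the start to 1 assumes 1-based coordinates, which is an assumption to check against your Augustus version.

```shell
# Print, without running it, an augustus command limited to a window
# around a position. window_cmd is a hypothetical helper for illustration.
window_cmd() {
    local species=$1 fasta=$2 center=$3 flank=$4
    local start=$((center - flank))
    local end=$((center + flank))
    # Assumption: coordinates are 1-based, so never go below 1.
    if [ "$start" -lt 1 ]; then
        start=1
    fi
    echo "augustus --species=$species --predictionStart=$start --predictionEnd=$end $fasta"
}

window_cmd human genome.fa 150000 20000
# prints: augustus --species=human --predictionStart=130000 --predictionEnd=170000 genome.fa
```

Printing the command first makes it easy to inspect the computed range before starting a run.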

--gff3=on/off
    output in gff3 format.

--UTR=on/off
    predict the untranslated regions in addition to the coding sequence.
    This currently works only for human, galdieria, toxoplasma and
    caenorhabditis.

--outfile=filename
    print output to filename instead of standard output. This is useful
    for computing environments, e.g. parasol jobs, which do not allow
    shell redirection.

--noInFrameStop=true/false
    Don't report transcripts with in-frame stop codons. Otherwise,
    intron-spanning stop codons could occur. Default: false

--noprediction=true/false
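    If true and input is in GenBank format, no prediction is made. Useful
    for getting the annotated protein sequences. Augustus can also take a
    GenBank-format file as input, run gene prediction on it, and compare
    the predictions with the GenBank annotation to produce accuracy
    statistics.
    Because some sequences in a GenBank file have no CDS annotation, this
    parameter can be used to detect them and obtain the IDs of the
    sequences that lack a CDS. Those sequences can then be removed manually
    before the prediction accuracy is evaluated.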
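
The cleanup step described above (finding records that lack a CDS annotation) can also be approximated without running Augustus at all. The sketch below is not the --noprediction route; it scans the GenBank flat file directly with awk, assuming the standard layout in which each record starts with a LOCUS line, ends with a line beginning with //, and lists feature keys such as CDS starting in column 6. list_records_without_cds is a hypothetical helper name.

```shell
# Print the LOCUS name of every GenBank record that has no CDS feature.
# Assumes the standard flat-file layout: "LOCUS  NAME ..." opens a record,
# "//" closes it, and feature keys are indented by five spaces.
list_records_without_cds() {
    awk '
        /^LOCUS/     { locus = $2; has_cds = 0 }
        /^     CDS / { has_cds = 1 }
        /^\/\//      { if (locus != "" && !has_cds) print locus; locus = "" }
    ' "$1"
}

# Example call (sequences.gb is the file name used in section 2.1):
# list_records_without_cds sequences.gb
```

The resulting names can be used to drop those records before running the accuracy evaluation.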

--contentmodels=on/off
    If 'off' the content models are disabled (all emissions uniformly 1/4).
    The content models are: coding region Markov chain (emiprobs), initial
    k-mers in coding region (Pls), intron and intergenic region Markov
    chain. This option is intended for special applications that require
    judging gene structures from the signal models only, e.g. for
    predicting the effect of SNPs or mutations on splicing. For all typical
    gene predictions, this should be 'on'. Default: on

--paramlist
    For a complete list of parameters, type "augustus --paramlist"
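
Once a prediction has been written in GFF or GFF3 format, a quick way to check the effect of options such as --strand is to count the predicted genes per strand. This sketch assumes the output marks each gene with the feature type gene in column 3 and the strand in column 7, with comment lines starting with #; count_genes_by_strand and out.gff are illustrative names.

```shell
# Count predicted genes per strand in a tab-separated GFF/GFF3 file.
# Assumes column 3 holds the feature type and column 7 the strand.
count_genes_by_strand() {
    awk -F '\t' '
        /^#/         { next }       # skip comment lines
        $3 == "gene" { n[$7]++ }    # tally gene features by strand
        END          { printf "+\t%d\n-\t%d\n", n["+"], n["-"] }
    ' "$1"
}

# Example call on the output file from section 2.1:
# count_genes_by_strand out.gff
```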
