Incorporating Illumina RNAseq into AUGUSTUS with Tophat

Incorporating Illumina RNAseq into AUGUSTUS with Tophat

#!/bin/bash

###### Step 1 ---------------------

# Aligning reads with Bowtie/Tophat
bowtie2-build genome.fasta.masked genome
tophat -r 20 -mate-std-dev 40 --coverage-search --microexon-search -p 24 --library-type fr-unstranded --phred64-quals genome reads_1.fq reads_2.fq 

# Filtering raw alignments
samtools sort -n tophat_out/accepted_hits.bam tophat_out/accepted_hits.s
augustus.2.7/auxprogs/filterBam/bin/filterBam --uniq --paired --in tophat_out/accepted_hits.s.bam --out tophat_out/accepted_hits.sf.bam

samtools view -H tophat_out/accepted_hits.sf.bam > header.txt

# Creating intron hints
samtools sort tophat_out/accepted_hits.sf.bam both.ssf
augustus.2.7/auxprogs/bam2hints/bam2hints --intronsonly --in=both.ssf.bam --out=hints.gff

# Run Augustus
augustus.2.7/bin/augustus --species=SPECIES --extrinsicCfgFile=extrinsic.cfg --alternatives-from-evidence=true --hintsfile=hints.gff --allow_hinted_splicesites=atac --introns=on --genemodel=complete genome.fasta > aug1.out
######-----------------------------

###### Step 2 ---------------------

# Create an exon-exon junction database
cat aug1.out | tee aug.prelim.gff | grep -P "\tintron\t" > aug1.introns.gff
cat hints.gff aug1.introns.gff | perl -ne 'split; print "@_[0]:@_[3]-@_[4]\n";' | sort -u > introns.lst
augustus.2.7/scripts/intron2exex.pl --flank=90 --introns=introns.lst --seq=genome.fasta.masked --exex=exex.fa --map=map.psl
bowtie2-build exex.fa genome_exex1

# Aligning reads with Bowtie
bowtie2 -q --phred64 --fr -p 24 --reorder -x genome_exex1 -1 reads_1.fq -2 reads_2.fq -S bowtie.sam
samtools view -S -F 4 bowtie.sam > bowtie.F.sam
augustus.2.7/scripts/samMap.pl bowtie.F.sam map.psl > bowtie.global.sam 2> bowtie.samMap.err
# 需要将samMap.pl中的75修正为需要的reads长度。注意使用 2> 来重定向STDERR数据,以提高运行速度。

# Join data from step 1 and step 2
bamtools filter -in tophat_out/accepted_hits.bam -out tophat_out/accepted_hits.noN.bam -script augustus.2.7/auxprogs/auxBamFilters/operation_N_filter.txt
cat header.txt bowtie.global.sam > bowtie.global.h.sam
samtools view -bS -o bowtie.global.h.bam bowtie.global.h.sam
bamtools merge -in bowtie.global.h.bam -in tophat_out/accepted_hits.noN.bam -out both.bam
samtools sort -n both.bam both.s

# Filtering raw alignments
augustus.2.7/auxprogs/filterBam/bin/filterBam --uniq --paired --in both.s.bam --out both.sf.bam

# Creating intron hints
samtools sort both.sf.bam both.ssf
augustus.2.7/auxprogs/bam2hints/bam2hints --intronsonly --in=both.ssf.bam --out=hints.2.gff

#Run Augustus
augustus --species=yourSpecies --extrinsicCfgFile=extrinsic.cfg --alternatives-from-evidence=true --hintsfile=hints.2.gff --allow_hinted_splicesites=atac genome.fasta > aug2.out

Augustus的安装和使用参数

AUGUSTUS is a program that predicts genes in eukaryotic genomic sequences.

1. Augustus的安装

Augustus下载:http://bioinf.uni-greifswald.de/augustus/binaries/

$ wget http://bioinf.uni-greifswald.de/augustus/binaries/augustus.2.7.tar.gz
$ tar zxf augustus.2.7.tar.gz
$ cd augustus.2.7
$ cd src
$ make -j 8
$ export AUGUSTUS_CONFIG_PATH=$PWD/../config/ (可以加入到.bashrc中)

2. Augustus使用方法

2.1 基因预测例子

$ augustus --strand=both --genemode=partial --singlestrand=false --hintsfile=hints.gff --extrinsicCfgFile=extrinsic.cfg --protein=on --introns=on --start=on --stop=on --cds=on --codingseq=on --alternatives-from-evidence=true --gff3=on --UTR=on ----outfile=out.gff --species=human genome.fa
$ augustus --noprediction=true --species=SPECIES sequences.gb

2.2 Augustus使用参数

Usage:

augustus [parameters] --sepcies=SPECIES queryfilename

重要参数:

--strand=both, --strand=forward or --strand=backward
    report predicted genes on both strands, just the forward or 
just the backward strand.default is 'both'

--genemodel=partial, --genemodel=intronless, --genemodel=complete, 
--genemodel=atleastone or --genemodel=exactlyone
    partial : allow prediction of incomplete genes at the sequence boundaries (default)
    intronless : only predict single-exon genes like in prokaryotes and some eukaryotes
    complete : only predict complete genes
    atleastone : predict at least one complete gene
    exactlyone : predict exactly one complete gene

--singlestrand=true
    predict genes independently on each strand, allow overlapping
 genes on opposite strands. This option is turned off by default.

--hintsfile=hintsfilename
    When this option is used the prediction considering hints (ex
trinsic information) is turned on. hintsfilename contains the hints
 in gff format.

--extrinsicCfgFile=cfgfilename
    Optional. This file contains the list of used sources for the 
hints and their boni and mali. If not specified the file "extrin
sic.cfg" in the config directory $AUGUSTUS_CONFIG_PATH is used.

--maxDNAPieceSize=n
    This value specifies the maximal length of the pieces that the 
sequence is cut into for the core algorithm (Viterbi) to be run. 
Default is --maxDNAPieceSize=200000.
    AUGUSTUS tries to place the boundaries of these pieces in the 
intergenic region, which is inferred by a preliminary prediction. 
GC-content dependent parameters are chosen for each piece of DNA 
if /Constant/decomp_num_steps > 1 for that species. This is why 
this value should not be set very large, even if you have plenty 
of memory.

--protein=on/off
--introns=on/off
--start=on/off
--stop=on/off
--cds=on/off
--codingseq=on/off
    Output options. Output predicted protein sequence, introns, 
start codons, stop codons. Or use 'cds' in addition to 'initial', 
'internal', 'terminal' and 'single' exon. The CDS excludes the 
stop codon (unless stopCodonExcludedFromCDS=false) whereas the 
terminal and single exon include the stop codon.

--AUGUSTUS_CONFIG_PATH=path
    path to config directory (if not specified as environment var
iable)

--alternatives-from-evidence=true/false
    report alternative transcripts when they are suggested by hints

--alternatives-from-sampling=true/false
    report alternative transcripts generated through probabilistic 
sampling

--sample=n
--minexonintronprob=p
--minmeanexonintronprob=p
--maxtracks=n

--proteinprofile=filename
Read a protein profile from file filename. See section 7 below.

--predictionStart=A, --predictionEnd=B
    A and B define the range of the sequence for which predictions 
should be found. Quicker if you need predictions only for a small 
part.

--gff3=on/off
    output in gff3 format.

--UTR=on/off
    predict the untranslated regions in addition to the coding 
sequence. This currently works only for human, galdieria, toxopl
asma and caenorhabditis.

--outfile=filename
    print output to filename instead to standard output. This is 
useful for computing environments, e.g. parasol jobs, which do 
not allow shell redirection.

--noInFrameStop=true/false
    Don't report transcripts with in-frame stop codons. Otherwise, 
intron-spanning stop codons could occur. Default: false

--noprediction=true/false
    If true and input is in genbank format, no prediction is made. 
Useful for getting the annotated protein sequences. Augustus也可以以
genebank格式文件为输入文件,进行基因预测,并将预测结果和genebank的结果进行比较后
得出一个精确性的统计结果。
    当然,由于genebank格式文件中有些sequences没有cds的注释结果,因此可以使用该
参数进行检测,从而得到没有cds的序列号,在人为去去除这些没有cds注释的序列,再去进行
预测准确性的评估。

--contentmodels=on/off
    If 'off' the content models are disabled (all emissions unif
ormly 1/4). The content models are; coding region Markov chain 
(emiprobs), initial k-mers in coding region (Pls), intron and int
ergenic regin Markov chain. This option is intended for special 
applications that require judging gene structures from the signal 
models only, e.g. for predicting the effect of SNPs or mutations 
on splicing. For all typical gene predictions, this should be 
true. Default: on

--paramlist
    For a complete list of parameters, type "augustus --paramlist"

gene structure prediction methods

基因预测的方法有很多。维基百科:http://en.wikipedia.org/wiki/List_of_gene_prediction_softwarehttp://www.geneprediction.org上提供了大量的基因预测的软件和指导书籍

以下介绍主要的真核生物基因预测的几个方法:

1. Augustus

AUGUSTUS ususally belongs to the most accurate programs for the species it is trained for. Often it is the most accurate ab initio program.

AUGUSTUS的发展很快,版本不断更新,功能越来越完善。AUGUSTUS能很好地结合RNA-seq数据得到的hints.gff进行基因预测。

2. GeneMark-ES

GeneMark-ES使用起来最简单,唯一的能以基因组自我训练出训练集用于基因预测。该软件不提供使用已有模型来预测基因。对应的使用模型的版本为GeneMark.hmm eukaryotic,但是能使用的模型太少,不提供其建模的方法。

3. FGENESH

FGENSH是商业软件,由Softberry公司负责维护和发布,用户只能联网小量使用。若需要大量使用或本地化运行,则需要付费。找了半天,官网不提供软件的下载。使用网页版的FGENESH去进行基因组基本的基因预测,很长时间后返回错误:Request not processed 413 Request entity too large。

4. mGene[.web]

虽然是新近的软件,但是该软件现在还不成熟,使用不易。该软件精确度貌似较高,特别是在下线虫中,和Augustus差不多了。

5. SNAP

Ian Korf独自写了该软件,2004年发表在BMC Bioinformatics上,引用次数208.

该软件安装简单,软件说明较少,作者只是简单在README中提到怎么去建HMM的模型。

6. Glimmer[HMM]

使用Glimmerhmm,应该还不如使用SNAP。

7. geneid

geneid的training的文档中写得很是繁琐,不想看。