Incorporating Illumina RNAseq into AUGUSTUS with Tophat

发表于2013 年 5 月 7 日由chenlianfu

#!/bin/bash

###### Step 1 ---------------------

# Aligning reads with Bowtie/Tophat
bowtie2-build genome.fasta.masked genome
tophat -r 20 -mate-std-dev 40 --coverage-search --microexon-search -p 24 --library-type fr-unstranded --phred64-quals genome reads_1.fq reads_2.fq 

# Filtering raw alignments
samtools sort -n tophat_out/accepted_hits.bam tophat_out/accepted_hits.s
augustus.2.7/auxprogs/filterBam/bin/filterBam --uniq --paired --in tophat_out/accepted_hits.s.bam --out tophat_out/accepted_hits.sf.bam

samtools view -H tophat_out/accepted_hits.sf.bam > header.txt

# Creating intron hints
samtools sort tophat_out/accepted_hits.sf.bam both.ssf
augustus.2.7/auxprogs/bam2hints/bam2hints --intronsonly --in=both.ssf.bam --out=hints.gff

# Run Augustus
augustus.2.7/bin/augustus --species=SPECIES --extrinsicCfgFile=extrinsic.cfg --alternatives-from-evidence=true --hintsfile=hints.gff --allow_hinted_splicesites=atac --introns=on --genemodel=complete genome.fasta > aug1.out
######-----------------------------

###### Step 2 ---------------------

# Create an exon-exon junction database
cat aug1.out | tee aug.prelim.gff | grep -P "\tintron\t" > aug1.introns.gff
cat hints.gff aug1.introns.gff | perl -ne 'split; print "@_[0]:@_[3]-@_[4]\n";' | sort -u > introns.lst
augustus.2.7/scripts/intron2exex.pl --flank=90 --introns=introns.lst --seq=genome.fasta.masked --exex=exex.fa --map=map.psl
bowtie2-build exex.fa genome_exex1

# Aligning reads with Bowtie
bowtie2 -q --phred64 --fr -p 24 --reorder -x genome_exex1 -1 reads_1.fq -2 reads_2.fq -S bowtie.sam
samtools view -S -F 4 bowtie.sam > bowtie.F.sam
augustus.2.7/scripts/samMap.pl bowtie.F.sam map.psl > bowtie.global.sam 2> bowtie.samMap.err
# 需要将samMap.pl中的75修正为需要的reads长度。注意使用 2> 来重定向STDERR数据，以提高运行速度。

# Join data from step 1 and step 2
bamtools filter -in tophat_out/accepted_hits.bam -out tophat_out/accepted_hits.noN.bam -script augustus.2.7/auxprogs/auxBamFilters/operation_N_filter.txt
cat header.txt bowtie.global.sam > bowtie.global.h.sam
samtools view -bS -o bowtie.global.h.bam bowtie.global.h.sam
bamtools merge -in bowtie.global.h.bam -in tophat_out/accepted_hits.noN.bam -out both.bam
samtools sort -n both.bam both.s

# Filtering raw alignments
augustus.2.7/auxprogs/filterBam/bin/filterBam --uniq --paired --in both.s.bam --out both.sf.bam

# Creating intron hints
samtools sort both.sf.bam both.ssf
augustus.2.7/auxprogs/bam2hints/bam2hints --intronsonly --in=both.ssf.bam --out=hints.2.gff

#Run Augustus
augustus --species=yourSpecies --extrinsicCfgFile=extrinsic.cfg --alternatives-from-evidence=true --hintsfile=hints.2.gff --allow_hinted_splicesites=atac genome.fasta > aug2.out

RepeatMasker的安装和使用

发表于2013 年 5 月 7 日由chenlianfu

Augustus的安装和使用参数

发表于2013 年 5 月 7 日由chenlianfu

AUGUSTUS is a program that predicts genes in eukaryotic genomic sequences.

1. Augustus的安装

Augustus下载：http://bioinf.uni-greifswald.de/augustus/binaries/

$ wget http://bioinf.uni-greifswald.de/augustus/binaries/augustus.2.7.tar.gz
$ tar zxf augustus.2.7.tar.gz
$ cd augustus.2.7
$ cd src
$ make -j 8
$ export AUGUSTUS_CONFIG_PATH=$PWD/../config/ (可以加入到.bashrc中）

2. Augustus使用方法

2.1 基因预测例子

$ augustus --strand=both --genemode=partial --singlestrand=false --hintsfile=hints.gff --extrinsicCfgFile=extrinsic.cfg --protein=on --introns=on --start=on --stop=on --cds=on --codingseq=on --alternatives-from-evidence=true --gff3=on --UTR=on ----outfile=out.gff --species=human genome.fa
$ augustus --noprediction=true --species=SPECIES sequences.gb

2.2 Augustus使用参数

Usage:

augustus [parameters] --sepcies=SPECIES queryfilename

重要参数：

--strand=both, --strand=forward or --strand=backward
report predicted genes on both strands, just the forward or
just the backward strand.default is 'both'

--genemodel=partial, --genemodel=intronless, --genemodel=complete,
--genemodel=atleastone or --genemodel=exactlyone
partial : allow prediction of incomplete genes at the sequence boundaries (default)
intronless : only predict single-exon genes like in prokaryotes and some eukaryotes
complete : only predict complete genes
atleastone : predict at least one complete gene
exactlyone : predict exactly one complete gene

--singlestrand=true
predict genes independently on each strand, allow overlapping
genes on opposite strands. This option is turned off by default.

--hintsfile=hintsfilename
When this option is used the prediction considering hints (ex
trinsic information) is turned on. hintsfilename contains the hints
in gff format.

--extrinsicCfgFile=cfgfilename
Optional. This file contains the list of used sources for the
hints and their boni and mali. If not specified the file "extrin
sic.cfg" in the config directory $AUGUSTUS_CONFIG_PATH is used.

--maxDNAPieceSize=n
This value specifies the maximal length of the pieces that the
sequence is cut into for the core algorithm (Viterbi) to be run.
Default is --maxDNAPieceSize=200000.
AUGUSTUS tries to place the boundaries of these pieces in the
intergenic region, which is inferred by a preliminary prediction.
GC-content dependent parameters are chosen for each piece of DNA
if /Constant/decomp_num_steps > 1 for that species. This is why
this value should not be set very large, even if you have plenty
of memory.

--protein=on/off
--introns=on/off
--start=on/off
--stop=on/off
--cds=on/off
--codingseq=on/off
Output options. Output predicted protein sequence, introns,
start codons, stop codons. Or use 'cds' in addition to 'initial',
'internal', 'terminal' and 'single' exon. The CDS excludes the
stop codon (unless stopCodonExcludedFromCDS=false) whereas the
terminal and single exon include the stop codon.

--AUGUSTUS_CONFIG_PATH=path
path to config directory (if not specified as environment var
iable)

--alternatives-from-evidence=true/false
report alternative transcripts when they are suggested by hints

--alternatives-from-sampling=true/false
report alternative transcripts generated through probabilistic
sampling

--sample=n
--minexonintronprob=p
--minmeanexonintronprob=p
--maxtracks=n

--proteinprofile=filename
Read a protein profile from file filename. See section 7 below.

--predictionStart=A, --predictionEnd=B
A and B define the range of the sequence for which predictions
should be found. Quicker if you need predictions only for a small
part.

--gff3=on/off
output in gff3 format.

--UTR=on/off
predict the untranslated regions in addition to the coding
sequence. This currently works only for human, galdieria, toxopl
asma and caenorhabditis.

--outfile=filename
print output to filename instead to standard output. This is
useful for computing environments, e.g. parasol jobs, which do
not allow shell redirection.

--noInFrameStop=true/false
Don't report transcripts with in-frame stop codons. Otherwise,
intron-spanning stop codons could occur. Default: false

--noprediction=true/false
If true and input is in genbank format, no prediction is made.
Useful for getting the annotated protein sequences. Augustus也可以以
genebank格式文件为输入文件，进行基因预测，并将预测结果和genebank的结果进行比较后
得出一个精确性的统计结果。
当然，由于genebank格式文件中有些sequences没有cds的注释结果，因此可以使用该
参数进行检测，从而得到没有cds的序列号，在人为去去除这些没有cds注释的序列，再去进行
预测准确性的评估。

--contentmodels=on/off
If 'off' the content models are disabled (all emissions unif
ormly 1/4). The content models are; coding region Markov chain
(emiprobs), initial k-mers in coding region (Pls), intron and int
ergenic regin Markov chain. This option is intended for special
applications that require judging gene structures from the signal
models only, e.g. for predicting the effect of SNPs or mutations
on splicing. For all typical gene predictions, this should be
true. Default: on

--paramlist
For a complete list of parameters, type "augustus --paramlist"

gene structure prediction methods

发表于2013 年 5 月 7 日由chenlianfu

基因预测的方法有很多。维基百科：http://en.wikipedia.org/wiki/List_of_gene_prediction_software；http://www.geneprediction.org上提供了大量的基因预测的软件和指导书籍

以下介绍主要的真核生物基因预测的几个方法：

1. Augustus

AUGUSTUS ususally belongs to the most accurate programs for the species it is trained for. Often it is the most accurate ab initio program.

AUGUSTUS的发展很快，版本不断更新，功能越来越完善。AUGUSTUS能很好地结合RNA-seq数据得到的hints.gff进行基因预测。

2. GeneMark-ES

GeneMark-ES使用起来最简单，唯一的能以基因组自我训练出训练集用于基因预测。该软件不提供使用已有模型来预测基因。对应的使用模型的版本为GeneMark.hmm eukaryotic，但是能使用的模型太少，不提供其建模的方法。

3. FGENESH

FGENSH是商业软件，由Softberry公司负责维护和发布，用户只能联网小量使用。若需要大量使用或本地化运行，则需要付费。找了半天，官网不提供软件的下载。使用网页版的FGENESH去进行基因组基本的基因预测，很长时间后返回错误：Request not processed 413 Request entity too large。

4. mGene[.web]

虽然是新近的软件，但是该软件现在还不成熟，使用不易。该软件精确度貌似较高，特别是在下线虫中，和Augustus差不多了。

5. SNAP

Ian Korf独自写了该软件，2004年发表在BMC Bioinformatics上，引用次数208.

该软件安装简单，软件说明较少，作者只是简单在README中提到怎么去建HMM的模型。

6. Glimmer[HMM]

使用Glimmerhmm，应该还不如使用SNAP。

7. geneid

geneid的training的文档中写得很是繁琐，不想看。

一	二	三	四	五	六	日
		1	2	3	4	5
6	7	8	9	10	11	12
13	14	15	16	17	18	19
20	21	22	23	24	25	26
27	28	29	30	31

陈连福的生信博客

第22期培训班将于2024.01.27-2024.02.05期间在武汉市举办！

日度归档：2013 年 5 月 7 日

Incorporating Illumina RNAseq into AUGUSTUS with Tophat

RepeatMasker的安装和使用

Augustus的安装和使用参数

1. Augustus的安装

2. Augustus使用方法

2.1 基因预测例子

2.2 Augustus使用参数

gene structure prediction methods

1. Augustus

2. GeneMark-ES

3. FGENESH

4. mGene[.web]

5. SNAP

6. Glimmer[HMM]

7. geneid