PASA的使用

前文讲述了:PASA的安装,配置与主程序使用参数。本文则重点讲述PASA的几种平行的用法:

最简单地理解PASA的用法:PASA是通过调用 blat 或 gsnap 来将 transcripts sequences 比对到 genome 上,从而进行基因的结构注释(gff文件形式)。

1. 普通用法

使用transcripts sequences来进行基因预测,需要2个(或3个)输入的文件:

The genome sequence in a multiFasta file (ie. genome.fasta)

The transcript sequences in a multiFasta file (ie. transcripts.fasta)

Optional: a file containing the list of accessions corresponding to full-length cDNAs (ie. FL_accs.txt)该文件是第二个文件中属于全长CDNAs的序列名的集合,txt文档。

其使用步骤为:

1.1 使用seqclean来对transcript sequences进行clean

通过污染数据库vectors.fasta来对transcript sequences可能被污染的数据进行清除。

$ seqclean transcripts.fasta -v vectors.fasta

1.2 运行PASA

$ cp $PASA_HOME/pasa_conf/pasa.alignAssembly.Template.txt alignAssembly.config
$ perl -p -i -e 's/MYSQLDB=.*/MYSQLDB=sample_mydb_pasa/' alignAssembly.config
$ $PASA_HOME/scripts/Launch_PASA_pipeline.pl -c alignAssembly.config -C -R -g genome_sample.fasta -t all_transcripts.fasta.clean -T -u all_transcripts.fasta -f FL_accs.txt --ALIGNERS blat,gmap --CPU 2

1.3 PASA结果

sample_mydb_pasa.assemblies.fasta : the PASA assemblies in FASTA format.

sample_mydb_pasa.pasa_assemblies.gff3,.gtf,.bed : the PASA assembly structures.

sample_mydb_pasa.pasa_alignment_assembly_building.ascii_illustrations.out : descriptions of alignment assemblies and how they were constructed from the underlying transcript alignments.

sample_mydb_pasa.pasa_assemblies_described.txt : tab-delimited format describing the contents of the PASA assemblies, including the identity of those transcripts that were assembled into the corresponding structure.

2. PASA结合RNA-seq数据来进行基因预测

由RNA-seq数据可以得到transcript sequences,以RNA-seq数据使用Trinity组装过后的转录组结果可以按第一种方法进行PASA分析。在此,更好的方法是: 利用PASA,使用De novo转录组和genome来建立一个综合转录组数据库。

需要2个(或3个)的输入文件:

1. Trinity de novo RNA-Seq assemblies (ex. Trinity.fasta)
2. Trinity genome-guided RNA-Seq assemblies (ex. Trinity.GG.fasta)
3. (optionally)cufflinks transcript structures (ex. cufflinks.gtf).

使用步骤:
2.1 Concatenate the Trinity.fasta and Trinity.GG.fasta files into a single transcripts.fasta file.

Trinity.GG.fasta文件可以是 Trinity genome-guided 过程中产生的文件Trinity_GG.fasta, 如果使用最后PASA得到的assembly结果文件,则需要修改fasta文件的头。

cat Trinity.fasta Trinity.GG.fasta > transcripts.fasta

2.2 Create a file containing the list of transcript accessions that correspond to the Trinity de novo assembly (full de novo, not genome-guided).

$PASA_HOME/misc_utilities/accession_extractor.pl < Trinity.fasta > tdn.accs

2.3 Run PASA using RNA-Seq related options as described in the section above, but include the parameter setting –TDN tdn.accs. To (optionally) include cufflinks-generated transcript structures, further include the parameter setting –cufflinks_gtf cufflinks.gtf. Note, cufflinks may not be appropriate for gene-dense targets, such as in fungi; cufflinks excels when applied to vertebrate genomes, so best to include when applying to mouse or human.

$ $PASA_HOME/scripts/Launch_PASA_pipeline.pl -c alignAssembly.config -t transcripts.fasta -C -R -g genome_sample.fasta --ALIGNERS blat,gmap --CPU 2 --TDN tdn.accs --cufflinks_gtf cufflinks.gtf

2.4 After completing the PASA alignment assembly, generate the comprehensive transcriptome database via:

$PASA_HOME/PASA/scripts/build_comprehensive_transcriptome.dbi -c alignAssembly.config -t transcripts.fasta --min_per_ID 95 --min_per_aligned 30

2.5 结果文件
compreh_init_build/compreh_init_build.fasta : the transcript sequences
compreh_init_build/compreh_init_build.geneToTrans_mapping : the gene/transcript mapping file (for use with RSEM, Trinotate, other tools)

compreh_init_build/compreh_init_build.bed : transcript structures in bed format
compreh_init_build/compreh_init_build.gff3 : transcript structures in gff3 format

compreh_init_build/compreh_init_build.details : classifications of transcripts according to genome mapping status.
2.6 使用感受:

建立一个综合的transcritome数据库,起始是没将de novo的transcripts用于PASA的运行,用于PASA运行是的Trinity_GG.fasta。最后的结果fasta文件是单独使用Trinity_GG.fasta运行PASA得到的gff文件,fasta文件是单独运行得到的assembly.fasta文件和Trinity.fasta相加的结果。个人觉得这样做没什么意义!虽然,文档写道这样做的意义是有利于下游基因差异表达的分析。还不如就使用Trinity_GG.fasta单独去运行PASA。

3. Annotation Comparisons and Annotation Updates

Incorporating PASA Assemblies into Existing Gene Predictions, Changing Exons, Adding UTRs and Alternatively Spliced Models

The PASA software can update any preexisting set of protein-coding gene annotations to incorporate the PASA alignment evidence, correcting exon boundaries, adding UTRs, and models for alternative splicing based on the PASA alignment assemblies generated above.

步骤如下:

3.1 将protein-coding gene annotations上载到数据库

验证gff3文件的兼容性,如果会报告错误和警示,则需要修改gff3文件.

$ $PASA_HOME/misc_utilities/pasa_gff3_validator.pl orig_annotations_sample.gff3

将基因注释文件orig_annotations_sample.gff3上载到数据库。note that you should only feed protein-coding genes to PASA using the loader。

$ $PASA_HOME/scripts/Load_Current_Gene_Annotations.dbi -c alignAssembly.config -g genome_sample.fasta -P orig_annotations_sample.gff3

继续检查一遍兼容性,上载之后,则不会有任何返回到标准输出。

3.2 Performing an annotation comparison and generating an updated gene set

运行PASA主程序,使用注释的config文件(将其中的MYSQLDB的值更改的和比对的config文件一致)

$PASA_HOME/scripts/Launch_PASA_pipeline.pl -c annotCompare.config -A -g genome_sample.fasta -t all_transcripts.fasta.clean

3.3 运行结果

Once the annotation comparison is complete, PASA will output a new GFF3 file that contains the PASA-updated version of the genome annotation, including those gene models successfully updated by PASA, and those that remained untouched. This file will be named ${mysql_db}.gene_structures_post_PASA_updates.$pid.gff3, where $pid is the process ID for this annotation comparison computation.

当然,也可以再次看看PASA的Web Portal.报告已经发生了改变。

4. Identification and Classification of All Alternative Splicing Variations

在PASA进行transcripts比对到genome时,可以进行可变剪切预测。加入选项–ALT_SPLICE。

5. Extraction of ORFs from PASA assemblies (reference ORFs for training gene predictors)

PASA可以从用于自动提取编码蛋白基因的基因结构,用于训练基因集合,来用于其它基因预测软件。To do this, the longest ORF (min 100 amino acids, configurable) is found within each PASA alignment assembly, allowing for 5′ and 3′ partials. The top 500 longest ORFs (configurable) are extracted and, using these in addition to the full set of genome sequences, log-likelihood scores for a hexamer-based Markov model are computed. All ORFs are scored using this Markov model, and those ORFs that are most likely coding are reported as best. The scoring method for ORF identification is similar to that used in the GeneID software and nicely described in the methods section of the GeneID publication.

在运行过PASA过后,则可以进行ORFs的提取:

$ export PERL5LIB $PERL5LIB:$PASAHOME/PerlLib/
$ $PASAHOME/scripts/pasa_asmbls_to_training_set.dbi -M "$mysql_db_name:$mysql_server_name" -p "$user:$password" -g genome_sample.fasta

运行结果:
trainingSetCandidates.cds,.pep,.gff3 :corresponds to all longest ORFs of minimum length extracted from the PASA alignment assemblies.

trainingSetCandidates.cds.top_500_longest :corresponds to the 500 top longest ORFs, thought to be most likely to represent correct protein-coding genes.

trainingSetCandidates.cds.scores
:scores for all ORFs based on the Markov model.

best_candidates.cds,.pep,.gff3,.bed :the best candidates based on their coding scores compared to random sequences of the same nucleotide composition.

发表评论

您的电子邮箱地址不会被公开。 必填项已用*标注

此站点使用Akismet来减少垃圾评论。了解我们如何处理您的评论数据