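1. Downloading and installing EVM

Download EVM from the EVM SourceForge download page. Once the archive is unpacked, EVM can be used directly with no further installation.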

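2. Using EVM step by step

2.1 Collecting the gene annotations in GFF3 format

EVM distinguishes three kinds of evidence, and all annotation results, whichever method produced them, go into three files: gene_predictions.gff3, protein_alignments.gff3 and transcript_alignments.gff3. The file names themselves are arbitrary, but EVM accepts only these three categories of input.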

$ cat genemark_hmm.gff3 snap.gff3 aug2.out.gff3 > gene_predictions.gff3
$ cat LEdodesGGTrinity.pasa_assemblies.gff3 > transcript_alignments.gff3
$ perl -p -i -e 's/^#.*//s' gene_predictions.gff3 transcript_alignments.gff3
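
After concatenating the files it can help to see which source names they actually contain, since those names are what the weights file in the next step has to refer to. The following is only a sketch: the function name list_gff3_sources is made up for illustration, and it assumes tab-delimited GFF3 with nine columns per feature line.

```shell
# List each distinct value of GFF3 column 2 (source) with the number of
# feature lines that carry it. Comment lines and blank lines are skipped.
list_gff3_sources() {
    awk -F '\t' '!/^#/ && NF >= 9 { count[$2]++ }
        END { for (s in count) printf "%s\t%d\n", s, count[s] }' "$@" | sort
}

# Usage (file names as in the commands above):
# list_gff3_sources gene_predictions.gff3 transcript_alignments.gff3
```

Every name printed here should normally have a matching line in the weights file.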

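2.2 Creating the evidence weights file

The three kinds of evidence differ in reliability, and the weight for each is set by your own judgment. Because PASA is based on transcript evidence, it gets a high weight; ab initio predictions get lower ones.

For example: I ran PASA on transcriptome sequencing data; trained an Augustus HMM model from the PASA results and predicted genes with hints built from the transcriptome data; trained a SNAP HMM model from the PASA results for ab initio prediction; and ran GeneMark_ES with self-training for ab initio prediction.

For these four methods the weights rank as: PASA > Augustus > SNAP > GeneMark_ES.

The general recommendation is: weight(pasa) >> weight(protein) >= weight(prediction)

Since transcriptome data were used, I set the weights for the four methods above as follows: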

echo "ABINITIO_PREDICTION   AUGUSTUS    6
ABINITIO_PREDICTION GeneMark.hmm    1
ABINITIO_PREDICTION SNAP    2
TRANSCRIPT  assembler-LEdodesGGTrinity  10" > weights.txt

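The weights file has three columns. The first is the evidence class, one of ABINITIO_PREDICTION, PROTEIN or TRANSCRIPT. The second is the evidence source and must match the source field of the corresponding GFF3 lines, which is column 2 of the GFF3 file (for example AUGUSTUS or SNAP), not the feature type in column 3. The third is the weight.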
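
A quick sanity check of the weights file can catch mismatches before EVM is run. The sketch below is not part of EVM; the function names check_weights_file and missing_sources are hypothetical. It assumes that the second column of the weights file is matched against GFF3 column 2 (the source field, which is where names such as AUGUSTUS and SNAP appear), and it only knows the three evidence classes named in this post.

```shell
# Report weights-file lines that are not "<class> <source> <numeric weight>".
# Prints the offending lines and returns non-zero if any are found.
check_weights_file() {
    awk '
        NF == 0 { next }                       # ignore blank lines
        NF != 3 ||
        $1 !~ /^(ABINITIO_PREDICTION|PROTEIN|TRANSCRIPT)$/ ||
        $3 !~ /^[0-9]+(\.[0-9]+)?$/ { print "line " NR ": " $0; bad = 1 }
        END { exit bad }
    ' "$1"
}

# Print sources listed in the weights file that never occur in column 2
# of the given GFF3 files.
missing_sources() {
    weights=$1; shift
    seen=$(mktemp)
    awk -F '\t' '!/^#/ && NF >= 9 { print $2 }' "$@" | sort -u > "$seen"
    awk 'NF == 3 { print $2 }' "$weights" | sort -u | comm -23 - "$seen"
    rm -f "$seen"
}

# Usage:
# check_weights_file weights.txt
# missing_sources weights.txt gene_predictions.gff3 transcript_alignments.gff3
```

An empty result from missing_sources means every source in the weights file was found in the evidence files.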

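2.3 Creating the PASA-supported terminal exons supplement

This file of terminal exon information is optional but recommended. However, probably because the latest PASA release no longer matches what the EVM documentation describes, the file is somewhat difficult to produce. I did manage to build it in the end, but EVM then failed to run when it was supplied. This needs further investigation, so I will not go into it here; the step can simply be skipped.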

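2.4 Running the EVM partitioning program

The partition_EVM_inputs.pl program splits the input scaffolds (or chromosomes, or contigs) into separate directories. The result is a large number of directories, each holding one sequence and its GFF3 evidence, plus a file that lists the partitions. The command and its options are: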

$ $EVMHome/EvmUtils/partition_EVM_inputs.pl --genome genome.fasta --gene_predictions gene_predictions.gff3 --transcript_alignments transcript_alignments.gff3 --segmentSize 500000 --overlapSize 10000 --partition_listing partitions_list.out

--genome                * :fasta file containing all genome sequences
--gene_predictions      * :file containing gene predictions
--protein_alignments      :file containing protein alignments
--transcript_alignments   :file containing transcript alignments
--pasaTerminalExons       :file containing terminal exons based on PASA long-orf data.
--repeats                 :file containing repeats to be masked
--segmentSize           * :length of a single sequence for running EVM
--overlapSize           * :length of sequence overlap between segmented sequences
--partition_listing     * :name of output file to be created that contains the list of partitions
-h                        : help message with more detailed information.
(*required options)
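
When choosing --segmentSize it is useful to know which sequences exceed it, since those are presumably the ones that end up divided into overlapping segments. The following sketch only measures sequence lengths in the FASTA file; it does not reproduce EVM's partitioning, and the function name long_sequences is made up for illustration.

```shell
# Print "<sequence id><TAB><length>" for every FASTA record that is longer
# than seg_size bases. The id is the header text up to the first whitespace.
long_sequences() {
    fasta=$1; seg_size=$2
    awk -v max="$seg_size" '
        /^>/ { if (id != "" && len > max) print id "\t" len
               id = substr($1, 2); len = 0; next }
        { len += length($0) }
        END { if (id != "" && len > max) print id "\t" len }
    ' "$fasta"
}

# Usage, with the segment size from the command above:
# long_sequences genome.fasta 500000
```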

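2.5 Generating the EVM commands

The previous step created many directories. This step generates one EVM command per directory and writes all of the commands to the file commands.list.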

$ $EVMHome/EvmUtils/write_EVM_commands.pl --genome genome.fasta --gene_predictions gene_predictions.gff3 --transcript_alignments transcript_alignments.gff3 --weights `pwd`/weights.txt --output_file_name evm.out --partitions partitions_list.out > commands.list

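2.6 Running the commands in commands.list

These commands can be run in parallel on a compute cluster, or the list can be split into several parts that run at the same time, which amounts to the same thing. EVM also ships its own program for running them, which reports progress clearly. Running in parallel saves time. Each command writes a result file named evm.out (the value given to --output_file_name in the previous step) into its own directory.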

$ $EVMHome/EvmUtils/execute_EVM_commands.pl commands.list | tee run.log
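
For the option of splitting the command list into several parts and running them together, a minimal sketch could look like the following. It is not an EVM utility: the function name run_in_chunks and the chunk and log file names are invented here, and it assumes every line of the list is an independent shell command.

```shell
# Split a command list into n_jobs chunk files (round-robin) and run the
# chunks concurrently, one shell per chunk. Returns once all have finished.
run_in_chunks() {
    list=$1; n_jobs=$2
    rm -f "$list".chunk.*                      # drop chunks from earlier runs
    awk -v n="$n_jobs" -v prefix="$list.chunk." \
        'NF { print > (prefix (NR % n)) }' "$list"
    i=0
    while [ "$i" -lt "$n_jobs" ]; do
        # each chunk runs in the background with its own log file
        [ -f "$list.chunk.$i" ] && sh "$list.chunk.$i" > "$list.log.$i" 2>&1 &
        i=$((i + 1))
    done
    wait
}

# Usage: run commands.list as 4 concurrent chunks
# run_in_chunks commands.list 4
```

Round-robin assignment spreads long and short sequences across the chunks more evenly than cutting the list into consecutive blocks would.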

2.7 Merging the EVM output files

Combine the EVM output files from the individual directories to obtain the final GFF3 file.

$ $EVMHome/EvmUtils/convert_EVM_outputs_to_GFF3.pl --partitions partitions_list.out --output_file_name evm.out --genome genome.fasta
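
If the conversion leaves one GFF3 file per sequence directory, they still need to be concatenated into a single file. The sketch below assumes, without this post confirming it, that each directory ends up with a file named evm.out.gff3 (the output name plus a .gff3 suffix). The function name collect_gff3 and the output name EVM.all.gff3 are placeholders.

```shell
# Concatenate every file called <name> found below <root> into one output
# file, in sorted path order so the result is reproducible.
collect_gff3() {
    root=$1; name=$2; out=$3
    find "$root" -type f -name "$name" | sort | while IFS= read -r f; do
        cat "$f"
    done > "$out"
}

# Usage, from the directory that holds the partition directories:
# collect_gff3 . evm.out.gff3 EVM.all.gff3
```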

Chen Lianfu's Bioinformatics Blog

Integrating gene prediction results with EVM
