{"id":2307,"date":"2015-08-25T16:47:38","date_gmt":"2015-08-25T08:47:38","guid":{"rendered":"http:\/\/www.chenlianfu.com\/?p=2307"},"modified":"2015-08-25T16:47:38","modified_gmt":"2015-08-25T08:47:38","slug":"augustus-training-and-prediction","status":"publish","type":"post","link":"http:\/\/www.chenlianfu.com\/?p=2307","title":{"rendered":"Augustus Training and Prediction"},"content":{"rendered":"<h1>1. Augustus training<\/h1>\n<p>First, you need a dataset of at least 200 complete gene models. For example: run a genome-guided Trinity assembly (<em>de novo<\/em> assembly guided by a reference genome); then use PASA to align the assembled inchworm sequences to the genome; then extract the complete gene models to obtain the file trainingSet_CompleteBest.gff3.<\/p>\n<p>Then convert the GFF3 file to a GenBank file:<\/p>\n<pre>\r\n$ \/opt\/biosoft\/augustus-3.0.3\/scripts\/gff2gbSmallDNA.pl trainingSet_CompleteBest.gff3 ..\/..\/genome.fasta 50 trainingSetComplete.gb\r\n\r\nRun a preliminary Augustus training on the GenBank file; this usually produces some error messages\r\n$ etraining --species=generic --stopCodonExcludedFromCDS=false trainingSetComplete.gb 2> train.err\r\n\r\nFrom the error messages, extract the gene models that have errors\r\n$ cat train.err | perl -ne 'print \"$1\\n\" if \/in sequence (\\S+):\/' > badlist\r\n\r\nRemove the erroneous gene models to get the gene models usable for training\r\n$ \/opt\/biosoft\/augustus-3.0.3\/scripts\/filterGenes.pl badlist trainingSetComplete.gb > training.gb\r\n<\/pre>\n<p>Before running 
Augustus training, make sure that the pairwise amino-acid identity between these gene models is &lt; 70%:<\/p>\n<pre>\r\nFirst obtain the protein sequences of these gene models\r\n$ \/opt\/biosoft\/augustus-3.0.3\/scripts\/gbSmallDNA2gff.pl training.gb > training.gff2\r\n$ perl -ne 'print \"$1\\n\" if \/gene_id \\\"(\\S+?)\\\"\/' training.gff2 | uniq > trainSet.lst\r\n$ perl extract_genes.pl trainingSet_CompleteBest.gff3 trainSet.lst > training.gff3\r\n$ \/opt\/biosoft\/EVM_r2012-06-25\/EvmUtils\/gff3_file_to_proteins.pl training.gff3 genome.fasta prot > training.protein.fasta\r\n$ perl -p -i -e 's\/(>\\S+).*\/$1\/' training.protein.fasta\r\n$ perl -p -i -e 's\/\\*\/\/' training.protein.fasta\r\n\r\nBuild a BLAST database from these protein sequences and align the protein sequences against it\r\n$ makeblastdb -in training.protein.fasta -dbtype prot -title training.protein.fasta -parse_seqids -out training.protein.fasta\r\n$ blastp -db training.protein.fasta -query training.protein.fasta -out training.protein.fasta.out -evalue 1e-5 -outfmt 6 -num_threads 8\r\n\r\nExtract the alignments with identity >= 70% from the BLAST results; among highly similar protein sequences, keep only one\r\n$ grep -v -P \"\\t100.00\\t\" training.protein.fasta.out | perl -ane 'print if $F[2] >= 70' > blast.out\r\n$ perl delete_high_identity_proteins_in_training_gff3.pl training.protein.fasta blast.out training.gff3 > training.new.gff3\r\n\r\nObtain the non-redundant gene models\r\n$ \/opt\/biosoft\/augustus-3.0.3\/scripts\/gff2gbSmallDNA.pl training.new.gff3 genome.fasta 50 training.gb\r\n<\/pre>\n<p>Run Augustus training without 
CRF<\/p>\n<pre>\r\nInitialize the HMM files for this species\r\n$ \/opt\/biosoft\/augustus-3.0.3\/scripts\/new_species.pl --species=my_species\r\n\r\nWith RNA-Seq data, there are usually several thousand gene models usable for training. Randomly split the gene models into two sets: the first, with 300 genes, is used to test the accuracy of training; the other, with more genes, is used for Augustus training.\r\n$ \/opt\/biosoft\/augustus-3.0.3\/scripts\/randomSplit.pl training.gb 300\r\n\r\nRun Augustus training with the larger set, training.gb.train\r\n$ etraining --species=my_species training.gb.train >train.out\r\n\r\nBased on the output in train.out, correct the stop codon frequencies in the parameter file my_species_parameters.cfg\r\n$ tag=$(perl -ne 'print \"$1\" if \/tag:\\s+\\d+\\s+\\((\\S+)\\)\/' train.out)\r\n$ perl -p -i -e \"s#\/Constant\/amberprob.*#\/Constant\/amberprob                   $tag#\" \/opt\/biosoft\/augustus-3.0.3\/config\/species\/my_species\/my_species_parameters.cfg\r\n$ taa=$(perl -ne 'print \"$1\" if \/taa:\\s+\\d+\\s+\\((\\S+)\\)\/' train.out)\r\n$ perl -p -i -e \"s#\/Constant\/ochreprob.*#\/Constant\/ochreprob                   $taa#\" \/opt\/biosoft\/augustus-3.0.3\/config\/species\/my_species\/my_species_parameters.cfg\r\n$ tga=$(perl -ne 'print \"$1\" if \/tga:\\s+\\d+\\s+\\((\\S+)\\)\/' train.out)\r\n$ perl -p -i -e \"s#\/Constant\/opalprob.*#\/Constant\/opalprob                    $tga#\" \/opt\/biosoft\/augustus-3.0.3\/config\/species\/my_species\/my_species_parameters.cfg\r\n\r\nUsing the training 
results, run the first accuracy evaluation\r\n$ augustus --species=my_species training.gb.test > test.1.out\r\n\r\nThen split training.gb.train into two sets: the first with 800 genes, the second with the remaining genes. Both sets are used for optimizing the meta parameters of AUGUSTUS\r\n$ \/opt\/biosoft\/augustus-3.0.3\/scripts\/randomSplit.pl training.gb.train 800\r\n$ mv training.gb.train.train training.gb.onlytrain\r\n\r\nUse the two sets of gene models above to optimize Augustus training\r\n$ \/opt\/biosoft\/augustus-3.0.3\/scripts\/optimize_augustus.pl --species=my_species --rounds=5 --cpus=16 --kfold=16 training.gb.train.test --onlytrain=training.gb.onlytrain --metapars=\/opt\/biosoft\/augustus-3.0.3\/config\/species\/my_species\/my_species_metapars.cfg > optimize.out\r\nWith these parameters, the program optimizes the 28 parameters in my_species_metapars.cfg, for 5 rounds in total or until a round finds no parameter that can be improved. When optimizing each parameter: the 800 genes in training.gb.train.test are randomly divided into 16 equal parts; 15 parts, together with the gene models in training.gb.onlytrain, are used for training, and the remaining part is used for accuracy evaluation; every part is evaluated once in this way; 16 CPUs run the training and accuracy evaluation for the 16 parts in parallel, giving an average accuracy; each parameter being optimized is tried with 3~4 
values, each value yields a training accuracy, and the values tried for the parameter are compared to find the best one.\r\nIn addition, the training accuracy is computed as: accuracy value = (3*nucleotide_sensitivity + 2*nucleotide_specificity + 4*exon_sensitivity + 3*exon_specificity + 2*gene_sensitivity + 1*gene_specificity) \/ 15.\r\n\r\nAfter the parameters are optimized, run training once more with training.gb.train\r\n$ etraining --species=my_species training.gb.train\r\n\r\nThen run the second accuracy evaluation; after optimization, accuracy usually improves considerably\r\n$ augustus --species=my_species training.gb.test > test.2.withoutCRF.out\r\n<\/pre>\n<p>Run Augustus training with CRF (Conditional Random Field)<\/p>\n<pre>\r\nBefore training with CRF, back up the non-CRF parameter files\r\n$ cd \/opt\/biosoft\/augustus-3.0.3\/config\/species\/my_species\/\r\n$ cp my_species_exon_probs.pbl my_species_exon_probs.pbl.withoutCRF\r\n$ cp my_species_igenic_probs.pbl my_species_igenic_probs.pbl.withoutCRF\r\n$ cp my_species_intron_probs.pbl my_species_intron_probs.pbl.withoutCRF\r\n$ cd -\r\n\r\nAdd the --CRF parameter when training; training this way takes longer\r\n$ etraining --species=my_species training.gb.train --CRF=1 1>train.CRF.out 2>train.CRF.err\r\n\r\nRun the accuracy evaluation again\r\n$ augustus --species=my_species training.gb.test > test.2.CRF.out\r\nCompare the accuracy with and without CRF. Usually, CRF training 
is somewhat more accurate. If CRF training turns out to be less accurate, simply restore the backed-up parameter files.\r\n<\/pre>\n<h1>2. Training UTR parameters for Augustus<\/h1>\n<p>Training UTR parameters helps Augustus make use of exonpart hints. Using exonpart hints markedly improves gene structure prediction.<\/p>\n<p>First, obtain the gene models to train on; these gene models must have both a 5&#8217;UTR and a 3&#8217;UTR.<\/p>\n<pre>\r\n$ mkdir utr\r\n$ cd utr\r\n\r\nFrom the data used for Augustus training, extract the gene models that have both a 5'UTR and a 3'UTR.\r\n$ perl -ne 'print \"$2\\t$1\\n\" if \/.*\\t(\\S+UTR)\\t.*ID=(\\S+).utr\/' ..\/training.new.gff3 | sort -u | perl -ane 'print \"$F[0]\\n\" if ($g eq $F[0]); $g = $F[0];' > bothutr.lst\r\n$ perl -e 'open IN, \"bothutr.lst\"; while (&lt;IN&gt;) {chomp; $keep{$_}=1} $\/=\"\/\/\\n\"; while (&lt;&gt;) { if (\/gene=\\\"(\\S+?)\\\"\/ && exists $keep{$1}) {print} }' ..\/training.gb.test > training.gb.test\r\n$ perl -e 'open IN, \"bothutr.lst\"; while (&lt;IN&gt;) {chomp; $keep{$_}=1} $\/=\"\/\/\\n\"; while (&lt;&gt;) { if (\/gene=\\\"(\\S+?)\\\"\/ && exists $keep{$1}) {print} }' ..\/training.gb.train > training.gb.train\r\n\r\nTrain the UTR parameters with the gene models in training.gb.train\r\n$ \/opt\/biosoft\/augustus-3.0.3\/scripts\/randomSplit.pl training.gb.train 400\r\n$ mv training.gb.train.train training.gb.onlytrain\r\n$ optimize_augustus.pl --species=my_species --cpus=16 --rounds=5 --kfold=16 --UTR=on --trainOnlyUtr=1 --onlytrain=training.gb.onlytrain --metapars=\/opt\/biosoft\/augustus-3.0.3\/config\/species\/my_species\/my_species_metapars.utr.cfg 
training.gb.train.test > optimize.out\r\n\r\nCompare the accuracy without CRF and with CRF, and keep the parameter files that give the higher accuracy\r\n$ etraining --species=my_species --UTR=on training.gb.train\r\n$ augustus --species=my_species --UTR=on training.gb.test > test.withoutCRF.out\r\n$ etraining --species=my_species --UTR=on training.gb.train --CRF=1\r\n$ augustus --species=my_species --UTR=on training.gb.test > test.CRF.out\r\n<\/pre>\n<h1>3. Creating hints from RNA-Seq data with TopHat<\/h1>\n<p>First use TopHat to align the RNA-Seq data to the repeat-masked genome, which gives a BAM file. Then convert the BAM file into intron and exonpart hints.<\/p>\n<pre>\r\nGenerate the intron hints\r\n$ \/opt\/biosoft\/augustus-3.0.3\/bin\/bam2hints --in=tophat.bam --out=hints.intron.gff --maxgenelen=30000 --intronsonly\r\n\r\nGenerate the exonpart hints\r\n$ \/opt\/biosoft\/augustus-3.0.3\/bin\/bam2wig tophat.bam tophat.wig\r\n$ cat tophat.wig | \/opt\/biosoft\/augustus-3.0.3\/scripts\/wig2hints.pl --width=10 --margin=10 --minthresh=2 --minscore=4 --prune=0.1 --src=W --type=ep --UCSC=unstranded.track --radius=4.5 --pri=4 --strand=\".\" > hints.exonpart.gff\r\n\r\n$ cat hints.intron.gff hints.exonpart.gff > hints.rnaseq.gff\r\n<\/pre>\n<h1>4. 
Creating hints from Proteins with Exonerate<\/h1>\n<p>First obtain protein sequences from closely related species, then use tblastn to align these protein sequences to the genome, then align the sequences with high similarity to the genome with exonerate, and parse the exonerate results to obtain the hints.<\/p>\n<pre>\r\nAlign the protein sequences to a genome in which both repeats and the regions matched by transcriptome sequences have been masked. The resulting hints then fall mainly on genes not expressed in the transcriptome, which avoids conflicts with the transcriptome hints and avoids erroneous hints that join 2 adjacent genes.\r\n$ makeblastdb -in genome.estMasked.fa -dbtype nucl -title genome -parse_seqids -out genome\r\n$ blast.pl tblastn genome proteins.fasta 1e-8 24 blast 5\r\n\r\nParse the BLAST results and select the protein sequences with high similarity for exonerate analysis\r\n$ perl tblastn_xml_2_exonerate_commands.pl --exonerate_percent 50 --exonerate_maxintron 20000 --cpu 24 blast.xml proteins.fasta genome.estMasked.fa\r\n\r\nConvert the exonerate results into a hints file\r\n$ \/opt\/biosoft\/augustus-3.0.3\/scripts\/exonerate2hints.pl --in=exonerate.out --maxintronlen=20000 --source=P --minintron=31 --out=hints.protein.gff\r\n<\/pre>\n<h1>5. 
Creating hints from RepeatMasker output<\/h1>\n<p>Transposon repeat regions are unlikely to contain coding regions, so they can serve as nonexonpart hints. Converting the RepeatMasker results into hints of this type helps exon prediction. This works better than running Augustus gene prediction on a repeat-masked genome.<\/p>\n<pre>\r\nDo not mask the 3 classes Simple_repeat, Low_complexity and Unknown, because these regions are relatively more likely to contain CDS.\r\n$ grep -v -P \"position|begin|Unknown|Simple_repeat|Low_complexity\" genome.repeat.out | perl -pe 's\/^\\s*$\/\/' | perl -ne 'chomp; s\/^\\s+\/\/; @t = split(\/\\s+\/); print $t[4].\"\\t\".\"repmask\\tnonexonpart\\t\".$t[5].\"\\t\".$t[6].\"\\t0\\t.\\t.\\tsrc=RM\\n\";' > hints.repeats.gff\r\n<\/pre>\n<h1>6. 
Run Augustus predictions in parallel<\/h1>\n<p>Using hints for Augustus gene prediction is recommended: gene predictions in regions covered by hints become much more accurate. The more accurate the hints and the more of the genome they cover, the more accurate the gene predictions.<br \/>\nThere are many kinds of hints: manually determined hints; intron and exonpart hints derived from RNA-Seq data; intron and CDSpart hints derived from proteins; and nonexonpart hints (source RM) derived from RepeatMasker.<br \/>\nEach kind of hint needs matching parameters to guide gene prediction. These parameters mainly tell augustus how to reward or penalize the score for hints of each type. The related parameter files are in the folder \/opt\/biosoft\/augustus-3.0.3\/config\/extrinsic\/.<\/p>\n<p>A recommended configuration file is as follows:<\/p>\n<pre>\r\n[SOURCES]\r\nM RM P E W\r\n[SOURCE-PARAMETERS]\r\n[GENERAL]\r\n      start             1             1  M    1  1e+100  RM    1      1    P    1       1    E    1       1    W 1    1\r\n       stop             1             1  M    1  1e+100  RM    1      1    P    1       1    E    1       1    W 1    1\r\n        tss             1             1  M    1  1e+100  RM    1      1    P    1       1    E    1       1    W 1    1\r\n        tts             1             1  M    1  1e+100  RM    1      1    P    1       1    E    1       1    W 1    1\r\n        ass             1             1  M    1  1e+100  RM    1      1    P    1       1    E    1       1    W 1    1\r\n       
 dss             1             1  M    1  1e+100  RM    1      1    P    1       1    E    1       1    W 1    1\r\n   exonpart             1          .992  M    1  1e+100  RM    1      1    P    1       1    E    1       1    W 1    1.005\r\n       exon             1             1  M    1  1e+100  RM    1      1    P    1       1    E    1       1    W 1    1\r\n intronpart             1             1  M    1  1e+100  RM    1      1    P    1       1    E    1       1    W 1    1\r\n     intron             1           .34  M    1  1e+100  RM    1      1    P    1    1000    E    1     1e5    W 1    1\r\n    CDSpart             1       1 0.985  M    1  1e+100  RM    1      1    P    1     1e5    E    1       1    W 1    1\r\n        CDS             1             1  M    1  1e+100  RM    1      1    P    1       1    E    1       1    W 1    1\r\n    UTRpart             1       1 0.985  M    1  1e+100  RM    1      1    P    1       1    E    1       1    W 1    1\r\n        UTR             1             1  M    1  1e+100  RM    1      1    P    1       1    E    1       1    W 1    1\r\n     irpart             1             1  M    1  1e+100  RM    1      1    P    1       1    E    1       1    W 1    1\r\nnonexonpart             1             1  M    1  1e+100  RM    1   1.01    P    1       1    E    1       1    W 1    1\r\n  genicpart             1             1  M    1  1e+100  RM    1      1    P    1       1    E    1       1    W 1    1\r\n<\/pre>\n<p>A brief explanation of the configuration file above:<\/p>\n<pre>\r\nBefore reading this configuration file, note that column 9 of a hints file contains information such as src=P. Augustus uses this to determine the source of each hint, and then uses the corresponding parameters in the extrinsic configuration file to process the contents of the hints 
file.\r\n\r\n[SOURCES] sets the sources of the hints\r\nM RM P E W  the different hint sources are separated by spaces\r\nM : manually anchored hints\r\nRM : hints obtained from RepeatMasker\r\nP : hints derived from proteins\r\nE : hints derived from ESTs\r\nW : hints from wiggle track coverage analysis of RNA-Seq data\r\n\r\n[SOURCE-PARAMETERS] sets parameters for the hint sources\r\nE 1group1gene  Column 1 is the hint source; column 2 specifies how hints from that source are handled. There are 2 handling modes:\r\nindividual_liability:\r\n    A group may span several lines of the hints file, i.e. consist of several hints (for example, 1 gene with several introns). By default, when one of these hints is incorrect, the whole group is ignored. With this parameter, only the incorrect hints are ignored.\r\n1group1gene:\r\n    Try to predict 1 gene per group. Suitable for hints made from fairly complete transcript sequences.\r\n\r\n[GENERAL]  parameter settings for hints of each type and source\r\nThe first 3 columns set the bonus and malus factors for each hint type:\r\nColumn 1\r\n    The hint type. When the hints file contains no hints of this type, all the values for the different sources that follow are set to 1.\r\nColumn 2\r\n    
Bonus factor. When a hint of this type is fully matched, a bonus is applied. The bonus is determined jointly by this value and the settings for the corresponding hint source.\r\nColumn 3\r\n    Malus (penalty) factor. For every base that does not match, the score is multiplied by this factor. With 100 unmatched bases the factor is this value to the power of 100, so the value is usually set slightly below 1. The smaller the value, the higher the specificity of gene prediction.\r\n\r\nThe remaining columns give the bonus and malus factors for hints from each source; the settings for each source take 3 columns:\r\nColumn 1\r\n    The hint source, matching the src value in the hints file\r\nColumn 2\r\n    Hint scores (column 6 of the hints file) are graded, and this value is the number of grades. A value of 1 means no grading: when all bases of a hint match, a bonus is applied, and the bonus factor is multiplied by the value in column 3. A value greater than 1 means the scores in column 6 of the hints file are divided into several grades. If column 6 of the hints file holds no score, this value must be set to 1.\r\nColumn 3\r\n    If column 2 is 1, 
this column holds a single value, by which the bonus factor is multiplied; if column 2 is not 1, this column splits into 2 groups of values. For example:\r\n    D    8     1.5  2.5  3.5  4.5  5.5  6.5  7.5  0.58  0.4  0.2  2.9  0.87  0.44 0.31  7.3\r\n    Column 1 is D, indicating hints from DIALIGN;\r\n    Column 2 is 8, meaning the bonus is divided into 8 grades according to column 6 of the hints file;\r\n    Because column 2 is not 1, column 3 splits into 2 groups. The first group is 7 values, the boundaries that divide the hint scores into 8 grades; the second group is the bonus multipliers for these 8 grades.\r\n<\/pre>\n<p>Once the 3 files hints.gff, extrinsic.cfg and genome.fasta are in place and the HMM files have been trained, Augustus prediction can begin. For a large genome, it is best to run the computation in parallel:<\/p>\n<pre>\r\nSplit the contents of hints.gff and genome.fasta. Whole sequences are never cut. The program sorts the genome sequences by length and writes them into a series of split FASTA files. Whenever the total sequence length written to one FASTA file exceeds the set value, the next sequence goes into the next FASTA file. The matching hints are likewise written to the corresponding hints 
file.\r\n$ perl split_hints_and_scaffolds_for_augustus.pl --minsize 1000000 --output split genome.fasta hints.gff\r\n\r\nRun in parallel\r\n$ for x in `ls split\/*.fa | perl -pe 's\/.*\\\/\/\/; s\/.fa\/\/' | sort -k 1.14n`\r\ndo\r\n    echo \"augustus --species=my_species --extrinsicCfgFile=extrinsic.cfg --alternatives-from-evidence=true --hintsfile=split\/$x.hints --allow_hinted_splicesites=atac --gff3=on --UTR=on split\/$x.fa > split\/$x.out\" >> command_augustus.list\r\ndone\r\n$ ParaFly -c command_augustus.list -CPU 12\r\n \r\nMerge the results\r\n$ for x in `ls split\/*.out | sort -k 1.20n`\r\ndo\r\n    cat $x >> aug.out\r\ndone\r\n\r\n$ join_aug_pred.pl aug.out > aug.gff3\r\n<\/pre>\n","protected":false},"excerpt":{"rendered":"<p>1. Augustus training First, you need a dataset of at least 200 complete gene &hellip; <a href=\"http:\/\/www.chenlianfu.com\/?p=2307\">Continue reading <span 
class=\"meta-nav\">&rarr;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[3],"tags":[72],"_links":{"self":[{"href":"http:\/\/www.chenlianfu.com\/index.php?rest_route=\/wp\/v2\/posts\/2307"}],"collection":[{"href":"http:\/\/www.chenlianfu.com\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/www.chenlianfu.com\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/www.chenlianfu.com\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"http:\/\/www.chenlianfu.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=2307"}],"version-history":[{"count":1,"href":"http:\/\/www.chenlianfu.com\/index.php?rest_route=\/wp\/v2\/posts\/2307\/revisions"}],"predecessor-version":[{"id":2308,"href":"http:\/\/www.chenlianfu.com\/index.php?rest_route=\/wp\/v2\/posts\/2307\/revisions\/2308"}],"wp:attachment":[{"href":"http:\/\/www.chenlianfu.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=2307"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/www.chenlianfu.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=2307"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/www.chenlianfu.com\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=2307"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}