Gene structure prediction methods

There are many methods for gene prediction. Wikipedia's list at http://en.wikipedia.org/wiki/List_of_gene_prediction_software and the site http://www.geneprediction.org provide a large number of gene prediction programs and guides.

The following introduces several of the main methods for eukaryotic gene prediction:

1. Augustus

AUGUSTUS is usually among the most accurate programs for the species it is trained for; often it is the most accurate ab initio program.

AUGUSTUS is developing quickly, with frequent releases and steadily improving functionality. It can also make good use of a hints.gff file derived from RNA-seq data during gene prediction.
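As a minimal sketch (the species name, file names, and the choice of extrinsic config file are placeholders; AUGUSTUS ships several extrinsic.*.cfg templates under its config/extrinsic directory), a hint-guided run might look like:

$ augustus --species=your_species --hintsfile=hints.gff --extrinsicCfgFile=extrinsic.M.RM.E.W.cfg --gff3=on genome.fasta > augustus.hints.gff3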

2. GeneMark-ES

GeneMark-ES is the easiest to use and is the only program that trains itself on the genome to produce the training set used for prediction. It does not support prediction with existing models; the model-based counterpart is GeneMark.hmm eukaryotic, but very few models are available for it and no method for building new ones is provided.
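A minimal self-training sketch, assuming the gmes_petap.pl wrapper shipped with recent GeneMark-ES releases (older releases use a different script name, so check the documentation of your version):

$ gmes_petap.pl --ES --sequence genome.fasta --cores 8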

3. FGENESH

FGENESH is commercial software maintained and distributed by Softberry; users can only run small jobs through the web service. Heavy use or a local installation requires payment, and after a long search I found that the official site does not offer the software for download. Running a basic gene prediction on a genome with the web version of FGENESH returned an error after a long wait: Request not processed 413 Request entity too large.

4. mGene[.web]

Although it is fairly new, this software is not yet mature and is not easy to use. Its accuracy appears to be rather high, especially in the nematode C. elegans, where it is about on par with Augustus.

5. SNAP

Ian Korf wrote this software on his own; it was published in BMC Bioinformatics in 2004 and has been cited 208 times.

The software is easy to install but has little documentation; the author only briefly describes in the README how to build an HMM model.
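For reference, the HMM-building workflow outlined in the SNAP README looks roughly like the following; genome.ann/genome.dna stand for a ZFF annotation file plus the matching FASTA, and my_species is a placeholder name:

$ fathom genome.ann genome.dna -categorize 1000         # split the annotations into training categories
$ fathom uni.ann uni.dna -export 1000 -plus             # export unique genes, all on the plus strand
$ forge export.ann export.dna                           # estimate the HMM parameters
$ hmm-assembler.pl my_species . > my_species.hmm        # assemble the parameter files into one HMM
$ snap my_species.hmm genome.dna > my_species.snap.zff  # predict genes with the new HMM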

6. Glimmer[HMM]

Using GlimmerHMM is probably no better than using SNAP.

7. geneid

The geneid training documentation is very tedious, and I did not feel like reading it.

Using PASA

The previous post covered PASA's installation, configuration, and the main program's parameters. This post focuses on several parallel ways of using PASA:

The simplest way to understand PASA: it calls blat or gmap to align transcript sequences to the genome and uses the alignments to produce gene structure annotations (as a GFF file).

1. Basic usage

Gene prediction from transcript sequences requires two (or three) input files:

The genome sequence in a multiFasta file (ie. genome.fasta)

The transcript sequences in a multiFasta file (ie. transcripts.fasta)

Optional: a file containing the list of accessions corresponding to full-length cDNAs (ie. FL_accs.txt). This is a plain-text file listing the sequence names in the second file that correspond to full-length cDNAs.

The steps are:

1.1 Clean the transcript sequences with seqclean

Use the contamination database vectors.fasta to remove possible contamination from the transcript sequences.

$ seqclean transcripts.fasta -v vectors.fasta

1.2 Run PASA

$ cp $PASA_HOME/pasa_conf/pasa.alignAssembly.Template.txt alignAssembly.config
$ perl -p -i -e 's/MYSQLDB=.*/MYSQLDB=sample_mydb_pasa/' alignAssembly.config
$ $PASA_HOME/scripts/Launch_PASA_pipeline.pl -c alignAssembly.config -C -R -g genome_sample.fasta -t all_transcripts.fasta.clean -T -u all_transcripts.fasta -f FL_accs.txt --ALIGNERS blat,gmap --CPU 2

1.3 PASA output

sample_mydb_pasa.assemblies.fasta : the PASA assemblies in FASTA format.

sample_mydb_pasa.pasa_assemblies.gff3,.gtf,.bed : the PASA assembly structures.

sample_mydb_pasa.pasa_alignment_assembly_building.ascii_illustrations.out : descriptions of alignment assemblies and how they were constructed from the underlying transcript alignments.

sample_mydb_pasa.pasa_assemblies_described.txt : tab-delimited format describing the contents of the PASA assemblies, including the identity of those transcripts that were assembled into the corresponding structure.

2. Using PASA with RNA-seq data for gene prediction

Transcript sequences can be obtained from RNA-seq data, and a transcriptome assembled from the RNA-seq reads with Trinity can be fed to PASA exactly as in the first approach. A better approach here is to use PASA to build a comprehensive transcriptome database from the de novo transcriptome and the genome.

Two (or three) input files are needed:

1. Trinity de novo RNA-Seq assemblies (ex. Trinity.fasta)
2. Trinity genome-guided RNA-Seq assemblies (ex. Trinity.GG.fasta)
3. (optionally) Cufflinks transcript structures (ex. cufflinks.gtf).

Steps:
2.1 Concatenate the Trinity.fasta and Trinity.GG.fasta files into a single transcripts.fasta file.

Trinity.GG.fasta can be the Trinity_GG.fasta file produced during the genome-guided Trinity run; if you instead use the assembly output from the final PASA step, the FASTA headers have to be modified first.

cat Trinity.fasta Trinity.GG.fasta > transcripts.fasta

2.2 Create a file containing the list of transcript accessions that correspond to the Trinity de novo assembly (full de novo, not genome-guided).

$PASA_HOME/misc_utilities/accession_extractor.pl < Trinity.fasta > tdn.accs

2.3 Run PASA using RNA-Seq related options as described in the section above, but include the parameter setting --TDN tdn.accs. To (optionally) include cufflinks-generated transcript structures, further include the parameter setting --cufflinks_gtf cufflinks.gtf. Note, cufflinks may not be appropriate for gene-dense targets, such as in fungi; cufflinks excels when applied to vertebrate genomes, so best to include when applying to mouse or human.

$ $PASA_HOME/scripts/Launch_PASA_pipeline.pl -c alignAssembly.config -t transcripts.fasta -C -R -g genome_sample.fasta --ALIGNERS blat,gmap --CPU 2 --TDN tdn.accs --cufflinks_gtf cufflinks.gtf

2.4 After completing the PASA alignment assembly, generate the comprehensive transcriptome database via:

$PASA_HOME/scripts/build_comprehensive_transcriptome.dbi -c alignAssembly.config -t transcripts.fasta --min_per_ID 95 --min_per_aligned 30

2.5 Output files
compreh_init_build/compreh_init_build.fasta : the transcript sequences
compreh_init_build/compreh_init_build.geneToTrans_mapping : the gene/transcript mapping file (for use with RSEM, Trinotate, other tools)

compreh_init_build/compreh_init_build.bed : transcript structures in bed format
compreh_init_build/compreh_init_build.gff3 : transcript structures in gff3 format

compreh_init_build/compreh_init_build.details : classifications of transcripts according to genome mapping status.
2.6 Impressions:

When building the comprehensive transcriptome database, the de novo transcripts are not actually used in the PASA run itself; what PASA runs on is Trinity_GG.fasta. In the end, the GFF file is the one you would get by running PASA on Trinity_GG.fasta alone, and the FASTA file is simply that standalone run's assembly.fasta concatenated with Trinity.fasta. Personally I do not see much point in this. Although the documentation says it benefits downstream differential expression analysis, you might as well just run PASA on Trinity_GG.fasta by itself.

3. Annotation Comparisons and Annotation Updates

Incorporating PASA Assemblies into Existing Gene Predictions, Changing Exons, Adding UTRs and Alternatively Spliced Models

The PASA software can update any preexisting set of protein-coding gene annotations to incorporate the PASA alignment evidence, correcting exon boundaries, adding UTRs, and models for alternative splicing based on the PASA alignment assemblies generated above.

The steps are as follows:

3.1 Load the protein-coding gene annotations into the database

Validate that the GFF3 file is compatible; if errors or warnings are reported, the GFF3 file needs to be fixed.

$ $PASA_HOME/misc_utilities/pasa_gff3_validator.pl orig_annotations_sample.gff3

Load the gene annotation file orig_annotations_sample.gff3 into the database. Note that you should only feed protein-coding genes to PASA using the loader.

$ $PASA_HOME/scripts/Load_Current_Gene_Annotations.dbi -c alignAssembly.config -g genome_sample.fasta -P orig_annotations_sample.gff3

This again checks compatibility; once the annotations are loaded, nothing further is returned to standard output.

3.2 Performing an annotation comparison and generating an updated gene set

Run the main PASA program with the annotation-comparison config file (set its MYSQLDB value to match the one in the alignment config file):

$PASA_HOME/scripts/Launch_PASA_pipeline.pl -c annotCompare.config -A -g genome_sample.fasta -t all_transcripts.fasta.clean

3.3 Output

Once the annotation comparison is complete, PASA will output a new GFF3 file that contains the PASA-updated version of the genome annotation, including those gene models successfully updated by PASA, and those that remained untouched. This file will be named ${mysql_db}.gene_structures_post_PASA_updates.$pid.gff3, where $pid is the process ID for this annotation comparison computation.

Of course, you can also take another look at the PASA Web Portal; the report will have changed.

4. Identification and Classification of All Alternative Splicing Variations

Alternative splicing prediction can be performed while PASA aligns the transcripts to the genome by adding the option --ALT_SPLICE.
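For example, the alignment/assembly command from section 1 simply gains this flag (a sketch reusing the same sample file names):

$ $PASA_HOME/scripts/Launch_PASA_pipeline.pl -c alignAssembly.config -C -R -g genome_sample.fasta -t all_transcripts.fasta.clean --ALIGNERS blat,gmap --ALT_SPLICE --CPU 2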

5. Extraction of ORFs from PASA assemblies (reference ORFs for training gene predictors)

PASA can automatically extract protein-coding gene structures from its assemblies to build a training gene set for other gene predictors. To do this, the longest ORF (min 100 amino acids, configurable) is found within each PASA alignment assembly, allowing for 5′ and 3′ partials. The top 500 longest ORFs (configurable) are extracted and, using these in addition to the full set of genome sequences, log-likelihood scores for a hexamer-based Markov model are computed. All ORFs are scored using this Markov model, and those ORFs that are most likely coding are reported as best. The scoring method for ORF identification is similar to that used in the GeneID software and nicely described in the methods section of the GeneID publication.

After PASA has been run, the ORFs can be extracted:

$ export PERL5LIB=$PERL5LIB:$PASAHOME/PerlLib/
$ $PASAHOME/scripts/pasa_asmbls_to_training_set.dbi -M "$mysql_db_name:$mysql_server_name" -p "$user:$password" -g genome_sample.fasta

Output:
trainingSetCandidates.cds,.pep,.gff3 :corresponds to all longest ORFs of minimum length extracted from the PASA alignment assemblies.

trainingSetCandidates.cds.top_500_longest :corresponds to the 500 top longest ORFs, thought to be most likely to represent correct protein-coding genes.

trainingSetCandidates.cds.scores :scores for all ORFs based on the Markov model.

best_candidates.cds,.pep,.gff3,.bed :the best candidates based on their coding scores compared to random sequences of the same nucleotide composition.

Genome-guided Trinity for Gene Structure Annotation

Use the genome to guide Trinity for gene structure annotation.

One of the main uses of RNA-seq is to identify the transcribed regions of a genome, reconstruct transcript structures, and at the same time identify alternatively spliced transcripts.

The current genome-based transcript reconstruction methods align the RNA-seq reads to the genome with a spliced aligner and then assemble the alignments into transcript structures (e.g. Cufflinks, Scripture). We call this approach: align-reads then assemble-alignments.

Trinity can perform de novo assembly without a reference genome (see: installing and using Trinity); it can also perform genome-guided assembly, which combines aligning the RNA-Seq reads to the genome, de novo assembly of the reads, and transcript alignment.

1. Workflow

1.1 align-reads

Use GSNAP to align the reads to the genome, then partition the genome into regions covered by reads.

1.2 assemble-reads

For each region, use Trinity to assemble the corresponding reads.

1.3 align-transcripts

Use PASA, which calls GMAP, to align the Trinity-assembled transcripts to the genome.

1.4 assemble-transcript_alignments

Use PASA to assemble the alignments from the previous step into complete transcript structures; alternatively spliced transcript structures are resolved at the same time. This step and the previous one are actually carried out in the same PASA run.

2. Required software

Trinity
GSNAP & GMAP
PASA

3. Running the pipeline

Below, we describe the steps required for running the genome-guided Trinity-based transcript reconstruction pipeline. The settings used here suit fungal species, which have high gene density.

3.1 Align RNA-Seq reads to the genome

$ $TRINITY_HOME/util/alignReads.pl --seqType fq --left reads.left.fq --right reads.right.fq --target genome.fasta --aligner gsnap -- -t 8
$ samtools view gsnap_out/gsnap.coordSorted.bam > gsnap.coordSorted.sam

3.2 Assemble the aligned reads using Trinity

$ $TRINITY_HOME/util/prep_rnaseq_alignments_for_genome_assisted_assembly.pl --SS_lib_type FR --coord_sorted_SAM gsnap.coordSorted.sam -I 1000000
$ find Dir_* -name "*reads" > read_files.list
$ $TRINITY_HOME/util/GG_write_trinity_cmds.pl --reads_list_file read_files.list --paired --SS --jaccard_clip > trinity_GG.cmds
$ $TRINITY_HOME/Inchworm/bin/ParaFly -c trinity_GG.cmds -CPU 6 -failed_cmds trinity_GG.cmds.failed -v
$ find Dir_*  -name "*inity.fasta" -exec cat {} + | $TRINITY_HOME/util/inchworm_accession_incrementer.pl > Trinity_GG.fasta

3.3 Align and assemble the Trinity-reconstructed transcripts using the PASA pipeline

$ cp $PASA_HOME/pasa_conf/pasa.alignAssembly.Template.txt alignAssembly.config
$ perl -p -i -e 's/MYSQLDB=.*/MYSQLDB=sample_mysql_database/' alignAssembly.config
$ $PASA_HOME/scripts/Launch_PASA_pipeline.pl -c alignAssembly.config -C -R -g genome.fasta -t Trinity_GG.fasta --ALIGNERS blat,gmap --transcribed_is_aligned_orient --stringent_alignment_overlap 30.0

PASA installation, configuration, and main program parameters

1. Introduction to PASA

PASA, acronym for Program to Assemble Spliced Alignments, is a eukaryotic genome annotation tool that exploits spliced alignments of expressed transcript sequences to automatically model gene structures, and to maintain gene structure annotation consistent with the most recently available experimental sequence data. PASA also identifies and classifies all splicing variations supported by the transcript alignments.

Note:
Combine genome and Trinity de novo RNA-Seq assemblies to generate a comprehensive transcript database.

2. Preparation before using PASA

2.1 Preparing the MySQL database

Create one read-only user and one user with all privileges.

mysql> GRANT SELECT ON *.* TO 'pasa'@'%' IDENTIFIED BY '123456';
mysql> GRANT ALL ON *.* TO 'chenlianfu'@'%' IDENTIFIED BY '123456';
mysql> FLUSH PRIVILEGES;
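To verify that both accounts work, a quick sanity check might be (your_mysql_server is a placeholder for the host name that will later go into conf.txt):

$ mysql -h your_mysql_server -u pasa -p123456 -e 'SHOW DATABASES;'
$ mysql -h your_mysql_server -u chenlianfu -p123456 -e 'SHOW DATABASES;'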

2.2 Installing Perl modules

# cpan
cpan[1]> install DBD::mysql
cpan[1]> install GD

2.3 Installing GMAP

$ wget http://research-pub.gene.com/gmap/src/gmap-gsnap-2013-03-31.v5.tar.gz
$ tar zxvf gmap-gsnap-2013-03-31.v5.tar.gz
$ cd gmap-2013-03-31
$ ./configure --prefix=$PWD
$ make -j 8
$ make install

2.4 Installing BLAT

$ wget http://hgwdev.cse.ucsc.edu/~kent/src/blatSrc35.zip
$ unzip blatSrc35.zip
$ cd blatSrc
$ MACHTYPE=x86_64
$ export MACHTYPE
$ mkdir -p ~/bin/x86_64
$ make -j 8

2.5 Installing FASTA

$ wget http://faculty.virginia.edu/wrpearson/fasta/fasta3/CURRENT.tar.gz
$ tar zxvf CURRENT.tar.gz
$ cd fasta-35.4.12
$ cd src
$ make -f ../make/Makefile.linux_sse2 all
$ cd ..
$ ln -s $PWD/bin/fasta35 ~/bin/fasta

2.6 Installing PASA

$ wget http://kaz.dl.sourceforge.net/project/pasa/PASA2-r20130425beta.tgz
$ tar zxvf PASA2-r20130425beta.tgz
$ cd PASA2-r20130425beta/
$ make -j 8

2.7 Installing GD

Installing the GD Perl module requires libgd to be installed first.

$ wget https://bitbucket.org/libgd/gd-libgd/get/93368566388c.zip
$ unzip 93368566388c.zip
$ cd libgd-gd-libgd-93368566388c
$ ./bootstrap.sh
$ ./configure
$ make -j 8
$ sudo make install
$ gdlib-config

Then install GD:

$ wget http://search.cpan.org/CPAN/authors/id/L/LD/LDS/GD-2.49.tar.gz
$ tar zxvf GD-2.49.tar.gz
$ cd GD-2.49
$ perl Makefile.PL
$ make -j 8
$ sudo make install

GD is installed so that PASA's results can be viewed through the web interface.

2.8 Configuring PASA

2.8.1 Modify PASA's configuration file $PASAHOME/pasa_conf/conf.txt

$ cp $PASAHOME/pasa_conf/pasa.CONFIG.template $PASAHOME/pasa_conf/conf.txt
$ vim $PASAHOME/pasa_conf/conf.txt

2.8.2 Settings that need to be changed in this file:

PASA_ADMIN_EMAIL=(your email address)
MYSQLSERVER=(your mysql server name)   an IP address cannot be used here
MYSQL_RO_USER=(mysql read-only username)
MYSQL_RO_PASSWORD=(mysql read-only password)
MYSQL_RW_USER=(mysql all privileges username)
MYSQL_RW_PASSWORD=(mysql all privileges password)
BASE_PASA_URL=http://server_name/pasa/cgi-bin/

2.8.3 Modify the httpd configuration file

# vim /etc/httpd/conf/httpd.conf
# /etc/init.d/httpd restart

Add the following lines to /etc/httpd/conf/httpd.conf:

ScriptAlias /pasa "$PASAHOME"
<Directory "$PASAHOME">
        Options MultiViews ExecCGI
        AllowOverride None
        Order allow,deny
        Allow from all
</Directory>

2.9 Cleaning the transcript sequences [optional, requires seqclean to be installed]

Download the two contamination databases, both FASTA files.

$ cd $PASAHOME/seqclean
$ tar zxf seqclean.tar.gz
$ cd seqclean
$ wget ftp://ftp.ncbi.nih.gov/pub/UniVec/UniVec -O UniVec.fasta
$ wget ftp://ftp.ncbi.nih.gov/pub/UniVec/UniVec_Core -O UniVec_Core.fasta

UniVec_Core includes only oligonucleotides and vectors consisting of bacterial, phage, viral, yeast or synthetic sequences. Vectors that include sequences of mammalian origin are excluded.
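Once downloaded, either file can be passed to seqclean with -v, just as vectors.fasta was used in the PASA usage post above, for example:

$ seqclean transcripts.fasta -v UniVec_Core.fasta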

3. Using the main PASA program

PASA's main program is $PASAHOME/scripts/Launch_PASA_pipeline.pl; its parameters are as follows:

* marks a required parameter

-c <filename> *
    the alignment config file. Copy $PASAHOME/pasa_conf/pasa.alignAssembly.Template.txt
    and simply change its MYSQLDB value to the desired MySQL database name.

####################

spliced alignment settings:
--ALIGNERS <string>
aligner(s) to use; gmap and blat are available, and both can be used at once as 'gmap,blat'

-N <int> default: 1
max number of top scoring alignments

--MAX_INTRON_LENGTH | -I <int>  default: 100000
max intron length parameter passed to GMAP or BLAT

--IMPORT_CUSTOM_ALIGNMENTS_GFF3 <filename>
only using the alignments supplied in the corresponding GFF3 file.

--cufflinks_gtf <filename>
incorporate cufflinks-generated transcripts

####################

actions
-C
    flag, create MYSQL database
-R
    flag, run alignment/assembly pipeline.
-A
    compare to annotated genes.
--ALT_SPLICE
    flag, run alternative splicing analysis

-R is used for aligning the transcripts, while -A is used for comparing with and
updating an existing GFF3 annotation; the two actions cannot be used in the same
run, and a different config file is passed with -c depending on the action.

####################

input files

-g <filename> *
    genome sequence FASTA file

-t <filename> *
    transcript db

-f <filename>
    file containing a list of fl-cdna accessions.

--TDN <filename>
    file containing a list of accessions corresponding to Trinity
    (full) de novo assemblies (not genome-guided)

####################

polyAdenylation site identification  ** highly recommended **
-T
    flag, transcript db was trimmed using the TGI seqclean tool.
-u <filename>
    value, transcript db containing untrimmed sequences (input to
    seqclean). A file with a .cln extension, generated by seqclean,
    should also exist.

####################

Jump-starting or prematurely terminating
-x
    flag, print cmds only, don't process anything. (useful to get
    indices for the -s or -e opts below)
-s <int>
    pipeline index to start running at (avoid rerunning searches).
-e <int>
    pipeline index where to stop running, and do not execute this
    entry.

####################

Misc:
--TRANSDECODER
    flag, run transdecoder to identify candidate full-length coding
    transcripts
--CPU <int> default: 2
    multithreading
-d  flag, Debug 
-h  flag, print this option menu and quit

Enrichment analysis with the hypergeometric distribution

1. Understanding the hypergeometric distribution

Reference: The Hypergeometric Distribution
The probability mass function of the hypergeometric distribution is:

p(x) = choose(m, x) choose(n, k-x) / choose(m+n, k)
for x = 0, …, k.
where m is the number of white balls in the urn, n is the number of black balls,
k is the number of balls drawn at random, and x is the number of white balls among those drawn.

Fisher's exact test is derived from the hypergeometric distribution.

2. Computing hypergeometric probabilities

In R:

dhyper(x, m, n, k, log = FALSE)
    computes the probability at a single point
phyper(q, m, n, k, lower.tail = TRUE, log.p = FALSE)
    computes the cumulative probability of the distribution; by default P(X <= x)
qhyper(p, m, n, k, lower.tail = TRUE, log.p = FALSE)
    computes the quantile corresponding to a given probability
rhyper(nn, m, n, k)
    generates nn simulated values drawn from the hypergeometric distribution
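A worked example with made-up numbers: suppose 200 annotated genes belong to a pathway (white balls, m = 200), 9800 do not (black balls, n = 9800), 500 differentially expressed genes are drawn (k = 500), and 25 of them fall in the pathway (x = 25). The enrichment p-value P(X >= 25) equals P(X > 24):

$ echo 'phyper(24, 200, 9800, 500, lower.tail = FALSE)' | R --vanilla --slave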

3. Adjusting a set of p-values

Reference: Adjust P-values for Multiple Comparisons
In R:

p.adjust(p, method = p.adjust.methods, n = length(p))

p.adjust.methods
# c("holm", "hochberg", "hommel", "bonferroni", "BH", "BY",
#   "fdr", "none")
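For example, adjusting four made-up p-values with the Benjamini-Hochberg (fdr) method:

$ echo 'p.adjust(c(0.001, 0.01, 0.02, 0.05), method="fdr")' | R --vanilla --slave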

4. Using the above to write a Perl program for GO or pathway enrichment analysis

Two input files are required: the first gives the GO or KEGG classification of every gene; the second lists the names of the differentially expressed genes. Below, after a sketch of the expected input formats, is a Perl program for KEGG enrichment analysis.
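A hypothetical illustration of the two inputs, inferred from the parsing code below: each line of the first file gives a gene name followed by one KEGG KO term (a gene may appear on several lines), and the second file lists one differentially expressed gene name per line.

annot_keggs:
gene0001    K00844
gene0001    K01810
gene0002    K00873

genelist.txt:
gene0001
gene0003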

#!/usr/bin/perl

=head1 Name

  kegg_enrich_analysis.pl

=head1 Description

  This program is designed to do enrichment analysis by kegg results.

=head1 Version

  Author: Chen Lianfu, chenllianfu@foxmail.com
  Version: 1.0,  Date: Sat Apr 27 02:06:02 2013

=head1 Usage

  --out         the output result name, default: enrichment_output
  --tmp         to keep the tmp files, default: false
  --fdr         set the false discovery rate, default: 0.05
  --kdn         when a KEGG term is prepared for Fisher's exact test,
                i (the number of differentially expressed genes that
                have this KEGG term) must be >= this value, default: 3
  --help        output help information to screen

=head1 Example

  perl ../kegg_enrich_analysis.pl annot_keggs genelist.txt

=cut

use strict;
use Getopt::Long;

my ($out,$tmp,$fdr_threshold,$kegg2degs_num_threshold,$Help);

GetOptions(
    "out:s"=>\$out,
    "fdr:s"=>\$fdr_threshold,
    "kdn:s"=>\$kegg2degs_num_threshold,
    "tmp"=>\$tmp,
    "help"=>\$Help
);

$out ||= 'enrichment_output';
$fdr_threshold ||= 0.05;
$kegg2degs_num_threshold ||= 3;

die `pod2text $0` if (@ARGV == 0 || $Help);

my $kegg2gene_file = shift @ARGV;
my $deglist = shift @ARGV;

open kegg2GENE, '<', $kegg2gene_file or die $!;
open DEGLIST, '<', $deglist or die $!;

my %kegg2genes;
my %gene2keggs;
while ( <kegg2GENE> ) {
    next unless /^(\S+).*?(K\d+)/;
    $gene2keggs{$1} .= "$2,";
    $kegg2genes{$2}{"genes"} .= "$1,";
    $kegg2genes{$2}{"num"} += 1;
}

my ( $total_deg_num, $deg_own_kegg_num );
my %deg_kegg2genes;
while ( <DEGLIST> ) {
    chomp;
    my $genename = $_;
    $total_deg_num ++;
    if ( $gene2keggs{$genename} ) {
        $deg_own_kegg_num ++;
        my @keggs = split /,/, $gene2keggs{$genename};

        foreach ( @keggs ) {
            $deg_kegg2genes{$_}{"genes"} .= "$genename,";
            $deg_kegg2genes{$_}{"num"} ++;
        }
    }
}

print "$total_deg_num differential expression genes in total, and 
$deg_own_kegg_num genes own kegg terms.\n";

my @for_enrich_keggs = keys %deg_kegg2genes;
open FOR_R, '>', "tmp_4value" or die $!;

my $kegg_num_for_fisher_test = @for_enrich_keggs;
print "$kegg_num_for_fisher_test kegg terms is perpared for Fisher
's exact test.\n";

foreach my $kegg ( @for_enrich_keggs ) {
    my $n_and_m = keys %gene2keggs;
    my $N = $deg_own_kegg_num;
    my $n = $kegg2genes{$kegg}{"num"};
    my $i = $deg_kegg2genes{$kegg}{"num"};
    my $m = $n_and_m -$n;

    print FOR_R "$kegg $i $n $m $N\n" if $i >= $kegg2degs_num_threshold;
}

open R_SCRIPT, '>', "phyper_adjust.R" or die $!;

print R_SCRIPT 
'phy <- read.table(file="tmp_4value")
pvalue <- phyper(phy$V2 - 1,phy$V3,phy$V4,phy$V5,lower.tail=FALSE)  # P(X >= i) rather than P(X > i)
qvalue <- p.adjust(pvalue,method=\'fdr\')
value <- cbind ( pvalue, qvalue )
rownames(value)=phy$V1
value
';

my @fdr = `cat phyper_adjust.R | R --vanilla --slave`;

#print @fdr;

`rm phyper_adjust.R tmp_4value` unless $tmp;

open OUTPUT, '>', $out or die $!;

foreach ( @fdr ) {
    next unless /(\S+)\s+(\S+)\s+(\S+)/;
    my $kegg = $1; my $pvalue = $2; my $fdr = $3;

    if ( $fdr <= $fdr_threshold ) {
        my $n_and_m = keys %gene2keggs;
        my $N = $deg_own_kegg_num;
        my $n = $kegg2genes{$kegg}{"num"};
        my $i = $deg_kegg2genes{$kegg}{"num"};
        my $m = $n_and_m -$n;

        my $genes = $deg_kegg2genes{$kegg}{"genes"};

        print OUTPUT "$kegg\t$i\t$N\t$n\t$n_and_m\t$pvalue\t$fdr\t$genes\n";
    }
}

The Hypergeometric Distribution

Reference: The Hypergeometric Distribution

Description

Density, distribution function, quantile function and random generation for the hypergeometric distribution.

Usage

dhyper(x, m, n, k, log = FALSE)
phyper(q, m, n, k, lower.tail = TRUE, log.p = FALSE)
qhyper(p, m, n, k, lower.tail = TRUE, log.p = FALSE)
rhyper(nn, m, n, k)

dhyper   computes the probability at a single point (the density)
phyper   computes the cumulative probability; by default P(X <= x)
qhyper   computes the quantile corresponding to a given probability
rhyper   generates nn simulated values drawn from the hypergeometric distribution

Arguments

x, q vector of quantiles representing the number of white balls drawn without replacement from an urn which contains both black and white balls.

m the number of white balls in the urn.

n the number of black balls in the urn.

k the number of balls drawn from the urn.

p probability, it must be between 0 and 1.

nn number of observations. If length(nn) > 1, the length is taken to be the number required.

log, log.p logical; if TRUE, probabilities p are given as log(p).

lower.tail logical; if TRUE (default), probabilities are P[X ≤ x], otherwise, P[X > x].

Details

The hypergeometric distribution is used for sampling without replacement. The density of this distribution with parameters m, n and k (named Np, N-Np, and n, respectively in the reference below) is given by

p(x) = choose(m, x) choose(n, k-x) / choose(m+n, k)
for x = 0, …, k.

The quantile is defined as the smallest value x such that F(x) ≥ p, where F is the distribution function.

Value

dhyper gives the density, phyper gives the distribution function, qhyper gives the quantile function, and rhyper generates random deviates.

Invalid arguments will result in return value NaN, with a warning.

The length of the result is determined by n for rhyper, and is the maximum of the lengths of the numerical parameters for the other functions.

The numerical parameters other than n are recycled to the length of the result. Only the first elements of the logical parameters are used.

Source

dhyper computes via binomial probabilities, using code contributed by Catherine Loader (see dbinom).

phyper is based on calculating dhyper and phyper(...)/dhyper(...) (as a summation), based on ideas of Ian Smith and Morten Welinder.

qhyper is based on inversion.

rhyper is based on a corrected version of

Kachitvichyanukul, V. and Schmeiser, B. (1985). Computer generation of hypergeometric random variates. Journal of Statistical Computation and Simulation, 22, 127–145.

References

Johnson, N. L., Kotz, S., and Kemp, A. W. (1992) Univariate Discrete Distributions, Second Edition. New York: Wiley.
See Also
Distributions for other standard distributions.

Examples

m <- 10; n <- 7; k <- 8
x <- 0:(k+1)
rbind(phyper(x, m, n, k), dhyper(x, m, n, k))
all(phyper(x, m, n, k) == cumsum(dhyper(x, m, n, k))) # FALSE
## but error is very small:
signif(phyper(x, m, n, k) - cumsum(dhyper(x, m, n, k)), digits = 3)