1. 了解超几何概率分布
参考:The Hypergeometric Distribution
超几何分布的公式为:
p(x) = choose(m, x) choose(n, k-x) / choose(m+n, k) for x = 0, …, k. 其中, m 是桶里面白球的个数, n 是黑球的个数, k 是从桶中随机取出的球数, x 是取出球 中白球的个数.
Fisher‘s Exact Test就是依据超几何概率分布得到的。
2. 超几何分布运算方法
使用R的运算方法:
dhyper(x, m, n, k, log = FALSE)
计算某一个点的p值
phyper(q, m, n, k, lower.tail = TRUE, log.p = FALSE)
计算一个分布的p值,默认下计算P(X<=x)
qhyper(p, m, n, k, lower.tail = TRUE, log.p = FALSE)
计算某一个p值对应的分位数
rhyper(nn, m, n, k)
按超几何分布给出nn的可能的模拟结果值
3. 对一组p值进行校正
参考:Adjust P-values for Multiple Comparisons
使用R的运算方法:
p.adjust(p, method = p.adjust.methods, n = length(p)) p.adjust.methods # c(“holm”, “hochberg”, “hommel”, “bonferroni”, “BH”, “BY”, # “fdr”, “none”)
4. 利用上述原理来编写perl程序,来进行GO或PATHWAY富集分析
输入两个文件:第1个文件是各个基因对应的GO分类或KEGG分类;第2个文件是差异表达基因的名称。以下为KEGG富集分析的perl程序。
#!/usr/bin/perl
=head1 Name
kegg_enrich_analysis.pl
=head1 Description
This program is designed to do enrichment analysis by kegg results.
=head1 Version
Author: Chen Lianfu, chenllianfu@foxmail.com
Version: 1.0, Date: Sat Apr 27 02:06:02 2013
=head1 Usage
--out the output result name, default: enrichment_output
--tmp to keep the tmp files, default: false
--fdr set the false discovery rate, default: 0.05
--kdn When a kegg term was perpared for Fisher's exact
test, the i(the num of this kegg terms have so many differential
expression genes) should >= this value, default: 3
--help output help information to screen
=head1 Example
perl ../kegg_enrich_analysis.pl annot_keggs genelist.txt
=cut
use strict;
use Getopt::Long;
my ($out,$tmp,$fdr_threshold,$kegg2degs_num_threshold,$Help);
GetOptions(
"out:s"=>\$out,
"fdr:s"=>\$fdr_threshold,
"kdn:s"=>\$kegg2degs_num_threshold,
"tmp"=>\$tmp,
"help"=>\$Help
);
$out ||= 'enrichment_output';
$fdr_threshold ||= 0.05;
$kegg2degs_num_threshold ||= 3;
die `pod2text $0` if (@ARGV == 0 || $Help);
my $kegg2gene_file = shift @ARGV;
my $deglist = shift @ARGV;
open kegg2GENE, '<', $kegg2gene_file or die $!;
open DEGLIST, '<', $deglist or die $!;
my %kegg2genes;
my %gene2keggs;
while ( <kegg2GENE> ) {
next unless /^(\S+).*?(K\d+)/;
$gene2keggs{$1} .= "$2,";
$kegg2genes{$2}{"genes"} .= "$1,";
$kegg2genes{$2}{"num"} += 1;
}
my ( $total_deg_num, $deg_own_kegg_num );
my %deg_kegg2genes;
while ( <DEGLIST> ) {
chomp;
my $genename = $_;
$total_deg_num ++;
if ( $gene2keggs{$genename} ) {
$deg_own_kegg_num ++;
my @keggs = split /,/, $gene2keggs{$genename};
foreach ( @keggs ) {
$deg_kegg2genes{$_}{"genes"} .= "$genename,";
$deg_kegg2genes{$_}{"num"} ++;
}
}
}
print "$total_deg_num differential expression genes in total, and
$deg_own_kegg_num genes own kegg terms.\n";
my @for_enrich_keggs = keys %deg_kegg2genes;
open FOR_R, '>', "tmp_4value" or die $!;
my $kegg_num_for_fisher_test = @for_enrich_keggs;
print "$kegg_num_for_fisher_test kegg terms is perpared for Fisher
's exact test.\n";
foreach my $kegg ( @for_enrich_keggs ) {
my $n_and_m = keys %gene2keggs;
my $N = $deg_own_kegg_num;
my $n = $kegg2genes{$kegg}{"num"};
my $i = $deg_kegg2genes{$kegg}{"num"};
my $m = $n_and_m -$n;
print FOR_R "$kegg $i $n $m $N\n" if $i >= $kegg2degs_num_threshold;
}
open R_SCRIPT, '>', "phyper_adjust.R" or die $!;
print R_SCRIPT
'phy <- read.table(file="tmp_4value")
pvalue <- phyper(phy$V2,phy$V3,phy$V4,phy$V5,lower.tail=FALSE)
qvalue <- p.adjust(pvalue,method=\'fdr\')
value <- cbind ( pvalue, qvalue )
rownames(value)=phy$V1
value
';
my @fdr = `cat phyper_adjust.R | R --vanilla --slave`;
#print @fdr;
`rm phyper_adjust.R tmp_4value` unless $tmp;
open OUTPUT, '>', $out or die $!;
foreach ( @fdr ) {
next unless /(\S+)\s+(\S+)\s+(\S+)/;
my $kegg = $1; my $pvalue = $2; my $fdr = $3;
if ( $fdr <= $fdr_threshold ) {
my $n_and_m = keys %gene2keggs;
my $N = $deg_own_kegg_num;
my $n = $kegg2genes{$kegg}{"num"};
my $i = $deg_kegg2genes{$kegg}{"num"};
my $m = $n_and_m -$n;
my $genes = $deg_kegg2genes{$kegg}{"genes"};
print OUTPUT "$kegg\t$i\t$N\t$n\t$n_and_m\t$pvalue\t$fdr\t$genes\n";
}
}