陈连福的生信博客

第22期培训班将于2024.01.27-2024.02.05期间在武汉市举办！

利用超几何概率分布做富集分析

发表于2013 年 5 月 2 日由chenlianfu

1. 了解超几何概率分布

参考：The Hypergeometric Distribution
超几何分布的公式为：

p(x) = choose(m, x) choose(n, k-x) / choose(m+n, k)
for x = 0, …, k.
其中, m 是桶里面白球的个数, n 是黑球的个数, k 是从桶中随机取出的球数, x 是取出球
中白球的个数.

Fisher‘s Exact Test就是依据超几何概率分布得到的。

2. 超几何分布运算方法

使用R的运算方法：

dhyper(x, m, n, k, log = FALSE)
    计算某一个点的p值
phyper(q, m, n, k, lower.tail = TRUE, log.p = FALSE)
    计算一个分布的p值，默认下计算P(X<=x)
qhyper(p, m, n, k, lower.tail = TRUE, log.p = FALSE)
    计算某一个p值对应的分位数
rhyper(nn, m, n, k)
    按超几何分布给出nn的可能的模拟结果值

3. 对一组p值进行校正

参考：Adjust P-values for Multiple Comparisons
使用R的运算方法：

p.adjust(p, method = p.adjust.methods, n = length(p))

p.adjust.methods
# c(“holm”, “hochberg”, “hommel”, “bonferroni”, “BH”, “BY”,
# “fdr”, “none”)

4. 利用上述原理来编写perl程序，来进行GO或PATHWAY富集分析

输入两个文件：第1个文件是各个基因对应的GO分类或KEGG分类；第2个文件是差异表达基因的名称。以下为KEGG富集分析的perl程序。

 #!/usr/bin/perl

=head1 Name

  kegg_enrich_analysis.pl

=head1 Description

  This program is designed to do enrichment analysis by kegg results.

=head1 Version

  Author: Chen Lianfu, chenllianfu@foxmail.com
  Version: 1.0,  Date: Sat Apr 27 02:06:02 2013

=head1 Usage

  --out         the output result name, default: enrichment_output
  --tmp         to keep the tmp files, default: false
  --fdr         set the false discovery rate, default: 0.05
  --kdn         When a kegg term was perpared for Fisher's exact 
test, the i(the num of this kegg terms have so many differential 
expression genes) should >= this value, default: 3
  --help        output help information to screen

=head1 Example

  perl ../kegg_enrich_analysis.pl annot_keggs genelist.txt

=cut

use strict;
use Getopt::Long;

my ($out,$tmp,$fdr_threshold,$kegg2degs_num_threshold,$Help);

GetOptions(
    "out:s"=>\$out,
    "fdr:s"=>\$fdr_threshold,
    "kdn:s"=>\$kegg2degs_num_threshold,
    "tmp"=>\$tmp,
    "help"=>\$Help
);

$out ||= 'enrichment_output';
$fdr_threshold ||= 0.05;
$kegg2degs_num_threshold ||= 3;

die `pod2text $0` if (@ARGV == 0 || $Help);

my $kegg2gene_file = shift @ARGV;
my $deglist = shift @ARGV;

open kegg2GENE, '<', $kegg2gene_file or die $!;
open DEGLIST, '<', $deglist or die $!;

my %kegg2genes;
my %gene2keggs;
while ( <kegg2GENE> ) {
    next unless /^(\S+).*?(K\d+)/;
    $gene2keggs{$1} .= "$2,";
    $kegg2genes{$2}{"genes"} .= "$1,";
    $kegg2genes{$2}{"num"} += 1;
}

my ( $total_deg_num, $deg_own_kegg_num );
my %deg_kegg2genes;
while ( <DEGLIST> ) {
    chomp;
    my $genename = $_;
    $total_deg_num ++;
    if ( $gene2keggs{$genename} ) {
        $deg_own_kegg_num ++;
        my @keggs = split /,/, $gene2keggs{$genename};

        foreach ( @keggs ) {
            $deg_kegg2genes{$_}{"genes"} .= "$genename,";
            $deg_kegg2genes{$_}{"num"} ++;
        }
    }
}

print "$total_deg_num differential expression genes in total, and 
$deg_own_kegg_num genes own kegg terms.\n";

my @for_enrich_keggs = keys %deg_kegg2genes;
open FOR_R, '>', "tmp_4value" or die $!;

my $kegg_num_for_fisher_test = @for_enrich_keggs;
print "$kegg_num_for_fisher_test kegg terms is perpared for Fisher
's exact test.\n";

foreach my $kegg ( @for_enrich_keggs ) {
    my $n_and_m = keys %gene2keggs;
    my $N = $deg_own_kegg_num;
    my $n = $kegg2genes{$kegg}{"num"};
    my $i = $deg_kegg2genes{$kegg}{"num"};
    my $m = $n_and_m -$n;

    print FOR_R "$kegg $i $n $m $N\n" if $i >= $kegg2degs_num_threshold;
}

open R_SCRIPT, '>', "phyper_adjust.R" or die $!;

print R_SCRIPT 
'phy <- read.table(file="tmp_4value")
pvalue <- phyper(phy$V2,phy$V3,phy$V4,phy$V5,lower.tail=FALSE)
qvalue <- p.adjust(pvalue,method=\'fdr\')
value <- cbind ( pvalue, qvalue )
rownames(value)=phy$V1
value
';

my @fdr = `cat phyper_adjust.R | R --vanilla --slave`;

#print @fdr;

`rm phyper_adjust.R tmp_4value` unless $tmp;

open OUTPUT, '>', $out or die $!;

foreach ( @fdr ) {
    next unless /(\S+)\s+(\S+)\s+(\S+)/;
    my $kegg = $1; my $pvalue = $2; my $fdr = $3;

    if ( $fdr <= $fdr_threshold ) {
        my $n_and_m = keys %gene2keggs;
        my $N = $deg_own_kegg_num;
        my $n = $kegg2genes{$kegg}{"num"};
        my $i = $deg_kegg2genes{$kegg}{"num"};
        my $m = $n_and_m -$n;

        my $genes = $deg_kegg2genes{$kegg}{"genes"};

        print OUTPUT "$kegg\t$i\t$N\t$n\t$n_and_m\t$pvalue\t$fdr\t$genes\n";
    }
}

发表评论取消回复

此站点使用Akismet来减少垃圾评论。了解我们如何处理您的评论数据。