PASA的安装,配置与主程序使用参数

1. PASA简介

PASA, acronym for Program to Assemble Spliced Alignments, is a eukaryotic genome annotation tool that exploits spliced alignments of expressed transcript sequences to automatically model gene structures, and to maintain gene structure annotation consistent with the most recently available experimental sequence data. PASA also identifies and classifies all splicing variations supported by the transcript alignments.

Note:
Combine genome and Trinity de novo RNA-Seq assemblies to generate a comprehensive transcript database.

2. PASA使用前的准备

2.1 Mysql数据库的准备

创建只读权限用户和所有权限用户各一个。

mysql> GRANT SELECT ON *.* TO 'pasa'@'%' IDENTIFIED BY '123456';
mysql> GRANT ALL ON *.* TO 'chenlianfu'@'%' IDENTIFIED BY '123456';
mysql> FLUSH PRIVILEGES;

2.1 安装perl模块

# cpan
cpan[1]> install DBD::mysql
cpan[1]> install GD

2.3 安装GMAP

$ wget http://research-pub.gene.com/gmap/src/gmap-gsnap-2013-03-31.v5.tar.gz
$ tar zxvf gmap-gsnap-2013-03-31.v5.tar.gz
$ cd gmap-2013-03-31
$ ./configure --prefix=$PWD
$ make -j 8
$ make install

2.4 安装BLAT

$ wget http://hgwdev.cse.ucsc.edu/~kent/src/blatSrc35.zip
$ unzip blatSrc35.zip
$ cd blatSrc
$ MACHTYP=x86_64
$ export MACHTYPE
$ mkdir -p ~/bin/x86_64
$ make -j 8

2.5 安装FASTA

$ wget http://faculty.virginia.edu/wrpearson/fasta/fasta3/CURRENT.tar.gz
$ tar zxvf CURRENT.tar.gz
$ cd fasta-35.4.12
$ cd src
$ make -f ../make/Makefile.linux_sse2 all
$ cd ..
$ ln -s $PWD/bin/fasta35 ~/bin/fasta

2.6 安装PASA

$ wget http://kaz.dl.sourceforge.net/project/pasa/PASA2-r20130425beta.tgz
$ tar zxvf PASA2-r20130425beta.tgz
$ cd PASA2-r20130425beta/
$ make -j 8

2.7 安装GD

安装GD需要先行安装libgd

$ wget https://bitbucket.org/libgd/gd-libgd/get/93368566388c.zip
$ unzip 93368566388c.zip
$ cd libgd-gd-libgd-93368566388c
$ ./bootstrap.sh
$ ./configure
$ make -j 8
$ sudo make install
$ gdlib-config

再安装GD

$ wget http://search.cpan.org/CPAN/authors/id/L/LD/LDS/GD-2.49.tar.gz
$ tar zxvf GD-2.49.tar.gz
$ cd GD-2.49
$ perl Makefile.PL
$ make -j 8
$ sudo make install

安装GD的目的是能通过网页来查看PASA的运行结果。

2.8 配置PASA

2.8.1. 修改PASA的配置文件$PASAHOME/pasa_conf/conf.txt

$ cp $PASAHOME/pasa_conf/pasa.CONFIG.template $PASAHOME/pasa_conf/conf.txt
$ vim $PASAHOME/pasa_conf/conf.txt

2.8.2. 该文件需要修改的地方:

PASA_ADMIN_EMAIL=(your email address)
MYSQLSERVER=(your mysql server name)   此处不能填写IP。
MYSQL_RO_USER=(mysql read-only username)
MYSQL_RO_PASSWORD=(mysql read-only password)
MYSQL_RW_USER=(mysql all privileges username)
MYSQL_RW_PASSWORD=(mysql all privileges password)
BASE_PASA_URL=http://server_name/pasa/cgi-bin/

2.8.3. 修改httpd配置文件,

# vim /etc/httpd/conf/httpd.conf
# /etc/init.d/httpd restart

在/etc/httpd/conf/httpd.conf添加如下几行:

ScriptAlias /pasa "$PASAHOME"
<Directory "$PASAHOME">
        Options MultiViews ExecCGI
        AllowOverride None
        Order allow,deny
        Allow from all
</Directory>

2.9 cleaning the transcript sequences[Optional, requires seqclean to be installed

下载两个污染数据库,为fasta文件。

$ cd $PASAHOME/seqclean
$ tar zxf seqclean.tar.gz
$ cd seqclean
$ wget ftp://ftp.ncbi.nih.gov/pub/UniVec/UniVec -O UniVec.fasta
$ wget ftp://ftp.ncbi.nih.gov/pub/UniVec/UniVec_Core -O UniVec_Core.fasta

UniVec_Core includes only oligonucleotides and vectors consisting of bacterial, phage, viral, yeast or synthetic sequences. Vectors that include sequences of mammalian origin are excluded.

3. PASA主程序的使用

PASA的主程序是: $PASAHOME/scripts/Launch_PASA_pipeline.pl, 其使用参数如下:

*代表该参数是必须的

-c <filename> *
比对配置文件。可以将$PASAHOME/pasa_conf/pasa.alignAssembly.Template.
txt复制过来,只是将其中的MYSQLDB修改成需要的mysql数据库名。

####################

spliced alignment settings:
--ALIGNERS <string>
比对的软件,可用的软件有gmap和blat。也可以同时选择使用'gmap,blat'

-N <int> default: 1
max number of top scoring alignments

--MAX_INTRON_LENGTH | -I <int>  default: 100000
max intron length parameter passed to GMAP or BLAT

--IMPORT_CUSTOM_ALIGNMENTS_GFF3 <filename>
only using the alignments supplied in the corresponding GFF3 file.

--cufflinks_gtf <filename>
incorporate cufflinks-generated transcripts

####################

actions
-C
    flag, create MYSQL database
-R
    flag, run alignment/assembly pipeline.
-A
    compare to annotated genes.
--ALT_SPLICE
    flag, run alternative splicing analysis

-R 用于比对transcripts , -A 用于和已有gff3注释文件的比较和更新;这两个参数不
能同时共用,使用不同的参数,则 -C 参数设置不同的参数文件。

####################

input files

-g <filename> *
    genome sequence FASTA file

-t <filename> *
    transcript db

-f <filename>
    file containing a list of fl-cdna accessions.

--TDN <filename>
    file containing a list of accessions corresponding to Trinity
 (full) de novo assemblies (not genome-guided)

####################

polyAdenylation site identification  ** highly recommended **
-T
    flag,transcript db were trimmed using the TGI seqclean tool.
-u <filename>
    value, transcript db containing untrimmed sequences (input to 
seqclean).a filename with a .cln extension should also exist, gen
erated by seqclean.

####################

Jump-starting or prematurely terminating
-x
    flag, print cmds only, don't process anything. (useful to get 
indices for -x or -e opts below)
-s <int>
    pipeline index to start running at (avoid rerunning searches).
-e <int>
    pipeline index where to stop running, and do not execute this 
entry. 

####################

Misc:
--TRANSDECODER
    flag, run transdecoder to identify candidate full-length coding
 transcripts
--CPU <int> default: 2
    multithreading
-d  flag, Debug 
-h  flag, print this option menu and quit

利用超几何概率分布做富集分析

1. 了解超几何概率分布

参考:The Hypergeometric Distribution
超几何分布的公式为:

p(x) = choose(m, x) choose(n, k-x) / choose(m+n, k)
for x = 0, …, k.
其中, m 是桶里面白球的个数, n 是黑球的个数, k 是从桶中随机取出的球数, x 是取出球
中白球的个数.

Fisher‘s Exact Test就是依据超几何概率分布得到的。

2. 超几何分布运算方法

使用R的运算方法:

dhyper(x, m, n, k, log = FALSE)
    计算某一个点的p值
phyper(q, m, n, k, lower.tail = TRUE, log.p = FALSE)
    计算一个分布的p值,默认下计算P(X<=x)
qhyper(p, m, n, k, lower.tail = TRUE, log.p = FALSE)
    计算某一个p值对应的分位数
rhyper(nn, m, n, k)
    按超几何分布给出nn的可能的模拟结果值

3. 对一组p值进行校正

参考:Adjust P-values for Multiple Comparisons
使用R的运算方法:

p.adjust(p, method = p.adjust.methods, n = length(p))

p.adjust.methods
# c(“holm”, “hochberg”, “hommel”, “bonferroni”, “BH”, “BY”,
# “fdr”, “none”)

4. 利用上述原理来编写perl程序,来进行GO或PATHWAY富集分析

输入两个文件:第1个文件是各个基因对应的GO分类或KEGG分类;第2个文件是差异表达基因的名称。以下为KEGG富集分析的perl程序。

 #!/usr/bin/perl

=head1 Name

  kegg_enrich_analysis.pl

=head1 Description

  This program is designed to do enrichment analysis by kegg results.

=head1 Version

  Author: Chen Lianfu, chenllianfu@foxmail.com
  Version: 1.0,  Date: Sat Apr 27 02:06:02 2013

=head1 Usage

  --out         the output result name, default: enrichment_output
  --tmp         to keep the tmp files, default: false
  --fdr         set the false discovery rate, default: 0.05
  --kdn         When a kegg term was perpared for Fisher's exact 
test, the i(the num of this kegg terms have so many differential 
expression genes) should >= this value, default: 3
  --help        output help information to screen

=head1 Example

  perl ../kegg_enrich_analysis.pl annot_keggs genelist.txt

=cut

use strict;
use Getopt::Long;

my ($out,$tmp,$fdr_threshold,$kegg2degs_num_threshold,$Help);

GetOptions(
    "out:s"=>\$out,
    "fdr:s"=>\$fdr_threshold,
    "kdn:s"=>\$kegg2degs_num_threshold,
    "tmp"=>\$tmp,
    "help"=>\$Help
);

$out ||= 'enrichment_output';
$fdr_threshold ||= 0.05;
$kegg2degs_num_threshold ||= 3;

die `pod2text $0` if (@ARGV == 0 || $Help);

my $kegg2gene_file = shift @ARGV;
my $deglist = shift @ARGV;

open kegg2GENE, '<', $kegg2gene_file or die $!;
open DEGLIST, '<', $deglist or die $!;

my %kegg2genes;
my %gene2keggs;
while ( <kegg2GENE> ) {
    next unless /^(\S+).*?(K\d+)/;
    $gene2keggs{$1} .= "$2,";
    $kegg2genes{$2}{"genes"} .= "$1,";
    $kegg2genes{$2}{"num"} += 1;
}

my ( $total_deg_num, $deg_own_kegg_num );
my %deg_kegg2genes;
while ( <DEGLIST> ) {
    chomp;
    my $genename = $_;
    $total_deg_num ++;
    if ( $gene2keggs{$genename} ) {
        $deg_own_kegg_num ++;
        my @keggs = split /,/, $gene2keggs{$genename};

        foreach ( @keggs ) {
            $deg_kegg2genes{$_}{"genes"} .= "$genename,";
            $deg_kegg2genes{$_}{"num"} ++;
        }
    }
}

print "$total_deg_num differential expression genes in total, and 
$deg_own_kegg_num genes own kegg terms.\n";

my @for_enrich_keggs = keys %deg_kegg2genes;
open FOR_R, '>', "tmp_4value" or die $!;

my $kegg_num_for_fisher_test = @for_enrich_keggs;
print "$kegg_num_for_fisher_test kegg terms is perpared for Fisher
's exact test.\n";

foreach my $kegg ( @for_enrich_keggs ) {
    my $n_and_m = keys %gene2keggs;
    my $N = $deg_own_kegg_num;
    my $n = $kegg2genes{$kegg}{"num"};
    my $i = $deg_kegg2genes{$kegg}{"num"};
    my $m = $n_and_m -$n;

    print FOR_R "$kegg $i $n $m $N\n" if $i >= $kegg2degs_num_threshold;
}

open R_SCRIPT, '>', "phyper_adjust.R" or die $!;

print R_SCRIPT 
'phy <- read.table(file="tmp_4value")
pvalue <- phyper(phy$V2,phy$V3,phy$V4,phy$V5,lower.tail=FALSE)
qvalue <- p.adjust(pvalue,method=\'fdr\')
value <- cbind ( pvalue, qvalue )
rownames(value)=phy$V1
value
';

my @fdr = `cat phyper_adjust.R | R --vanilla --slave`;

#print @fdr;

`rm phyper_adjust.R tmp_4value` unless $tmp;

open OUTPUT, '>', $out or die $!;

foreach ( @fdr ) {
    next unless /(\S+)\s+(\S+)\s+(\S+)/;
    my $kegg = $1; my $pvalue = $2; my $fdr = $3;

    if ( $fdr <= $fdr_threshold ) {
        my $n_and_m = keys %gene2keggs;
        my $N = $deg_own_kegg_num;
        my $n = $kegg2genes{$kegg}{"num"};
        my $i = $deg_kegg2genes{$kegg}{"num"};
        my $m = $n_and_m -$n;

        my $genes = $deg_kegg2genes{$kegg}{"genes"};

        print OUTPUT "$kegg\t$i\t$N\t$n\t$n_and_m\t$pvalue\t$fdr\t$genes\n";
    }
}

The Hypergeometric Distribution

Reference: The Hypergeometric Distribution

Description

Density, distribution function, quantile function and random generation for the hypergeometric distribution.

Usage

dhyper(x, m, n, k, log = FALSE)
phyper(q, m, n, k, lower.tail = TRUE, log.p = FALSE)
qhyper(p, m, n, k, lower.tail = TRUE, log.p = FALSE)
rhyper(nn, m, n, k)

dhper    计算某一个点的p值
phyper   计算一个分布的p值,默认下计算P(X<=x)
qhyper   计算某一个p值对应的分位数
rhyper   按超几何分布给出nn的可能的模拟结果值

Arguments

x, q vector of quantiles representing the number of white balls drawn without replacement from an urn which contains both black and white balls.

m the number of white balls in the urn.

n the number of black balls in the urn.

k the number of balls drawn from the urn.

p probability, it must be between 0 and 1.

nn number of observations. If length(nn) > 1, the length is taken to be the number required.

log, log.p logical; if TRUE, probabilities p are given as log(p).

lower.tail logical; if TRUE (default), probabilities are P[X ≤ x], otherwise, P[X > x].

Details

The hypergeometric distribution is used for sampling without replacement. The density of this distribution with parameters m, n and k (named Np, N-Np, and n, respectively in the reference below) is given by

p(x) = choose(m, x) choose(n, k-x) / choose(m+n, k)
for x = 0, …, k.

The quantile is defined as the smallest value x such that F(x) ≥ p, where F is the distribution function.

Value

dhyper gives the density, phyper gives the distribution function, qhyper gives the quantile function, and rhyper generates random deviates.

Invalid arguments will result in return value NaN, with a warning.

The length of the result is determined by n for rhyper, and is the maximum of the lengths of the numerical parameters for the other functions.

The numerical parameters other than n are recycled to the length of the result. Only the first elements of the logical parameters are used.

Source

dhyper computes via binomial probabilities, using code contributed by Catherine Loader (see dbinom).

phyper is based on calculating dhyper and phyper(...)/dhyper(...) (as a summation), based on ideas of Ian Smith and Morten Welinder.

qhyper is based on inversion.

rhyper is based on a corrected version of

Kachitvichyanukul, V. and Schmeiser, B. (1985). Computer generation of hypergeometric random variates. Journal of Statistical Computation and Simulation, 22, 127–145.

References

Johnson, N. L., Kotz, S., and Kemp, A. W. (1992) Univariate Discrete Distributions, Second Edition. New York: Wiley.
See Also
Distributions for other standard distributions.

Examples

m x rbind(phyper(x, m, n, k), dhyper(x, m, n, k))
all(phyper(x, m, n, k) == cumsum(dhyper(x, m, n, k))) # FALSE
## but error is very small:
signif(phyper(x, m, n, k) - cumsum(dhyper(x, m, n, k)), digits = 3)