Principles of positive selection analysis of genes

1. Purpose of positive selection analysis:

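Positive selection analysis compares the codon sequences of a gene pairwise and calculates dN/dS, also called the omega (ω) value. If ω < 1, the gene is under purifying selection; if ω = 1, it is evolving neutrally; if ω > 1, it is under positive selection. When the sequences of a gene from two species are compared and ω > 1, the gene was under positive selection during species evolution, that is, while the ancestral species diverged into these two species. Positive selection analysis of two species or sequences is therefore fairly simple. In practice, however, the analysis usually covers many species from several clades, and it becomes more complicated.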
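The pairwise interpretation above can be sketched in a few lines of Python. This is only an illustration: the function name `classify_omega` and the tolerance `tol` are hypothetical, and the dN and dS values are assumed to have been estimated beforehand by some other tool. In real data ω is almost never exactly 1, so a statistical test is still needed before drawing conclusions.

```python
def classify_omega(dn: float, ds: float, tol: float = 1e-9) -> tuple:
    """Return (omega, label) for one pairwise dN and dS estimate.

    dn, ds: nonsynonymous and synonymous substitution rates (assumed given).
    tol: hypothetical tolerance for treating omega as equal to 1.
    """
    if ds <= 0:
        raise ValueError("dS must be positive for omega to be defined")
    omega = dn / ds
    if abs(omega - 1.0) <= tol:
        label = "neutral evolution"
    elif omega < 1.0:
        label = "purifying selection"
    else:
        label = "positive selection"
    return omega, label


# Made-up values for illustration only.
print(classify_omega(0.02, 0.10))
print(classify_omega(0.30, 0.10))
```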
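When the gene sequences of many species are analyzed, it is hard to answer the same question as in the two-species case, namely whether the gene was under positive selection during species evolution. The sequences may be very similar within one clade, so that pairwise comparisons there give ω <= 1, while in another clade pairwise comparisons often give ω > 1. The software can still report a single overall ω value, but that value cannot simply be used to judge whether the gene was under positive selection. So for many species there is no direct way to say whether the gene as a whole was under positive selection. An ω value, and hence a conclusion, can only be obtained from a comparison between two sequences.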
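For a gene in many species, the purpose of the analysis is therefore to compare the ancestral node and the descendant node of a particular branch (this can be understood as comparing the two groups of species on either side of a branch in an unrooted tree, which is still a pairwise comparison) and to calculate the ω value of that branch. In real data, a gene has different ω values on different branches, and different ω values at different sites of the sequence. When there are many species on both sides of the target branch, ω can be estimated for every site of the sequence, which identifies positively selected sites. Positive selection analysis of a gene in many species therefore has to consider both the ω value of the target branch and the ω values of the sequence sites in order to judge whether the gene is under positive selection pressure.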

2. Three methods for positive selection analysis of genes with PAML:

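PAML site model: mainly used to detect positively selected sites in a gene. This method assumes that ω is the same on all branches of the tree and compares two models: (1) model M1 (M1a), the null model, in which every site has ω < 1 or ω = 1; (2) model M2 (M2a), the positive selection model, in which sites may have ω < 1, ω = 1 or ω > 1. The difference between the log-likelihood values (lnL) of the two models is tested with a chi-square test (2 degrees of freedom) to obtain a p-value. If p < 0.05, the null model is rejected and positively selected sites are considered to exist. Comparing models M7 and M8 is also recommended: M7 divides ω into 10 classes between 0 and 1, and M8 adds a further class with ω > 1. The p-values from this comparison are more lenient than those from M1 versus M2, so it detects more positively selected genes. The site model detects positively selected sites across the tree as a whole; it cannot show whether the gene is under positive selection pressure on a particular branch.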
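The likelihood ratio test for the site models can be sketched as follows. This is a minimal sketch, not PAML's own code: the helper names `chi2_sf` and `lrt_pvalue` are hypothetical, the lnL values are made up, and the chi-square tail probability uses the closed forms that exist for 1 and 2 degrees of freedom so that no external library is needed.

```python
import math


def chi2_sf(x: float, df: int) -> float:
    """Upper-tail probability of a chi-square variable (closed forms, df 1 or 2)."""
    if x <= 0:
        return 1.0
    if df == 1:
        return math.erfc(math.sqrt(x / 2.0))
    if df == 2:
        return math.exp(-x / 2.0)
    raise ValueError("this sketch only handles df = 1 or df = 2")


def lrt_pvalue(lnl_null: float, lnl_alt: float, df: int) -> tuple:
    """Return (2 * delta lnL, p-value) for a null and an alternative model."""
    stat = 2.0 * (lnl_alt - lnl_null)
    return stat, chi2_sf(stat, df)


# Made-up lnL values for M1a (null) and M2a (alternative), df = 2.
stat, p = lrt_pvalue(lnl_null=-1000.0, lnl_alt=-995.0, df=2)
print(f"2*delta lnL = {stat:.2f}, p = {p:.4g}")
```

The same function would apply to an M7 versus M8 comparison, again with 2 degrees of freedom, by passing the lnL values of those two models.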
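PAML branch-site model: mainly used to detect whether a gene has positively selected sites on a particular branch. The method lets the target (foreground) branch differ in ω from all other (background) branches, which are treated alike, and then tests for positively selected sites. Again two models are compared: (1) the first model is model 2 (branch-site model A), which divides ω into three classes, ω < 1, ω = 1 and ω > 1, as in the site model; (2) the second model is identical except that the ω of the positive selection class is fixed at 1, and it serves as the null model. The likelihood difference between the two models is tested with a chi-square test with 1 degree of freedom, and the value returned by the chi2 command is divided by 2, because the null distribution is a 50:50 mixture of a point mass at 0 and a chi-square distribution with 1 degree of freedom; this matches the critical values of 2.71 (5%) and 5.41 (1%) quoted in section 4. If p < 0.05, the Bayes Empirical Bayes (BEB) method is used to calculate the posterior probabilities of positively selected sites; if any site has a posterior probability > 0.95, the gene is considered to be under positive selection pressure on the target branch. In branch-site mode PAML does not report an ω value for the branch. So although the branch-site model allows the target branch to have a different ω, it still focuses on the ω of sites. Note that positively selected sites may be detected in branch-site mode while the ω of the target branch as a whole is still below 1. Perhaps for this reason the software's author chose not to report the branch ω, so that it would not sway some readers' interpretation of the positive selection result.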
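The branch-site decision rule can be sketched like this. The sketch assumes the 50:50 mixture null distribution described in the PAML documentation (half of a chi-square tail with 1 degree of freedom); the function names, the lnL values and the BEB site list are all hypothetical, and parsing the BEB table out of codeml's output is left out.

```python
import math


def branch_site_pvalue(lnl_null: float, lnl_alt: float) -> tuple:
    """Return (2 * delta lnL, p-value) under the 50:50 mixture null."""
    stat = max(0.0, 2.0 * (lnl_alt - lnl_null))
    if stat == 0.0:
        return stat, 1.0
    # Half of the chi-square (df = 1) upper tail.
    return stat, 0.5 * math.erfc(math.sqrt(stat / 2.0))


def is_positively_selected(lnl_null, lnl_alt, beb_sites,
                           alpha=0.05, min_posterior=0.95):
    """beb_sites: list of (codon position, posterior probability) pairs."""
    _, p = branch_site_pvalue(lnl_null, lnl_alt)
    strong = [site for site in beb_sites if site[1] > min_posterior]
    return (p < alpha and len(strong) > 0), p, strong


# Made-up values for illustration only.
beb = [(12, 0.62), (45, 0.97), (101, 0.991)]
selected, p, sites = is_positively_selected(-2500.0, -2497.0, beb)
print(selected, round(p, 5), sites)
```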
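PAML branch model: mainly used to test whether ω on a particular branch is significantly higher than on the background branches, that is, whether the gene has evolved faster on the target branch. This method assumes that ω is the same at all sites of the gene sequence and compares two models: (1) the first model is the null model, in which all branches share one ω; (2) the second model gives the target branch its own ω, while all other branches share one ω. The likelihood difference between the two models is tested with a chi-square test (1 degree of freedom) to obtain a p-value. If p <= 0.05 and the ω of the target branch is higher than the background value, the gene is considered a rapidly evolving gene. In general, the p-value from this method is lower than the one from the second method (the branch-site model).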
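A minimal sketch of the branch-model rule, assuming the lnL values of the one-ratio and two-ratio models and the two ω estimates have already been read from codeml's output. The function name and all numbers are hypothetical.

```python
import math


def is_rapidly_evolving(lnl_one_ratio: float, lnl_two_ratio: float,
                        omega_fg: float, omega_bg: float,
                        alpha: float = 0.05) -> tuple:
    """Return (decision, p-value) for the one-ratio vs two-ratio comparison."""
    stat = 2.0 * (lnl_two_ratio - lnl_one_ratio)
    # Chi-square upper tail with 1 degree of freedom (closed form).
    p = math.erfc(math.sqrt(stat / 2.0)) if stat > 0 else 1.0
    # Both conditions are required: a significant LRT and a higher
    # omega on the foreground branch than on the background branches.
    return (p <= alpha and omega_fg > omega_bg), p


# Made-up values for illustration only.
print(is_rapidly_evolving(-3000.0, -2996.0, omega_fg=0.8, omega_bg=0.2))
```

Requiring the foreground ω to exceed the background ω matters because the test itself is two-sided: a significantly lower foreground ω would also reject the null model.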

3. Other notes

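The advantage of the branch-site model over the site model is that it accounts for different selective pressures, that is, different ω values, on different branches. The method gives the target branch its own ω; it does not estimate ω independently for every branch, which in theory would be best. Doing so would make the algorithm very complex and the program extremely slow. It is also unnecessary, because positive selection analysis essentially compares two sequences, calculates dN/dS, and then looks for positively selected sites. Its result should be whether the gene is under positive selection on a given branch, and at which sites of the sequence.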

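Suppose the ω of the target branch is below 1, yet positively selected sites are found. That is, dN/dS < 1 for the gene on that branch, but dN/dS > 1 at some sites. Is the gene a positively selected gene? In my view, yes. What makes a gene a positively selected gene is positive selection at one or more of its sites. When only a few sites are under positive selection pressure while many other sites are under purifying selection, the overall ω can fall below 1. Such a gene should still be regarded as a positively selected gene.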
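A small worked example shows how this can happen. The proportions and ω values below are invented for illustration, and the gene-wide ω is approximated as the proportion-weighted mean over site classes, which is a simplification.

```python
def mean_omega(site_classes):
    """Proportion-weighted mean omega over (proportion, omega) site classes."""
    total = sum(prop for prop, _ in site_classes)
    if abs(total - 1.0) > 1e-9:
        raise ValueError("site class proportions must sum to 1")
    return sum(prop * omega for prop, omega in site_classes)


# Hypothetical gene: 95% of sites under strong purifying selection,
# 5% of sites under positive selection.
classes = [(0.95, 0.1), (0.05, 4.0)]
print(round(mean_omega(classes), 3))  # 0.95 * 0.1 + 0.05 * 4.0
```

Here the average is 0.095 + 0.2 = 0.295, well below 1, even though 5% of the sites have ω = 4.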

4. Descriptions of positive selection analysis methods in the literature

Method used to identify positively selected genes in a Science article (https://science.sciencemag.org/content/364/6446/eaav6202): To estimate the lineage-specific evolutionary rate for each branch, the Codeml program in the PAML package (version 4.8) (134) with the free-ratio model (model = 1) was run for each ortholog. Positive selection signals on genes along specific lineages were detected using the optimized branch-site model following the author’s recommendation. A likelihood ratio test (LRT) was conducted to compare a model that allowed sites to be under positive selection on the foreground branch with the null model in which sites could evolve either neutrally and under purifying selection. The p values were computed based on Chi-square statistics, and genes with p value less than 0.05 were treated as candidates that underwent positive selection. We identified PSGs at the ancestral branch of Ruminantia (table S22), the ancestral branch of Pecora (table S23), each ancestral family branch of Ruminantia (table S24), and each ancestral subfamily branch of Bovidae (table S24). We also compared the dN/dS values of Ruminantia families with outgroup mammals (fig. S52).

Method used to identify rapidly evolving genes in the same Science article (https://science.sciencemag.org/content/364/6446/eaav6202): The branch model in PAML was used, with the null model (model=0) assuming that all branches have been evolving at the same rate and the alternative model (model=2), allowing the foreground branch to evolve under a different rate. An LRT with df =1 was used to discriminate between alternative models for each ortholog in the gene set. Genes with a p value less than 0.05 and a higher ω value for the foreground than the background branches were considered as evolving with a significantly faster rate in the foreground branch.

Method used to analyze positively selected genes in the article at https://journals.plos.org/plosntds/article?id=10.1371%2Fjournal.pntd.0007463:

A calculation of mutational rate ratio ω between two gene sequences was the basis for the positive selection analysis. The ω was calculated as a ratio of nonsynonymous to synonymous mutational rates. The ratio indicates negative purifying selection (0 < ω < 1), neutral evolution (ω = 1), and positive selection (ω > 1) [54]. A set of selected genes from complete genomes was tested relative to positive selection using the maximum likelihood method using the CODEML of the PAML software package [55]. PAML version 4 [56] and its user interface PAMLX [57] were used in our study. For each analyzed gene, its maximum likelihood phylogenetic tree was used as an input tree. The CODEML offers several different codon evolutionary models, and the statistical likelihood ratio test (LRT) was used to compare the codon evolutionary model to the null model. The Bayes empirical Bayes method (BEB) [58] was then used to evaluate the posterior probability of sites considered to have been positively selected.

The CODEML models could produce different results (i.e., a list of sites under positive selection) since they calculate different parameter estimates. Site models allow ω to vary in each site (codon) within the gene. Statistical testing was required for sites with ω > 1. Two pairs of models were predominantly used since their LRTs have low false-positive rates. M1a (nearly neutral evolution) was compared to M2a (positive selection) [58,59] and M7 (beta) was compared to M8 (beta & ω) [60]. Our preliminary testing found that the two model pairs gave the same or very similar results. Therefore we chose to use the M7-M8 model pair. The M7 model is a null model that allows 10 classes of sites with a ω beta-distribution within the interval 0 ≤ ω ≤ 1. Sites with ω > 1 are not allowed. The alternative M8 model adds an eleventh class of sites with ω > 1. Each site was tested to determine the class to which it belongs. The LRT compares twice the log-likelihood difference 2Δl = 2(l1-l0) between the M7 model (log likelihood value l0) and the M8 model (log likelihood value l1) to the χ2 distribution [61]. If the twice log-likelihood difference is above a critical χ2 value, then the null model is rejected, and the positive selection is statistically significant.

A considerable disadvantage of the site models is that ω was calculated as an average over all codons of the site. Therefore, the site models are not suitable for the data where ω also varies between lineages. In contrast, the branch-site models search for positive selection in sites and pre-specifies lineages where different rates of ω may occur [62]. Sequences of lineages are a priori divided into a group of foreground lineages where positive selection may occur and group of background lineages where only purifying selection or neutral evolution occurs. We used branch-site model A, which allows four classes of sites and different setups of foreground lineages to be tested depending on the gene phylogeny. In branch-site model A, all lineages under purifying selection with a low value of ω0 belong to site class 0. Weak purifying selection and neutral evolution with ω1 near to value 1 are allowed in site class 1. In site class 2a, a proportion of class 0 sites in foreground lineages is under positive selection with ω2 > 1. Similarly, site class 2b is a proportion of class 1 sites under positive selection with ω2 > 1. The null model for LRT has ω2 = 1. Critical values of LRT (2Δl) are 2.71 at 5% and 5.41 at 1% [63]. The posterior probabilities of suggested sites under positive selection were calculated using the BEB method.
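The model pairs named in these passages correspond to a handful of settings in the codeml control file. The sketch below writes out only those settings. The option names (`model`, `NSsites`, `fix_omega`, `omega`, `seqtype`) follow the PAML manual as I understand it and should be checked against the installed version; the file names and the starting ω values are hypothetical, and a real control file needs further options (for example codon frequency and kappa settings). For the branch-site runs, the foreground branch is assumed to be marked with `#1` in the tree file.

```python
def render_ctl(seqfile, treefile, outfile, model, nssites, fix_omega, omega):
    """Render a minimal codeml control file as text (illustrative subset only)."""
    settings = [
        ("seqfile", seqfile),
        ("treefile", treefile),
        ("outfile", outfile),
        ("seqtype", 1),          # 1 = codon sequences
        ("model", model),        # 0 = one ratio across branches, 2 = labeled branches
        ("NSsites", nssites),    # site-class model(s) to run
        ("fix_omega", fix_omega),  # 1 = hold omega at the value given below
        ("omega", omega),        # starting value, or fixed value if fix_omega = 1
    ]
    return "\n".join(f"{key} = {value}" for key, value in settings) + "\n"


# Branch-site model A: alternative (omega2 estimated) and null (omega2 = 1).
alt_ctl = render_ctl("gene.phy", "gene.tree", "alt.out", 2, 2, 0, 1.5)
null_ctl = render_ctl("gene.phy", "gene.tree", "null.out", 2, 2, 1, 1)

# Site models M7 and M8 in one run.
m7_m8_ctl = render_ctl("gene.phy", "gene.tree", "m7m8.out", 0, "7 8", 0, 0.5)

print(null_ctl)
```

The lnL values reported by the alternative and null runs are what the likelihood ratio tests described in the quoted methods operate on.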

陈连福的生信博客 (Chen Lianfu's Bioinformatics Blog)
