Adjust P-values for Multiple Comparisons

发表于2013 年 4 月 27 日由chenlianfu

Reference: Adjust P-values for Multiple Comparisons

Description

Given a set of p-values, returns p-values adjusted using one of several methods.

Usage

p.adjust(p, method = p.adjust.methods, n = length(p))

p.adjust.methods
# c(“holm”, “hochberg”, “hommel”, “bonferroni”, “BH”, “BY”,
# “fdr”, “none”)

Arguments

p numeric vector of p-values
n number of comparisons, must be at least length(p); only set this (to non-default) when you know what you are doing!

Details

The adjustment methods include the Bonferroni correction (“bonferroni”) in which the p-values are multiplied by the number of comparisons. Less conservative corrections are also included by Holm (1979) (“holm”), Hochberg (1988) (“hochberg”), Hommel (1988) (“hommel”), Benjamini & Hochberg (1995) (“BH” or its alias “fdr”), and Benjamini & Yekutieli (2001) (“BY”), respectively. A pass-through option (“none”) is also included. The set of methods are contained in the p.adjust.methods vector for the benefit of methods that need to have the method as an option and pass it on to p.adjust.

The first four methods are designed to give strong control of the family-wise error rate. There seems no reason to use the unmodified Bonferroni correction because it is dominated by Holm’s method, which is also valid under arbitrary assumptions.

Hochberg’s and Hommel’s methods are valid when the hypothesis tests are independent or when they are non-negatively associated (Sarkar, 1998; Sarkar and Chang, 1997). Hommel’s method is more powerful than Hochberg’s, but the difference is usually small and the Hochberg p-values are faster to compute.

The “BH” (aka “fdr”) and “BY” method of Benjamini, Hochberg, and Yekutieli control the false discovery rate, the expected proportion of false discoveries amongst the rejected hypotheses. The false discovery rate is a less stringent condition than the family-wise error rate, so these methods are more powerful than the others.

Note that you can set n larger than length(p) which means the unobserved p-values are assumed to be greater than all the observed p for “bonferroni” and “holm” methods and equal to 1 for the other methods.

Value

A numeric vector of corrected p-values (of the same length as p, with names copied from p).

GFOLD的安装与使用

发表于2013 年 4 月 21 日由chenlianfu

一. GFOLD简介

GFOLD是由同济大学生物信息系张勇教授课题组研发的软件。gfold主要用于RNA-seq数据的差异表达基因分析，特别适合于实验没有重复的情况。

二. GFOLD的安装

GFOLD的安装需要安装gsl-1.15

安装gsl-1.15
$ wget ftp://ftp.gnu.org/gnu/gsl/gsl-1.15.tar.gz
$ tar zxf gsl-1.15.tar.gz
$ cd gsl-1.15
$ ./configure
$ make -j 8
$ sudo make install

安装gfold
$ wget https://bitbucket.org/feeldead/gfold/get/e78560195469.zip
$ unzip e78560195469.zip
$ cd feeldead-gfold-e78560195469
$ make
$ cd ../ ; mv feeldead-gfold-e78560195469 gfold

三. GFOLD的使用

使用gfold，发现结果没有cufflinks得出的结果好。该软件相对简单快速。不推荐使用。

他人blog优秀文章

发表于2013 年 4 月 21 日由chenlianfu

1. EasonCheng——RNA-seq 基本分析流程

只根据两个转录组数据分析差异表达基因

发表于2013 年 4 月 19 日由chenlianfu

1. 数据

手头有两个转录组数据，分别是某一真菌物种(该物种没有基因组数据)的双核菌丝阶段和子实体阶段的转录组数据。测序平台是Illumina Hiseq2000；插入片段长度200bp，测序的reads长度90bp。得到的数据文件为：

/home/user/RNA-seq/mycelium_reads1.fastq
/home/user/RNA-seq/mycelium_reads2.fastq
/home/user/RNA-seq/fruitingbody_reads1.fastq
/home/user/RNA-seq/fruitingbody_reads2.fastq

2. 数据的预处理

2.1. 去除N的比例大于5%的reads;去除低质量reads（质量值Q≤20的碱基数占整个read的50％以上）；
2.2. 根据FastQC对上述过滤后的reads的质量检测，去除reads首尾各10bp的碱基，得到的预处理数据为：

/home/user/RNA-seq/clean_reads/mycelium_reads1.fastq
/home/user/RNA-seq/clean_reads/mycelium_reads2.fastq
/home/user/RNA-seq/clean_reads/fruitingbody_reads1.fastq
/home/user/RNA-seq/clean_reads/fruitingbody_reads2.fastq

3. 使用Trinity和TGICL进行转录组的组装

3.1. 对mycelium数据和fruitingbody数据分别进行转录组的组装

$ pwd
/home/user/RNA-seq/

$ mkdir assembly
$ cd assembly

$ Trinity.pl --jaccard_clip --seqType fq --JM 50G --SS_lib_type FR --CPU 24 --inchworm_cpu 24 --bflyCPU 24 --group_pairs_distance 500 -min_contig_length 200 --output mycelium_contig --left /home/user/RNA-seq/clean_reads/mycelium_reads1.fastq --right /home/user/RNA-seq/clean_reads/mycelium_reads2.fastq
$ Trinity.pl --jaccard_clip --seqType fq --JM 50G --SS_lib_type FR --CPU 24 --inchworm_cpu 24 --bflyCPU 24 --group_pairs_distance 500 -min_contig_length 200 --output fruitingbody_contig --left /home/user/RNA-seq/clean_reads/fruitingbody_reads1.fastq --right /home/user/RNA-seq/clean_reads/fruitingbody_reads2.fastq

当然，需要将生成的两个转录组的Trinity.fasta序列按长度进行排序；统一更改的fasta
头名称；更改fasta文件名

$ cp mycelium_contig/Trinity.fasta mycelium_contigs.fasta
$ cp fruitingbody_contig/Trinity.fasta fruitingbody_contigs.fasta

3.2. 使用TGICL将两个转录组序列合并

$ pwd
/home/user/RNA-seq/assembly

$ mkdir all_tissue_contig
$ cd all_tissue_contig

$ cat ../mycelium_contigs.fasta ../fruitingbody_contigs.fasta > all.contigs.fasta
$ tgicl -F all.contigs.fasta

tgicl生成了两个有用的文件： asm_1/contigs  和 all.contigs.fasta.single
tons。其中前者是聚类后的contigs结果；后者是没有聚类的单独的contigs的序列名，需
要分别到../mycelium_contigs.fasta 和 ../fruitingbody_contigs.fasta文
件中提取出相应的序列: all.contigs.fasta.singletons.mycelium.fasta 和 
all.contigs.fasta.singletons.fruitingbody.fasta

$ cat asm_1/contigs all.contigs.fasta.singletons.mycelium.fasta all.contigs.fasta.singletons.fruitingbody.fasta > all_contigs.fasta

在对all_contigs.fasta进行fasta头的重命名，并将序列按长度排序；同时要得到all_
contigs.fasta中的序列和mycelium_contigs.fasta，fruitingbody_contigs.
fasta中序列的对应关系。

$ cp all_contigs.fasta ../

3.3. 至此得出转录组的组装结果：

/home/user/RNA-seq/assembly/all_contigs.fasta

4. 使用cufflinks来分析差异表达基因

4.1 使用tophat来将预处理后的reads比对到转录组序列上

$ pwd
/home/user/RNA-seq/

$ mkdir cufflinks
$ cd cufflinks

$ bowtie2-build ../all_contigs.fasta all_contigs

$ tophat -o tophat_out_mycelium -r 60 --mate-std-dev 80 -p 24 --library-type fr-unstranded all.contigs /home/user/RNA-seq/clean_reads/mycelium_reads1.fastq /home/user/RNA-seq/clean_reads/mycelium_reads2.fastq
$ tophat -o tophat_out_fruitingbody -r 60 --mate-std-dev 80 -p 24 --library-type fr-unstranded all.contigs /home/user/RNA-seq/clean_reads/fruitingbody_reads1.fastq /home/user/RNA-seq/clean_reads/fruitingbody_reads2.fastq

4.2 自制一个transcripts.gtf文件
由于cuffdiff运行需要一个参考序列的transcripts.gtf文件: 该文件有9列，使用tab分隔；使exon的范围为整个contig。其格式如：

AA.auricula_all_contig_1     chenlianfu  exon  1  9401  .  .  .  gene_id "A.auricula_all_contig_1"; transcript_id "A.auricula_all_contig_1";
A.auricula_all_contig_10     chenlianfu  exon  1  4464  .  .  .  gene_id "A.auricula_all_contig_10"; transcript_id "A.auricula_all_contig_10";
A.auricula_all_contig_100    chenlianfu  exon  1  3090  .  .  .  gene_id "A.auricula_all_contig_100"; transcript_id "A.auricula_all_contig_100";
A.auricula_all_contig_1000   chenlianfu  exon  1  1768  .  .  .  gene_id "A.auricula_all_contig_1000"; transcript_id "A.auricula_all_contig_1000";
A.auricula_all_contig_10000  chenlianfu  exon  1  586   .  .  .  gene_id "A.auricula_all_contig_10000"; transcript_id "A.auricula_all_contig_10000"; 
A.auricula_all_contig_10001  chenlianfu  exon  1  586   .  .  .  gene_id "A.auricula_all_contig_10001"; transcript_id "A.auricula_all_contig_10001";

4.3 使用cuffdiff来分析转录子的表达量和差异表达基因

$ pwd
/home/user/RNA-seq/cufflinks

$ cuffdiff -L mycelium,fruitingbody --library-type fr-unstranded -p 8 -o cuffdiff ./transcriptome.gtf ./tophat_out_mycelium/accepted_hits.bam ./tophat_out_fruitingbody/accepted_hits.bam

至此得出转录子的表达量数据和差异表达分析

/home/user/RNA-seq/cufflinks/cuffdiff/gene_exp.diff

基因组组装软件心得

发表于2013 年 4 月 18 日由chenlianfu

1. SOAPdenovo 和 Velvet组装基因组是否需要对reads进行预处理？

使用soapdenovo组装基因组，需要先对reads进行预处理，去掉质量不好的reads，剪掉3’端和5’端的一些碱基；然后再进行组装。而使用velvet时候，则不需要，可能是由于velvet的cutoff比较好。
这个结论得出方法：将reads进行预处理前后分别用于组装出基因组，然后再使用bowtie2将两种reads比对到两种基因组组装结果，观察reads比对上的百分比得出。reads的预处理方法是去除低质量reads并将100bp的reads前后各减去5bp碱基。
对于soapdenovo:

                      未处理reads比对率         预处理reads比对率
未处理reads基因组          65.84%                   66.10%
预处理reads基因组          66.68%                   66.99%

对于velvet:

                      未处理reads比对率         预处理reads比对率
未处理reads基因组          86.41%                   88.10%
预处理reads基因组          70.09%                   71.62%

通过以上的对比和数据可以看出：1.velvet组装出的基因组的精确性要比soapdenovo高；2.对reads进行预处理对soapdenovo组装的基因组的精确性影响较小，预处理reads后，组装出的基因组精确性稍有提高；3.使用velvet组装基因组时，不对reads进行预处理组装出的基因组精确性高很多。
讨论：1.数据量不够多，碱基覆盖度只有20~25X, 以上结论具有一定的片面性; 2.velvet在对kmer进行cutoff的算法很棒，因而组装出的基因组精确性高。

Velvet的安装和使用

发表于2013 年 4 月 16 日由chenlianfu

一. Velvet简介

Velvet 是 EBI 的 Daniel Zerbino and Ewan Birney 开发的利用short reads组装基因组的软件。
Velvet作为一种de novo基因组组装软件，可利用的reads包括 Illumina reads 和 454 reads。
引用文献(2008)：Velvet: algorithms for de novo short read assembly using de Bruijn graphs.

二. Velvet下载与安装

make ‘CATEGORIES=57’ ‘MAXKMERLENGTH=75’ ‘BIGASSEMBLY=1’ ‘LONGSEQUENCES=1’ ‘OPENMP=1’ ‘BUNDLEDZLIB=1’

linux下ISO文件的制作,刻录和挂载

发表于2013 年 4 月 16 日由chenlianfu

1. ISO文件的刻录

手头上有iso文件，比如CentOS-6.4-x86_64-bin-DVD1.iso，需要刻录成光碟，然后用于装系统用。使用 cdrecord 命令。

$ cdrecord dev=/dev/cdrw CentOS-6.4-x86_64-bin-DVD1.iso

常用参数：
-v | -verbose
speed=<int>
    指定刻录机速度，一般不用，因为cdrecord会自动检测最佳刻录速度
dev=<target>
    光驱的位置

2. ISO文件的制作

制作ISO文件可以是用 dd 或 mkisofs 命令。

2.1 dd用法

当直接从光驱中压制ISO文件时：

$ dd if=/dev/cdrw of=output.iso bs=1024

2.2 mkisofs用法

当将一个文件或目录压制到ISO文件时：

$ mkisofs -r -o output.iso folder_name/

3. ISO文件的挂载

$ sudo mount -o loop file.iso /mnt

Tophat的安装与使用

发表于2013 年 4 月 15 日由chenlianfu

一. Tophat简介

Tophat使用RNA-seq的reads数据来寻找基因的剪切点(splice junction）。该软件调用Bowtie,或Bowtie2来将reads比对到参考基因组上，分析比对结果，从而寻找出外显子之间的结合位点。

二. Tophat安装

直接下载适合于Linux x86_64的二进制文件，解压缩即可使用。

$ wget http://tophat.cbcb.umd.edu/downloads/tophat-2.0.8b.Linux_x86_64.tar.gz
$ tar zxf tophat-2.0.8b.Linux_x86_64.tar.gz

前提条件当然要安装Bowtie, Bowtie2, SAM tools, Boost C++ libraries等。

三. Tophat的使用参数

使用Tohat时，bowtie2（或bowtie，下同）, bowtie2-align, bowtie2-inspect, bowtie2-build 和 samtools 必须要在系统路径中。

1. 用法

$ tophat [options]* <index_base> <reads1_1[,...,readsN_1]> <reads1_2[,...,readsN_2]>
可以看出，tophat必须要的条件是比对的index数据库，以及要比对的reads。可以为多个
paired-end reads数据以逗号分开。

值得注意的：Tophat能比对的最大reads为1024bp；能比对paired-end reads；不能将多种不同类型的reads混合起来进行比对，这样会给出不好的结果。

如果有多种不同类型的reads进行比对，则可以：

首先，对一种类型的reads使用合适的参数运行tophat；
接着，使用bed_to_juncs将前一次的运行结果junctions.bed转换成下一次运行tophat
所的-j参数所需的junction文件；
最后，再一次使用-j参数运行tophat。

2. 常用一般参数

-h | --help
-v | --version

-N | --read-mismatches  default: 2
    丢弃不匹配碱基数超过该数目的比对结果

--read-gap-length  default: 2
    丢弃gap总长度超过该数目的比对结果

--read-edit-dist  default: 2
    丢弃read的edit distance大于该值的比对结果

--read-realign-edit-dist <int> default: "read-edit-dist" + 1
    一些跨越多个exons的reads可能会被错误地比对到geneome上。Tophat有多个比对
步骤，每个比对步骤过后，比对结果中包含了edit distance的值。该参数能让Tophat对
那些edit distance的值 >= 该参数的reads重新进行比对。若设置该参数值为0，则每个
read在多个比对步骤中每次都要进行比对。这样会加大地增加比对精确性和运行时间。默认下
该参数比上一个参数的值大，则表示对reads进行重新比对。

--bowtie1  default: bowtie2
    使用Bowtie1来代替Bowtie2进行比对。特别是使用colorspace reads时，因为只
有Bowtie1支持，而Bowtie2不支持。

-o | --output <string>  default: ./tophat_out
    输出的文件夹路径

-r | --mate-inner-dist <int>  default: 50
    成对的reads之间的平均inner距离。例如：fragments长度300bp，reads长度50bp
， 则其inner距离为200bp，该值该设为200。

--mate-std-dev <int>  default:20
    inner距离的标准偏差。

-a | --min-anchor-length <int>  default: 8
    read的锚定长度：该参数能设定的最小值为3；锚定在junction两边的reads长度只
有都大于此值，才能用于junction的验证。

-m | --splice-mismatches <int>  default: 0
    对于一个剪切比对，其在锚定区能出现的最大的不匹配碱基数。

-i | --min-intron-length <int>  default:70
    最小的intron长度。Tophat会忽略比该长度要小的donor/acceptor pairs，认
为该区属于exon。

--I | --max-intron-length <int>  default:500000
    最大的intron长度。Tophat会忽略长度大于该值的donor/acceptor pairs，除
非有long read支持。

--max-insertion-length <int>  defautl: 3
    最大的插入长度

--max-deletion-length <int>  default: 3
    最大的缺失长度

--solexa-quals
    fastq文件使用Solexa的碱基质量格式

--solexa1.3-quals | --phred64-quals
    使用Illumina GA pipeline version 1.3的碱基质量格式，即Phred64.

-Q | --quals
    说明是使用单独的碱基质量文件

--inter-quals
    有空格隔开的整数值来代表碱基质量。当使用 -C 参数时，该参数为默认参数。

-C | --color
    Colorspace reads。使用这一种reads的时候命令如下：
$ tophat --color --quals --bowtie1 [other options]* <colorspac
e_index_base> <reads1_1[,...,readsN_1]> <reads2_1[,...,readN_2]>
 <quals1_1[,...,qualsN_1]> <quals1_2[,...,qualsN_2]>

-p | --num-threads <int>  default: 1
    比对reads的线程数

-g | --max-multihits <int>  default: 20
    对于一个reads，可能会有多个比对结果，但tophat根据比对得分，最多保留的比对结
果数目。如果没有 --report-secondary-alignments 参数，则只会报告出最佳的比对
结果。若最佳比对结果数目超过该参数值，则只随机报告出该数目的最佳比对结果；若有 --
report-secondary-alignments 参数，则按得分顺序报告出比对结果，直至达到默认
的数目为止。

--report-secondary-alignments
    是否报告additional or secondary alignments（基于比对分值AS来确定的）。

--no-discordant
    对于paired reads，仅仅报告concordant mappings。

--no-mixed
    对于paired reads，只报告concordant mappings 和 discordant mappi
ngs。默认上，是所有的比对结果都报告。

--no-coverage-search
    取消以覆盖度为基础来搜寻junctions，和下一个参数对立，该参数为默认参数。
--coverage-search
    确定以覆盖度为基础来搜寻junctions。该参数能增大敏感性。

--microexon-search
    使用该参数，pipeline会尝试寻找micro-exons。仅仅在reads长度>=50bp时有效。

--library-type
    Tophat处理的reads具有链特异性。比对结果中将会有个XS标签。一般Illumina数
据的library-type为 fr-unstranded。

3. 高级参数

 --keep-tmp
    保留中间文件和临时文件，对于debug有用

--keep-fasta-order
    对比对结果按基因组fasta文件进行排序。该参数会使输出的SAM/BAM文件和tophat的
1.41或以前版本不兼容

--no-sort-bam
    输出的BAM文件不是coordinate-sorted.

--no-convert-bam
    不要转换成bam格式。输出结果为sam格式。

-R | --resume <string>
    从最末尾的成功完成点处，接着运行Tophat。使用方法为：
$ tophat -R tophat_out

-z | --zpacker  default:gzip
    用来对临时文件进行压缩的的压缩程序

4. Bowtie2的特别参数

使用tophat2的时候，其中的一些参数传递为bowtie2的参数，这些参数都以’b2’开头。其实，这些参数使用默认的即可。

end-to-end模式(Tophat2不能使用local alignment):
--b2-very-fast
--b2-fast
--b2-sensitive
--b2-very-sensitive

比对参数：
--b2-N  default: 0
--b2-L  default: 20
--b2-i  default: S,1,1.25
--b2-n-ceil  default: L,0,0.15
--b2-gbar  defaut: 4

得分参数：
--b2-mp  default: 6,2
--b2-np  default: 1
--b2-rdg  default: 5,3
--b2-rfg  default: 5,3
--b2-score-min  default: L,-0.6,-0.6

Effort参数
--b2-D  default: 15
--b2-R  default: 2

5. 融合转录子mapping

如果设定 –fusion-search 参数，则有些reads能比对到潜在的融合转录子(fusion transcripts)上。额外融合信息保存在 fusions.out 中。

--fusion-search 
    开启融合转录子的比对
--fusion-anchor-length  default: 20
    read比对到融合子的两边，每以边至少匹配的碱基数。

6. 提供的转录子的结构注释数据

值得注意的提供的GTF文中中的染色体名称和Bowtie index中的一致。这些名称是区分大小写的。

-j | --raw-juncs <.juncs file>
    提供junctions文件。该文件可以使用tophat同一目录下的程序bed_to_juncs程序
来处理tophat的结果文件junctions.bed生成。
$ bed_to_juncs <junctions.bed>
    junctions文件是tab分隔的文件，内容为：
<chrom> <leftgt> <rigth> <+/->
其中left和right数值是0-based的junction两端的值。

--no-novel-juncs
只搜寻和GFF或junctions文件中提供的junctions想匹配的reads。如果没有 -G 或 -j
 参数，则该参数无效。

-G | --GTF <GTF/GFF3 file>
提供基因模型的注释文件，GTF 2.2 或者 GFF 3 格式的文件。如果设置了该参数，Tophat
则先提取出转录子序列，然后使用Bowtie2将reads比对到提取的转录组中；只有不能比对上
的reads再比对到genome；比对上的reads再打断转变成genomic mappings；再融合新
的mappings和junctions作为最后的输出。
值得注意的是GTF/GFF文件代表chromosome和contig的第一列要和bowtie index中的
参考序列名一致。 `$ bowtie-inspect --names your_index` 命令可以获得bowt
ie的index。

--transcriptome-index <dir/prefix>
是使用了 -G 参数后，Tophat提取转录子序列，然后使用bowtie2-build来建立index，
这个过程会消耗不少时间。于是，使用该参数，会将index文件生成到指定文件夹。则后续的
运用同样的index则不再需要额外耗时了。

7. 提供insertions/deletions

以下参数是使用RNA-seq数据来验证indels。

--insertions | --deletions <.juncs file>
    juncs文件例子：
chr1 20564 20567
0-based数值。表示有20565和20566这2个碱基缺失
chr1 17491 17491 CA
表示在17491处插入了2个碱基CA

--no-novel-indels
    仅仅只搜寻在已给的位点的reads。

四. Tophat的输出结果

主要的结果文件是：
1. accepted_hits.bam
2. junctions.bed UCSC BED格式
3. insertions.bed 和 deletions.bed

五. 思考题

对一物种的两个样本A和B使用Illumina Hiseq2000分别进行了转录组测序，得到了结果文件A.reads1_1.fastq, A.reads1_2.fastq, B.reads1_1.fastq 和 B.reads1_2fastq。测序文库的插入片段长度为200bp，reads长度为90bp。物种的基因组文件为species.fasta。请用Tophat分析该物种的转录结合位点，indel信息?

$ bowtie2-build species.fastq species
$ tophat \
  --read-realign-edit-dist 0 \
  -o ./tophat_out \
  -r 20 \
  --mate-std-dev 20 \
  --coverage-search \
  --microexon-search \
  -p 24 \
  --library-type fr-unstranded \
  species \
  A.reads1_1.fastq,B.reads1_1.fastq A.reads1_2.fastq B.reads1_2fastq

myrna本地化的安装与使用

发表于2013 年 4 月 11 日由chenlianfu

一. Myrna的简介

Myrna主要用途是，使用RNA-seq数据来计算基因差异表达。
Myrna的运行步骤是：

1. 将reads进行预处理; 
2. 使用Bowtie将reads比对到参考序列上; 
3. 使用R/Bioconductor将比对结果定位到genes上; 
4. 标准化基因的比对数; 
5. 计算差异表达的统计数据; 
6. 将结果绘制出图。

Myrna有三种使用方式:

1. 在云端使用，Amazon's Elastic MapReduce service； 
2. 在 hadoop 集群上使用; 
3. 在单独的一台计算机上使用。

参考文献(2010)：Cloud-scale RNA-sequencing differential expression analysis with Myrna
只有读该文献才能全面了解Myrna.

二. Myrna的安装

1. 下载Myrna并安装

cd /home/chenlianfu/programs/
$ wget http://nchc.dl.sourceforge.net/project/bowtie-bio/myrna/1.2.0/myrna-1.2.0.zip
$ unzip myrna-1.2.0.zip

设置Myrna的环境变量MYRNA_HOME:
$ vim ~/.bashrc
export MYRNA_HOME=/home/chenlianfu/programs/myrna-1.2.0

2. 安装bowtie

安装完bowtie，再将其执行程序bin目录添加到环境变量MYRNA_BOWTIE_HOME。或者在运行程序时候使用参数 –bowtie

设置Myrna的环境变量MYRNA_BOWTIE_HOME:
$ vim ~/.bashrc
export MYRNA_BOWTIE_HOME=/home/chenlianfu/programs/bowtie-1.0.0/

3. 安装R，Bioconductor及必须的包

本软件(版本1.20)的使用必须使用R2.14，高版本的R还不兼容。故使用Myrna自带的一个脚本来下载R2.14并安装。

# R

安装R包 lmtest 和 multicore:
> install.packages("lmtest")
> install.packages("multicore")

安装Bioconductor:
> source("http://bioconductor.org/biocLite.R")
> biocLite()

安装Bioconductor包 IRanges,geneplotter,BiocGenerics 和 biomaRt:
> source("http://bioconductor.org/biocLite.R")
> biocLite("IRanges")
> biocLite("geneplotter")
> biocLite("BiocGenerics")
> biocLite("biomaRt")

设置Myrna的环境变量MYRNA_RHOME(包含bin/R和bin/Rscript):
vim ~/.bashrc
export MYRNA_RHOME=/usr/local/

4. 安装SRA toolkit

如果输入Myrna的软件是.sra格式，则需要SRA toolkit中的程序 fastq-dump 了。安装玩 SRA toolkit 后，再设置Myrna的环境变量：

$ vim ~/.bashrc
export MYRNA_SRATOOLKIT_HOME=/home/chenlianfu/programs/sratoolkit.2.3.1-centos_linux64/bin/

5. 测试本地化运行程序

使用本地化运行程序myrna_local来进行测试，通过表示安装成功。

$ $MYRNA_HOME/myrna_local --test
...
PASSED install test

6. 设置的一些环境变量

$ vim ~/.bashrc
export MYRNA_BOWTIE_HOME=/home/chenlianfu/programs/bowtie-1.0.0/
export MYRNA_HOME=/home/chenlianfu/programs/myrna-1.2.0
export MYRNA_RHOME=/usr/local/
export MYRNA_SRATOOLKIT_HOME=/home/chenlianfu/programs/sratoolkit.2.3.1-centos_linux64/bin/
export MYRNA_REFS=/home/chenlianfu/programs/myrna-1.2.0/reftools/

三. Myrna的使用

1. 本地化运行程序的使用参数：

--reference <path>
    参考序列的文件夹。该文件夹由一个参考序列jar文件使用unzip解压缩出来的，里面主
要包括两个子文件夹：1."index":里面包含bowtie-bulid产生的index文件；2."ival
s":里面包含gene的注释信息。

--input <path>
    输入的文件或文件夹路径。如果不指定--preprocess 参数，则该参数后接的是一个
文件夹，该文件夹包含了reads的预处理结果，即 --preprocess-output 的输出结果；
如果指定了 --preprocess 参数，则是一个文本文件，内容为输入的序列文件的一个清单
(Labeled manifest files)。该文件描述了输入序列文件和样本的信息。序列文件是
FASTQ或sra格式文件，FASTQ格式可以是gzip或bzip2压缩。该文件每一行代表一个测序
的结果，是以下是 unpaired reads 和 paired reads 的书写方法：
<URL>(tab)<Optional MD5>(tab)<Sample label>
<URL 1>(tab)<Optional MD5 1>(tab)<URL 2>(tab)<Optional MD5 2>(tab)<Sample label>

其中URL是测序结果文件的路径，可以是ftp站点上的文件；MD5是可选的，以利于下载文件后
用于文件的完整度检验，若不想进行检验，则设置为0；<Sample label>的写法为：
<Group ID>-<BioRep ID>-<TechRep ID>

其中，Group ID 是样品的名称；BioRep ID是生物学重复的代号；TechRep ID是技术性
重复的代号。比如：
Tumor-Subject1-Lane1
Tumor-Subject1-Lane2
Tumor-Subject2-Lane1
Tumor-Subject2-Lane2
Normal-Subject1-Lane1
Normal-Subject1-Lane2
Normal-Subject2-Lane1
Normal-Subject2-Lane2

--output <path>
    输出文件的路径。如果 --just-preprocess 参数给定，则只是对reads文件进行预
处理，给出结果。默认情况下，则是输出最终结果：一系列的表格，图和比对文件。

--intermediate <path>  default: /tmp/myrna/intermediate/<PID>
    中间结果文件存放的临时路径。如果指定了 --keep-intermediates 或 --keep-
all参数，则该文件则会一直存在。默认情况下会使用完毕后删除。

--preprocess-output <path>
    对reads进行预处理的输出路径。由于对reads的预处理会消耗很多时间，因此保留数据
的预处理结果，对再次运行程序来说，可以节约很多时间。

--keep-intermediates <path>
    保留程序运行中生成的结果文件夹和文件，默认情况下这些文件会被尽快的自动删除。

--keep-all
    保留所有的临时文件，默认情况下这些文件会被尽快的自动删除。

--cpus <int>  default: 1
   设定程序运行使用的CPU线程数。

--bowtie <path>
    设定Bowtie的路径，以运用bowtie进行比对的步骤，该参数优先于环境变量MYRNA_
BOWTIE_HOME，$MYRNA_HOME/bin的子目录，以及 PATH。

--fastq-dump <path>
    设定 SRA toolkit 中的 fastq-dump 程序的位置，优先于PAHT。

--Rhome <path>
    设定R和R/Bioconductor安装程序的位置，该目录包括 bin/R 和 bin/Rscript,
优先于环境变量MYRNA_RHOME和PATH。

2. emr, hadhoop 和 local运行的共同参数

--quality { phred33 | phred64 | solexa64 }  default: phred33
    设置输入reads的碱基质量格式。phred33:输入的碱基质量等于ASCII码值加上33;最
低碱基质量是“#”. phred64:输入的碱基质量等于ASCII码值加上64;最低碱基质量是“B”. 

--preprocess
    当 --input 的参数是一个文本文件（含有输入序列信息的清单）时，则必须加入该参数
；该参数指定出了文本文件中的序列信息来，Myrna则进行这些reads的预处理，将生成的预处
理reads结果输出到 --preprocess-output 指定的文件夹；如果不指定--preprocess
-output参数，则输出到 --intermediate 指定的文件夹。

--just-preprocess
    和 --preprocess 参数类似，但不同的是进行reads预处理后，即停止pipeline的
运行，并将结果输出到 --output指定的文件夹。

--just-align
    不将Myrna的pipeline运行到底，只是运行到比对结束，并将结果存放到 --output 
参数指定的URL.

--resume-align
    使用此命令来运行上一个命令（比对完毕）过后的阶段。 此时 --input 的参数则是上
一个命令中 --output的URL.

--bowtie-args "<args>"  default: -m 1
    设置bowtie的参数。

--discard-reads <fraction>  default: 0.0
    丢弃一定比例输的reads来进行程序的运行。比如 0.5 则代表丢弃50%的reads。该参
数应用于所有输入的reads。对于程序的调试比较有用。

--influence <int>  default: 1
    一个read的"influence"长度。和下面3个参数有关系。

--from3prime
    设定修剪过后的reads从3'端开始衡量其"influence"。此参数是默认设置，和下2个
参数是冲突的，三选一。

--from5prime
    设定修剪过后的reads从5'端开始衡量其"influence"。

--from-middle
    设定修剪过后的reads从中间开始衡量其"influence"。

--top <int>  default:50
    将genes按p值从小到大排序，Myrna将报告出前一定数目的genes的比对结果。默认是
50个genes。

--family { poisson | gaussian }  default: poisson
    设定检测gene的差异表达的统计方法。possion适合于样本<=10个；gaussian适合于
样本数>10。当然这不是硬性规定，具体还要用实验验证。当 --nulls 参数是 0 或者 没有
给出时，则使用此参数中的模型来作参数检验(parameric statistical model)，计算
出p-value.当 --nulls参数大于0，则使用permutation test来计算p-value.

--normalize { total | median | upper-quartile }
    Each count (or pooled count) is "normalized" according to a 
per-label baseline normalization factor.

--gene-footprint { union | intersect }  default: intersect
    计算比对和genges的关联时所使用的 gene footprint。intersect 和 union 
分别对应这reference文件夹中的ivals子文件夹中的ui和un文件夹；其中un对应的基因区
要比ui多，并且un的基因区包含了ui的，个人推测可能是un包含了5'端和3'端非翻译区。
union: 将一个比对结果指向某一个gene的必要条件是，read的"influence"(默认下是
从3'端开始)有一个碱基和一个gene任意的(any)transcripts重叠；intersect: 将
一个比对结果指向某一个gene的必要条件是，read的"influence"(默认下是从3'端)有一
个碱基和一个gene所有的(all)的transcripts重叠。我个人的理解，通俗来讲：以上数个
参数的目的是计算比对到gene上的reads的数目。根据比对结果，默认将reads从3'端开始
计算，如果比对上reads有一个碱基和一个gene的一个外显子重叠，则在union模式下这个
gene则可以得到一个read比对结果；而在intersect模式下，这个reads必须和该gene所
有的外显子有一个碱基的重叠，则该gene才能得到一个read比对结果。可见intersect要求
read贯穿整个gene才行能得到比对到gene的数目；因此，对于short reads进行基因表达
量计算的时候，则需要选union模式，不能选默认的模式。

--paired-ttest
    如果设定此值，则表示： 1, 是对2个样作比较； 2, 这2个样的生物学重复必须要匹
配。当该参数使用时，将对两个样作t检验。

--nulls <int>
    Set the number of null statistics generated per gene to <int>.
 A null statistic for a gene is calculated by permuting the sample
 labels for the counts and normalization factors, then re-calcula-
-ting the statistical test. 如果<int>是一个非0的数，则p-values由permu-
-tation test得出；否则，p-values由参数检验模型(parametric statistical 
model)计算出。

--truncate <int>
    将所有的reads都从3'端开始修剪，直到该参数设定的长度；小于该长度的reads依然
保留。

--truncate-discard <int>
    将所有的reads都从3'端开始修剪，直到该参数设定的长度；而小于该长度的reads全
部剔除。

--ditch-alignments
    不输出比对结果和图形结果；依然报告差异表达基因，p-values和相关的图形结果。

--discard-mate <int>
    对于 paired-end reads，去掉其中一端的序列( 1 或 2 )再来进行比对。这对于
一些数据集的比较有用，比如数据集中同时有 paired reads 和 unpaired reads时。

--pool-reps
    将所有的重复，包括生物学重复和技术性重复融合后，再进行计算统计。

--pool-tech-reps
    将技术性重复进行融合后，在进行计算统计。

--test
    搜索Myrna需要的工具(Bowtie, R/Bioconductor 和 fastq-dump），来检测
Myrna能否正确运行。

--tmpdir `<int>`  default: /tmp/Myrna/invoke.scripts
    临时文件存放的位置，是一个动态生成的脚本。

四. 个人感受

使用时，生成的结果没有rpkm值，p-value值及预期的图表结果。不知到哪儿出错了。比对的之前的过程都是正常的。同时使用sra文件时候，提示fastq-dump找不到或无法执行(实际上在–test的时候可行)。无法正常使用myrna。费心费力，不想深究了…悲剧…再去研究cufflinks去。

使用代理上国外网站

发表于2013 年 4 月 10 日由chenlianfu

经常的，会遇到重要的国外网站上不去。比如著名的bowtie软件，在国内就上不去该网站了。或者google时常抽风。欲解决这些问题，则要开启一些代理来正常访问国外的网站，以下是具体的方法：

1. 从一些网站搜索大量的代理IP

从代理IP网来得到一些代理IP。此时打开该网站主页，选择国外代理页面。

$ wget http://www.dlipw.com/a/guowaidailiip/2013/0330/4269.html
下载了该网页的所有内容
$ grep -P "\d+\.\d+\.\d+\.\d+:\d+" 4269.html > IPs
通过perl的正则表达式提取出含ip地址的行
$ perl -p -i -e 's/^.*(\d+\.\d+\.\d+\.\d+:\d+).*$/$1/' IPs
继续将ip地址和端口提取出来，每行一个ip地址和端口信息

2. 代理猎手软件Proxy Hunter来验证IP

代理猎手软件解压缩后，在windows下打开直接使用。代理猎手下载网页：http://www.orsoon.com/soft/down3555.html。其使用方法：

验证数据设置：
选择系统菜单——参数设置——验证数据设置——添加——验证地址输入：http://bowtie-bio.s
ourceforge.net/index.shtml——特征字串输入：ultrafast——确定。
导入并验证IP：
点击搜索结果——导入结果，通过浏览来打开上一步骤中的文件——验证全部——Free的结果即为我
们需要的代理IP了。

使用以上方法搜索出的几个结果：

2.133.93.50:9090
2.135.237.92:9090
2.135.237.98:9090
2.181.177.7:8080
2.183.155.2:8082
2.184.30.2:8080
2.184.31.2:8080
5.8.242.11:8080
2.184.30.2:8080
2.183.155.2:8082
5.199.166.250:3128
1.34.181.22:110

3. firefox方便地代理上网

安装附加组件AutoProxy方便地代理上网，以免自己手动切换的麻烦。该组件设置一些规则，指定一些网站代理上网，其它网张正常上网。

4. 一些在线网页版的代理

http://proxyie.cn/

2025年 7月
一	二	三	四	五	六	日
	1	2	3	4	5	6
7	8	9	10	11	12	13
14	15	16	17	18	19	20
21	22	23	24	25	26	27
28	29	30	31