Analysis methods for 3′ tag digital gene expression profiling

1. Literature reports

The paper "An efficient approach to finding Siraitia grosvenorii triterpene biosynthetic genes by RNA-seq and digital gene expression analysis" reports:
Prior to mapping reads to the reference database, we filtered all sequences to remove adaptor sequence, low quality sequences (tags with unknown sequences ‘N’), empty tags (sequence with only adaptor sequences but no tags); low complexity, and tags with a copy number of 1 (probably sequencing error). A preprocessed database of all possible CATG+17 nucleotide tag sequences was created using our transcriptome reference database. For annotation, all tags were mapped to the reference sequences and only allowed 1 or fewer nucleotide mismatches. All the tags mapped to reference sequences from multiple genes were filtered and the remaining tags were designed as unambiguous tags. For gene expression analysis, the number of expressed tags was calculated and then normalized to TPM (number of transcripts per million tags); and the differentially expressed tags were used for mapping and annotation.

The paper "3′ tag digital gene expression profiling of human brain and universal reference RNA using Illumina Genome Analyzer" reports:
Illumina Pipeline Software version 1.0 was used for off-instrument data processing. Images from every sequencing cycle were converted to signal intensities using Illumina Pipeline’s FireCrest v.1.9.5. Next, Bustard v.1.9.5 was run to perform base calling using the intensity values and calculate quality scores for every base. The 16-base long reads (excluding the 4-base DpnII recognition site) were aligned to DpnII tag tables generated by Stowers Institute (http://research.stowers-institute.org/microarray/tag_tables/index.html) using megaBLAST with word size of 12 and low-complexity region filtering turned off. Only reads that perfectly matched to tag tables without mis-matches and gaps were considered. From this set, reads that could be aligned to the Stowers’ repeat tag table were excluded (the repeat tag table contains any reads aligned to ≥ 2 locations, unless all locations are from the same gene). The remaining reads were aligned to the combination of canonical (exonic and splice junction tags from protein-coding transcripts), mitochondrial (tags from any mitochondrion-associated transcripts encoded by both genomic and mitochondrial DNA), and ribosomal (tags from rRNA or tRNA) tag tables. Reads mapping on genes with multiple homologous family members were excluded from our analysis. When there were multiple types of tags aligned to different locations of the same gene, the gene expression levels are represented by the summation of all.

The paper "Digital gene expression analysis of two life cycle stages of the human-infective parasite, Trypanosoma brucei gambiense reveals differentially expressed clusters of co-regulated genes" reports:
All tags were mapped to the in silico generated transcriptome of T. b. brucei TREU 927[35], the most closely related fully annotated genome available to the T. b. gambiense strain, using MAQ program maq-0.6.8_x86_64-linux[65], allowing for a 2 bp mismatch between the tag and the reference transcriptome. The in silico transcriptome did not contain 5′ or 3’UTR sequences as these have not been defined in T. brucei. Tags that were generated with a poor quality sequencing score were removed from the analysis. A mapping quality score of 40, incorporating sequence quality and ability of the tag to map to one unique site in the transcriptome, was used to identify tags that align uniquely to the reference sequence. The aligned tags will be available in TritrypDB[35]. This study was limited to tags that map to open reading frames only and does not show tags that map to mRNA with long 3’UTRs.

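2. Summary of the method

Based on the methods in the papers above, the key points of a 3′ tag analysis are:

1. Preprocess the tag data. Remove tags that contain adaptor sequence, low-quality tags, and low-copy tags such as those that occur only once. What remains is the clean data used for analysis.

2. Extract the restriction-site sequences from the transcriptome and build a tag database. If a genome with a gene structure annotation file, or a reference transcriptome, is available, extract the CATG+17-base sequence at the 3′ end of each gene. Note that if the gene structure annotation lacks 3′ UTRs, tags can only be mapped to the ORF regions of the genome. Some papers skip the extraction step and map tags directly against the transcriptome sequences; the results of that approach are probably poorer.
The site http://research.stowers-institute.org/microarray/tag_tables/index.html appears to provide Perl scripts for extracting CATG sequences.

3. Map the tags in the clean data to the database with an alignment program.

4. Determine gene expression levels from the mapping results, expressed as TPM (number of transcripts per million tags).

5. Perform differential gene expression analysis based on the expression levels.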
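
Steps 1 to 4 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the pipeline used in any of the papers: the function names are made up, the tag database keeps a single CATG+17 tag per transcript (the 3′-most one), mapping is by exact match, and TPM is computed against the total number of clean tags.

```python
from collections import Counter, defaultdict

SITE = "CATG"   # anchoring restriction site
TAG_LEN = 17    # bases kept downstream of the site

def three_prime_tag(transcript):
    """Return the 3'-most full-length CATG+17 tag, or None if there is none."""
    full = len(SITE) + TAG_LEN
    pos = transcript.rfind(SITE)
    while pos != -1:
        tag = transcript[pos:pos + full]
        if len(tag) == full:
            return tag
        # Site too close to the 3' end for a full tag: try the previous site.
        pos = transcript.rfind(SITE, 0, pos + len(SITE) - 1)
    return None

def build_tag_table(transcripts):
    """Map each tag to the set of genes it can come from."""
    table = defaultdict(set)
    for gene, seq in transcripts.items():
        tag = three_prime_tag(seq.upper())
        if tag:
            table[tag].add(gene)
    return table

def gene_tpm(raw_tags, table, min_copies=2):
    """Clean the tags, keep unambiguous ones, and scale counts to TPM."""
    counts = Counter(t for t in raw_tags if "N" not in t)           # drop tags with N
    clean = {t: n for t, n in counts.items() if n >= min_copies}   # drop copy number 1
    total = sum(clean.values())
    per_gene = Counter()
    for tag, n in clean.items():
        genes = table.get(tag, ())
        if len(genes) == 1:          # skip unmapped tags and tags shared by genes
            per_gene[next(iter(genes))] += n
    return {gene: n * 1e6 / total for gene, n in per_gene.items()}
```

A real analysis would also allow a mismatch during mapping and would usually extract tags from every CATG site, so treat the exact-match lookup here as the simplest possible stand-in.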

GO slim

1. Introduction to GO slim

GO slims are cut-down versions of the GO ontologies containing a subset of the terms in the whole GO. They give a broad overview of the ontology content without the detail of the specific fine grained terms.
Put simply, GO slim simplifies GO annotation results by grouping all GO annotations into a chosen handful of GO functional categories.
See: GO Database Guide
See: Ontology Downloads
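
The grouping idea can be shown with a toy ontology. The sketch below is a simplification and not the algorithm map2slim implements: the term names are invented labels rather than real GO IDs, and a term is simply mapped to the nearest slim terms found while walking up its parent links.

```python
def map_to_slim(term, parents, slim_terms):
    """Return the nearest slim terms reachable from `term` through parent links."""
    found, seen, stack = set(), set(), [term]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        if current in slim_terms:
            found.add(current)                       # stop climbing at a slim term
        else:
            stack.extend(parents.get(current, ()))   # keep walking towards the root
    return found

# Hypothetical toy ontology: child -> list of parents.
toy_parents = {
    "sterol biosynthesis": ["lipid metabolism"],
    "lipid metabolism": ["metabolism"],
    "metabolism": ["root"],
    "transport": ["root"],
}
toy_slim = {"metabolism", "transport"}
```

With these toy inputs, `map_to_slim("sterol biosynthesis", toy_parents, toy_slim)` collapses the specific term to the broad `metabolism` category.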

2. Installing go-perl

Method 1:

# perl -MCPAN -e shell
cpan[1]> install GO::Parser

Method 2:

$ wget http://search.cpan.org/CPAN/authors/id/C/CM/CMUNGALL/go-perl-0.15.tar.gz
$ tar zxf go-perl-0.15.tar.gz
$ cd go-perl-0.15
$ perl Makefile.PL
$ make 
$ sudo make install

View the go-perl documentation:

$ perldoc go-perl.pod

3. Using map2slim

Usage:

$ map2slim GO_slims/goslim_generic.obo ontology/gene_ontology.obo \
gene-associations/gene_association.fb

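Download gene_ontology.obo and goslim_generic.obo from Ontology Downloads.
The input is a file in GAF format, which can also be generated with the professional edition of blast2go.

The advantage of map2slim is that you can build an obo file for your own species and then run the program to count the genes in the functional categories you are interested in.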
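
Counting genes per category from a GAF file takes only a few lines. The sketch below assumes the standard GAF column layout (column 2 is the gene product ID, column 5 is the GO ID) and that comment lines start with `!`; the function name is made up for illustration.

```python
from collections import defaultdict

def genes_per_term(gaf_lines):
    """Count distinct gene products annotated to each GO ID in GAF-format lines."""
    genes = defaultdict(set)
    for line in gaf_lines:
        if line.startswith("!") or not line.strip():
            continue                          # header/comment or blank line
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 5:
            continue                          # not a complete annotation line
        genes[cols[4]].add(cols[1])           # GO ID -> set of gene product IDs
    return {go_id: len(ids) for go_id, ids in genes.items()}
```

Passing an open file handle works directly, for example `genes_per_term(open("slimmed.gaf"))`, where `slimmed.gaf` stands for whatever file the map2slim output was saved to.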

4. Doing GO slim with blast2go

GO slim is very simple with blast2go, but it can only use the few obo files recognized by the official site.

Using RSEM

Introduction to RSEM

RSEM is a software package for estimating gene and isoform expression levels from RNA-Seq data. The RSEM package provides an user-friendly interface, supports threads for parallel computation of the EM algorithm, single-end and paired-end read data, quality scores, variable-length reads and RSPD estimation. In addition, it provides posterior mean and 95% credibility interval estimates for expression levels. For visualization, It can generate BAM and Wiggle files in both transcript-coordinate and genomic-coordinate. Genomic-coordinate files can be visualized by both UCSC Genome browser and Broad Institute’s Integrative Genomics Viewer (IGV). Transcript-coordinate files can be visualized by IGV. RSEM also has its own scripts to generate transcript read depth plots in pdf format. The unique feature of RSEM is, the read depth plots can be stacked, with read depth contributed to unique reads shown in black and contributed to multi-reads shown in red. In addition, models learned from data can also be visualized. Last but not least, RSEM contains a simulator.

Using edgeR

1. Introduction to and installation of edgeR

edgeR: Empirical analysis of digital gene expression data in R. Differential expression analysis of RNA-seq and digital gene expression profiles with biological replication. Uses empirical Bayes estimation and exact tests based on the negative binomial distribution. Also useful for differential signal analysis with other types of genome-scale count data.

To install this package, start R and enter:

    source("http://bioconductor.org/biocLite.R")
    biocLite("edgeR")

To cite this package in a publication, start R and enter:

    citation("edgeR")

To open the edgeR User’s Guide, start R and enter the following; edgeRUsersGuide.pdf will be downloaded.

    library(edgeR)
    edgeRUsersGuide()

2.

A one-stop transcriptome analysis workflow with Trinity

1. Transcriptome assembly with Trinity

A typical Trinity assembly command looks like this:

$ /opt/biosoft/trinityrnaseq_r20131110/Trinity.pl --seqType fq --JM 50G\
 --left sample1_1.clean.fastq sample2_1.clean.fastq\
 --right sample1_2.clean.fastq sample2_2.clean.fastq\
 --jaccard_clip --CPU 6 --SS_lib_type FR

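The value given to --JM depends on the size of the transcriptome; when enough memory is available, a larger value saves time.
--left and --right can each take files from several samples, separated by spaces. Note that left read names must end with /1 and right read names with /2.
--jaccard_clip is suited to gene-dense fungal species.
--SS_lib_type is for strand-specific sequencing.

For large RNA-seq datasets (>300M pairs), it is best to process the reads with $TRINITY_HOME/util/normalize_by_kmer_coverage.pl before assembling with Trinity, which reduces memory use and run time.
Alternatively, set --min_kmer_cov 2 to discard uniquely occurring k-mers and so reduce memory use.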
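
To see why dropping uniquely occurring k-mers saves memory, here is a minimal k-mer counting sketch. It is not Trinity's implementation: the function names are invented, the toy example uses a tiny k, and strand handling is ignored.

```python
from collections import Counter

def kmer_counts(reads, k):
    """Count every k-mer that occurs in the reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def drop_rare_kmers(counts, min_cov=2):
    """Keep only k-mers seen at least min_cov times."""
    return {kmer: n for kmer, n in counts.items() if n >= min_cov}
```

Most k-mers that occur exactly once come from sequencing errors, so removing them shrinks the table that the assembler has to hold in memory, at the cost of losing very lowly expressed transcripts.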

References:
1. Grabherr MG, Haas BJ, Yassour M, Levin JZ, Thompson DA, Amit I, Adiconis X, Fan L, Raychowdhury R, Zeng Q, Chen Z, Mauceli E, Hacohen N, Gnirke A, Rhind N, di Palma F, Birren BW, Nusbaum C, Lindblad-Toh K, Friedman N, Regev A. Full-length transcriptome assembly from RNA-seq data without a reference genome. Nat Biotechnol. 2011 May 15;29(7):644-52. doi: 10.1038/nbt.1883. PubMed PMID: 21572440.
2. Borodina T, Adjaye J, Sultan M. A strand-specific library preparation protocol for RNA sequencing. Methods Enzymol. 2011;500:79-98. PubMed PMID: 21943893.

2. Statistics on the Trinity output

By default Trinity writes its result to trinity_out_dir/Trinity.fasta.
Sequence names in this FASTA file look like:

>comp6749_c0_seq1 len=328 path=[471:0-83 388:84-208 679:209-327]
>comp6749_c0_seq2 len=328 path=[304:0-83 388:84-208 679:209-327]
>comp6749_c0_seq3 len=245 path=[901:0-125 679:126-244]

As shown, Trinity reports its results as components, and one component may contain several seqs. This is analogous to one gene having several transcripts.
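
The component/seq relationship can be recovered from the header lines with a regular expression. This sketch assumes the `compN_cN_seqN len=N` naming shown above; the function name is made up.

```python
import re
from collections import defaultdict

# Matches headers such as ">comp6749_c0_seq1 len=328 path=[...]".
HEADER = re.compile(r"^>(comp\d+_c\d+)_seq(\d+)\s+len=(\d+)")

def group_by_component(header_lines):
    """Return {component: [(seq_number, length), ...]} from FASTA header lines."""
    groups = defaultdict(list)
    for line in header_lines:
        match = HEADER.match(line)
        if match:
            groups[match.group(1)].append((int(match.group(2)), int(match.group(3))))
    return dict(groups)
```

Applied to the three example headers, it returns a single component, `comp6749_c0`, holding three transcripts.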

The TrinityStats.pl script bundled with Trinity reports the number and size of components and transcripts, N50, and related statistics.

$ $TRINITY_HOME/util/TrinityStats.pl trinity_out_dir/Trinity.fasta
Total trinity transcripts:	40138
Total trinity components:	31067
Percent GC: 61.31
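
For reference, N50 is the length L such that sequences of length L or longer contain at least half of all assembled bases. A minimal sketch of that definition (not the code inside TrinityStats.pl):

```python
def n50(lengths):
    """Return the N50 of a list of sequence lengths (0 for an empty list)."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:   # reached half of all bases
            return length
    return 0
```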

3. Mapping reads to the transcriptome and visualizing the alignments

$TRINITY_HOME/util/alignReads.pl calls bowtie to map reads to the transcriptome and accepts a strand-specificity parameter.

$ $TRINITY_HOME/util/alignReads.pl --left left.fq --right right.fq --seqType fq\
 --target Trinity.fasta --aligner bowtie --retain_intermediate_files

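The results include coordSorted and nameSorted SAM and BAM files. If the strand-specificity parameter was set, separate alignment files for the + and - strands are also produced.

$TRINITY_HOME/util/SAM_nameSorted_to_uniq_count_stats.pl summarizes the alignment results: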

$ $TRINITY_HOME/util/SAM_nameSorted_to_uniq_count_stats.pl bowtie_out.nameSorted.sam.+.sam
#read_type  count   pct
proper_pairs    21194964    93.22    both read pairs align to a single contig and point toward each other.
left_only   836213  3.68             only the left (/1) read is reported in an alignment
right_only  687576  3.02             only the right (/2) read is reported in an alignment
improper_pairs  16640   0.07         both left and right reads align, but to separate contigs, or to a single contig in the wrong expected relative orientations.

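To visualize the alignments, load Trinity.fasta into IGV as the genome and then load the BAM file.

4. Estimating expression levels with RSEM

First, download the latest version of RSEM, install it, and add its programs to $PATH.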

$ wget http://deweylab.biostat.wisc.edu/rsem/src/rsem-1.2.8.tar.gz
$ tar zxf rsem-1.2.8.tar.gz
$ cd rsem-1.2.8
$ make
$ echo "PATH=$PWD:\$PATH" >> ~/.bashrc

$TRINITY_HOME/util/RSEM_util/run_RSEM_align_n_estimate.pl calls RSEM to estimate expression levels. For strand-specific sequencing, add the --SS_lib_type parameter.

$TRINITY_HOME/util/RSEM_util/run_RSEM_align_n_estimate.pl --transcripts Trinity.fasta \
        --seqType fq --left left.reads.fq --right right.reads.fq --SS_lib_type FR \
        --prefix RSEM --thread_count 4 -- --bowtie-phred64-quals --no-bam-output

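As shown above, the rsem-calculate-expression parameters --bowtie-phred64-quals and --no-bam-output are passed through run_RSEM_align_n_estimate.pl after a bare --. They indicate, respectively, that the FASTQ quality encoding is phred64 and that no BAM file should be written (which saves a lot of time).
If the run fails, see the RSEM README file.

The run produces two abundance estimation files: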
RSEM.isoforms.results : EM read counts per Trinity transcript
RSEM.genes.results : EM read counts on a per-Trinity-component (aka… gene) basis, ‘gene’ used loosely here.

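Based on these results, transcripts with IsoPct below 1% can be removed: $TRINITY_HOME/util/filter_fasta_by_rsem_values.pl uses RSEM.isoforms.results to filter lowly supported transcripts out of the Trinity assembly.
However, filtering out these sequences is not recommended.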
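
Before deciding whether to filter, it can help to list which transcripts fall below the threshold. The sketch below is not filter_fasta_by_rsem_values.pl; it assumes the results file is tab-separated with a header row containing `transcript_id` and `IsoPct` columns, and the function name is made up.

```python
import csv
import io

def low_isopct_transcripts(results_text, min_isopct=1.0):
    """Return IDs of transcripts whose IsoPct is below min_isopct."""
    reader = csv.DictReader(io.StringIO(results_text), delimiter="\t")
    return [row["transcript_id"] for row in reader
            if float(row["IsoPct"]) < min_isopct]
```

For example, `low_isopct_transcripts(open("RSEM.isoforms.results").read())` lists the candidates without touching the assembly.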

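5. Identifying differentially expressed transcripts

Trinity can use the Bioconductor packages edgeR or DESeq to identify differentially expressed transcripts, so R and several related packages must be installed.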

source("http://bioconductor.org/biocLite.R")
biocLite('edgeR')
biocLite('DESeq')
biocLite('ctc')
biocLite('Biobase')
install.packages('gplots')
install.packages('ape')

5.1 Use RSEM as in the previous section to estimate expression for every biological replicate of every sample

5.2 Merge the RSEM results of all samples

$ $TRINITY_HOME/util/RSEM_util/merge_RSEM_frag_counts_single_table.pl \
sampleA.RSEM.isoforms.results sampleB.RSEM.isoforms.results ... \
> transcripts.counts.matrix
$ $TRINITY_HOME/util/RSEM_util/merge_RSEM_frag_counts_single_table.pl \
sampleA.RSEM.genes.results sampleB.RSEM.genes.results ... \
> genes.counts.matrix

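Then edit the column headers of the two matrix files (they hold the sample and replicate names) to make downstream analysis easier. To analyze differential expression at the transcript level use transcripts.counts.matrix; at the gene level use genes.counts.matrix.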
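
Conceptually the merge is a join of the per-sample count columns on the feature ID. The sketch below illustrates that idea and is not the Trinity script: it assumes tab-separated results files with a header row and `transcript_id` and `expected_count` columns, and the function name is invented.

```python
import csv
import io

def merge_counts(samples, id_col="transcript_id", count_col="expected_count"):
    """Build a counts matrix from {sample_name: results_file_text}.

    Returns (sample_names, {feature_id: [count per sample]}); features that
    are missing from a sample get a count of 0.0.
    """
    names = list(samples)
    matrix = {}
    for j, name in enumerate(names):
        reader = csv.DictReader(io.StringIO(samples[name]), delimiter="\t")
        for row in reader:
            counts = matrix.setdefault(row[id_col], [0.0] * len(names))
            counts[j] = float(row[count_col])
    return names, matrix
```

The sample names passed in become the column headers, which is where meaningful sample and replicate names are worth choosing.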

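5.3 Differential expression analysis without biological replicates

$TRINITY_HOME/Analysis/DifferentialExpression/run_DE_analysis.pl calls edgeR or DESeq to identify differentially expressed genes. Run the command without arguments to see its usage.
Trinity recommends edgeR for differential expression analysis.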

$TRINITY_HOME/Analysis/DifferentialExpression/run_DE_analysis.pl \
--matrix counts.matrix --method edgeR

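Note that the input matrix must contain counts, not FPKM values.

5.4 Differential expression analysis with biological replicates

First create the file samples_described.txt with the following content: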

conditionA   condA-rep1
conditionA   condA-rep2

conditionB   condB-rep1
conditionB   condB-rep2

conditionC   condC-rep1
conditionC   condC-rep2

condA-rep1, condA-rep2, condB-rep1… and so on correspond to the column names in the counts.matrix file.
The command is:

$TRINITY_HOME/Analysis/DifferentialExpression/run_DE_analysis.pl \
--matrix SP2.rnaseq.counts.matrix --method edgeR \
--samples_file samples_described.txt

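In the result files, logFC is the log2 fold change and logCPM is the log2 counts per million.

Note that by default the program drops transcripts or genes whose counts are all below 10 and does not test them, so the number of genes or transcripts in the differential analysis is lower than the original number.

5.5 Extracting differentially expressed genes and clustering them

5.5.1 Normalizing expression values

Use TMM normalization to convert counts to FPKM.
First extract the length data from the RSEM results of one sample: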

$ cut -f 1,3,4 sampleA.RSEM.isoforms.results > feature_lengths.txt

Then convert the counts to FPKM with TMM normalization:

$ $TRINITY_HOME/Analysis/DifferentialExpression/run_TMM_normalization_write_FPKM_matrix.pl \
--matrix counts.matrix --lengths feature_lengths.txt
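
The arithmetic behind the conversion is short. In the sketch below the TMM step itself is not reproduced: the normalization factor is taken as an input (edgeR computes it), and the library size is scaled by that factor before the usual FPKM formula, count × 10^9 / (length × library size), is applied. The function name and the example numbers are illustrative only.

```python
def fpkm(count, length, lib_size, norm_factor=1.0):
    """Fragments per kilobase per million, using an effective library size.

    count       -- fragments assigned to the feature
    length      -- feature length in bases
    lib_size    -- total fragments counted in the sample
    norm_factor -- TMM normalization factor for the sample (assumed given)
    """
    effective_lib = lib_size * norm_factor
    return count * 1e9 / (length * effective_lib)
```

With 100 fragments on a 1000-base transcript in a library of one million fragments, the function gives 100 FPKM; a normalization factor of 2 halves that value.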

5.5.2 Extracting differentially expressed transcripts

Note that this step must be run in the directory containing the edgeR result files:

$ $TRINITY_HOME/Analysis/DifferentialExpression/analyze_diff_expr.pl \
--matrix matrix.TMM_normalized.FPKM -P 0.001 -C 2

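By default, features with an FDR below 0.001 and an absolute log2 fold change >= 2 are selected as differentially expressed.
The program outputs the FPKM, log2FC, FDR and other values of the differentially expressed genes, together with a clustered heat map.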
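
The selection rule itself is a two-condition filter. A minimal sketch, assuming each result row has already been parsed into a dict with hypothetical keys `id`, `logFC` and `FDR`:

```python
def select_de(rows, max_fdr=0.001, min_abs_log2fc=2.0):
    """Return IDs of features passing both the FDR and fold-change cutoffs."""
    return [row["id"] for row in rows
            if row["FDR"] < max_fdr and abs(row["logFC"]) >= min_abs_log2fc]
```

Note that a log2 fold change of 2 corresponds to a 4-fold difference, so -C 2 is stricter than it may look.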

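5.5.3 Extracting subclusters from the clustering

Subclusters can be defined automatically or manually from the clustering result.
Automatically: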

$ $TRINITY_HOME/Analysis/DifferentialExpression/define_clusters_by_cutting_tree.pl \
--Ptree 20 -R file.all.RData

The example above cuts the tree at 20% to define subclusters automatically.
Manually:

$ R
> load("all.RData") # check for your corresponding .RData file name to use here, replace all.RData accordingly
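> source(file.path(Sys.getenv("TRINITY_HOME"), "Analysis/DifferentialExpression/R/manually_define_clusters.R")) # R does not expand $TRINITY_HOME inside a string
> manually_define_clusters(hc_genes, centered_data)
Then left-click to select subclusters and right-click to finish.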

6. Extracting protein-coding regions

Use transdecoder to extract coding regions from the Trinity transcripts. The latest version of transdecoder seems to have some problems.

$ $TRINITY_HOME/trinity-plugins/transdecoder/transcripts_to_best_scoring_ORFs.pl \
-t transcripts.fasta -m 100

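The default minimum protein length is 100.
Extracting the coding regions yields the corresponding protein sequences, which helps with the functional annotation in the next step.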
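
To make the minimum-length cutoff concrete, here is a deliberately naive ORF scan. It is not what transdecoder does (transdecoder also scores coding potential and handles both strands and partial ORFs); it only looks for ATG...stop stretches in the three forward frames, and the function name is made up.

```python
STOPS = {"TAA", "TAG", "TGA"}

def forward_orfs(seq, min_aa=100):
    """Return (start, end) of forward-strand ORFs with at least min_aa codons.

    start is the index of the A in ATG, end is just past the stop codon,
    and min_aa counts the codons before the stop.
    """
    seq = seq.upper()
    found = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i                       # first ATG opens the ORF
            elif start is not None and codon in STOPS:
                if (i - start) // 3 >= min_aa:
                    found.append((start, i + 3))
                start = None                    # look for the next ORF
    return found
```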

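Background on non-coding RNA

1. rRNA

1.1 rRNA (ribosomal RNA) has the largest relative molecular mass and is the most abundant of the three classes of RNA (tRNA, mRNA, rRNA). It combines with proteins to form the ribosome, where it serves as a scaffold on which the mRNA molecule is laid out so that protein synthesis can take place.

1.2 Ribosomes in both prokaryotes and eukaryotes consist of a large and a small subunit. In prokaryotes, the 5S and 23S rRNAs are in the large subunit and the 16S rRNA is in the small subunit; in eukaryotes, the 5S, 5.8S and 28S rRNAs are in the large subunit and the 18S rRNA is in the small subunit.

1.3 In prokaryotes, the 16S, 23S and 5S rRNAs are usually arranged in tandem and transcribed together; in eukaryotes, the 18S, 28S and 5.8S rRNAs form one transcription unit, whereas the 5S rRNA occurs as highly repeated tandem arrays.

1.4 In most species the genome contains several copies of the rRNA transcription unit. Sequence divergence among these copies is usually very low, below 1%, with a maximum of only 11%.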

Installing and using RNAmmer

1. Introduction to RNAmmer

RNAmmer is a program for predicting rRNA genes. Its web version is the RNAmmer 1.2 Server.

Reference: Lagesen K, Hallin P, Rødland E A, et al. RNAmmer: consistent and rapid annotation of ribosomal RNA genes[J]. Nucleic acids research, 2007, 35(9): 3100-3108.

RNAmmer needs the following to run: 1. the Perl modules Getopt::Long and XML::Simple; 2. hmmsearch from hmmer-2.2g (the latest version causes errors).

2. Installing and running RNAmmer

Download and install hmmer-2.2g:
$ wget ftp://selab.janelia.org/pub/software/hmmer/2.2g/hmmer-2.2g.tar.gz
$ tar zxf hmmer-2.2g.tar.gz
$ cd hmmer-2.2g
$ ./configure --prefix=/opt/biosoft/hmmer-2.2g
$ mkdir -p /opt/biosoft/hmmer-2.2g/man/man1/ /opt/biosoft/hmmer-2.2g/bin
$ make; make install

Install RNAmmer. Obtaining the software requires applying with an .edu email address.
$ mkdir -p /opt/biosoft/rnammer-1.2
$ tar zxf rnammer-1.2.src.tar.Z -C /opt/biosoft/rnammer-1.2
$ perl -p -i -e 's/(my \$INSTALL_PATH).*/$1 = \"\/opt\/biosoft\/rnammer-1.2\";/' /opt/biosoft/rnammer-1.2/rnammer
$ perl -p -i -e 's/^(\s+\$HMMSEARCH_BINARY).*/$1 = \"\/opt\/biosoft\/hmmer-2.2g\/bin\/hmmsearch\";/' /opt/biosoft/rnammer-1.2/rnammer

Check that it runs correctly:
$ /opt/biosoft/rnammer-1.2/rnammer -S bac -multi -f rRNA.fasta\
 -h rRNA.hmmreport -xml rRNA.xml -gff rRNA.gff2\
 /opt/biosoft/rnammer-1.2/example/ecoli.fsa

3. Running RNAmmer and its parameters

Run /opt/biosoft/rnammer-1.2/rnammer and consult the manual page /opt/biosoft/rnammer-1.2/man/rnammer.1. The program's parameters are:

USAGE:
$ rnammer [options] sequence.fasta

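-S  kingdom of the input sequences: arc, bac or euk

-m  molecules to predict: 'tsu' for 5/8s rRNA, 'ssu' for 16/18s rRNA, 'lsu' for 23/28s rRNA. To predict all of them, pass 'tsu,ssu,lsu'.

-multi  run in parallel, predicting all molecules on both strands, with at most 6 computations at once. With this option the previous one is not needed.

-f  output FASTA file of the predicted rRNAs

-h  output HMM report

-gff  output GFF2 file of the predicted rRNAs

-xml  output XML file

An example command for predicting rRNA in a eukaryotic genome: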

$ /opt/biosoft/rnammer-1.2/rnammer -S euk -multi -f rRNA.fasta\
 -h rRNA.hmmreport -xml rRNA.xml -gff rRNA.gff2 genome.fasta

4. RNAmmer results

The run writes four files: rRNA.fasta, rRNA.gff2, rRNA.hmmreport and rRNA.xml.
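
A quick way to summarize the GFF2 file is to count the predictions of each rRNA type. This sketch assumes the usual nine tab-separated GFF columns with the rRNA type (for example `16s_rRNA`) in the last one, which should be checked against your own rRNA.gff2; the function name is made up.

```python
from collections import Counter

def count_rrna_types(gff_lines):
    """Count predictions per value of the ninth (attribute) GFF column."""
    kinds = Counter()
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue                          # comment or blank line
        cols = line.rstrip("\n").split("\t")
        if len(cols) >= 9:
            kinds[cols[8].strip()] += 1
    return dict(kinds)
```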

Filtering and trimming NGS reads with Trimmomatic

1. Trimmomatic

Trimmomatic runs on Java and is fast, and its approach to read QC is well designed. It is therefore my first recommendation for QC of NGS reads.
Reference: Lohse M, Bolger AM, Nagel A, Fernie AR, Lunn JE, Stitt M, Usadel B. RobiNA: a user-friendly, integrated software solution for RNA-Seq-based transcriptomics. Nucleic Acids Res. 2012 Jul;40(Web Server issue):W622-7.

2. A typical example

java -jar /opt/biosoft/Trimmomatic-0.30/trimmomatic-0.30.jar PE \
-threads 20 -phred33 reads1.fastq reads2.fastq \
reads1.clean.fastq reads1.unpaired.fastq reads2.clean.fastq reads2.unpaired.fastq \
ILLUMINACLIP:/opt/biosoft/Trimmomatic-0.30/adapters/TruSeq3-PE.fa:2:30:10 \
LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:50

3. Parameters

For detailed usage, see: Trimmomatic: A flexible read trimming tool for Illumina NGS data

PE/SE
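    Process Paired-End or Single-End reads; the input and output arguments differ slightly between the two modes.
-threads
    Number of threads to use.
-phred33
    Quality encoding of the bases; -phred64 is the alternative.
ILLUMINACLIP:TruSeq3-PE.fa:2:30:10
    Remove adapter sequence. The fields after the keyword are, in order: the FASTA file of adapter sequences, the maximum number of mismatches allowed, the match threshold in palindrome mode, and the match threshold in simple mode.
LEADING:3
    Remove bases at the start of the read whose quality is below 3.
TRAILING:3
    Remove bases at the end of the read whose quality is below 3.
SLIDINGWINDOW:4:15
    Slide a window from the 5' end; when the average base quality within the window drops below the threshold, cut the read at that point. Here the window is 4 bases wide and the read is cut once the average quality falls below 15.
MINLEN:50
    Minimum read length.
CROP:<length>
    Keep reads up to the given length.
HEADCROP:<length>
    Remove the given number of bases from the start of the read.
TOPHRED33
    Convert base qualities to phred33 encoding.
TOPHRED64
    Convert base qualities to phred64 encoding.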
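
The SLIDINGWINDOW and MINLEN steps can be illustrated with a simplified sketch. This is not Trimmomatic's code: it assumes phred33-encoded qualities, cuts at the start of the first failing window, and leaves reads shorter than the window untouched. The function names are invented.

```python
def sliding_window_trim(seq, qual, window=4, min_avg=15, offset=33):
    """Cut the read at the first window whose mean quality is below min_avg."""
    scores = [ord(ch) - offset for ch in qual]      # decode quality characters
    for start in range(len(scores) - window + 1):
        if sum(scores[start:start + window]) / window < min_avg:
            return seq[:start], qual[:start]
    return seq, qual

def passes_minlen(seq, min_len=50):
    """Mimic MINLEN: keep the read only if it is still long enough."""
    return len(seq) >= min_len
```

In the toy read `ACGTACGTAC` with qualities `IIIIII####` (`I` is quality 40 and `#` is quality 2 in phred33), the first window averaging below 15 starts at position 5, so the read is cut to `ACGTA`.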

EBARDenovo : A RNA-Seq Assembler

1. How to Install EBARDenovo on Linux

Mono is required to run EBARDenovo on Linux.

yum install gcc gcc-c++ bison pkgconfig glib2-devel gettext make libpng-devel libjpeg-devel libtiff-devel libexif-devel giflib-devel libX11-devel freetype-devel fontconfig-devel  cairo-devel httpd httpd-devel

$ wget http://download.mono-project.com/sources/libgdiplus/libgdiplus-2.10.tar.bz2
$ tar jxf libgdiplus-2.10.tar.bz2
$ cd libgdiplus-2.10
$ ./configure --prefix=/opt/mono
$ make -j 8; make install
$ echo 'export LD_LIBRARY_PATH=/opt/mono/lib:$LD_LIBRARY_PATH' >> ~/.bashrc
$ source ~/.bashrc

$ wget http://download.mono-project.com/sources/mono/mono-2.10.8.tar.bz2
$ tar jxf mono-2.10.8.tar.bz2
$ cd mono-2.10.8
$ ./configure --prefix=/opt/mono
$ make -j 8
$ make install
$ echo 'export PKG_CONFIG_PATH=/opt/mono/lib/pkgconfig:$PKG_CONFIG_PATH' >> ~/.bashrc
$ echo 'export PATH=/opt/mono/bin:$PATH' >> ~/.bashrc
$ source ~/.bashrc

$ wget http://ncu.dl.sourceforge.net/project/ebardenovo/EBARDenovo-1.2.2-20130404.zip
$ unzip EBARDenovo-1.2.2-20130404.zip
$ cd EBARDenovo-1.2.2-20130404/
$ mono EBARDenovo.exe -h

2. The parameters

Display parameters
-l :  no log file
-v :  no verbose mode

Quality parameters
-k (default 15): key size
-c (default 0) : minimal size of contig
-n (default 10): nail size
-e (default 8) : errors per N bp

Optional output parameters
-A : Output analysis information including coverage and alignment.
-G : skip output information for contig/gene groups to xxx-groups.txt
-P : skip output SNPs of contigs to xxx-snps.txt
-O : output small overlaps (<24bp) inside contigs to xxx-overlaps.fa
-L : output chimeric segments to xxx-delutions.fa

Execution parameters
-a : action 1: only build index files; action 2: save indices before assembly; action 3: run the assembly directly.
-d : the directory of index files.
-T : number of threads used to speed up assembly

Help
-h

3. Several examples of running EBARDenovo

1. Simplest
$ mono EBARDenovo.exe [-T 24] -o contigs.fasta read1.fq.gz read2.fq.gz

2. Two-stage indexing and assembly (the usual approach):
$ mono EBARDenovo.exe -a 1 -d index -T 24 \
-c 200 -o contigs.fasta read1.fq.gz read2.fq.gz
$ mono EBARDenovo.exe -a 3 -d index -T 24 \
-c 200 -o contigs.fasta read1.fq.gz read2.fq.gz

3. With full calculation and parameters
$ mono EBARDenovo.exe -A -a 2 -d index -T 24 \
-k 15 -c 200 -n 10 -e 8 -O -L \
-o contigs.fasta read1.fq.gz read2.fq.gz

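4. Notes

When used for de novo transcriptome assembly, the program seems to require that all short reads have the same length.
When the read dataset is too large, it tends to fail with: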

Build indx (Bulks of 100000 spots) ...1 2 3 4 5 6 7 8 9 10 11 12 Too many heap sections: Increase MAXHINCR or MAX_HEAP_SECTS
Stacktrace:

  at (wrapper managed-to-native) object.__icall_wrapper_mono_array_new_specific (intptr,int) <0xffffffff>
  at BioAsia.GSLib.TxtNgsReader.RechargeBuf () <0x0003b>
  at BioAsia.GSLib.TxtNgsReader.Peek () <0x000eb>
  at BioAsia.GSLib.FastqNgsReader.ReadSeqLine () <0x00083>
  at BioAsia.EbarDenovo.EbarIndexing.BuildIndicesAndPairs (System.Collections.Generic.List`1) <0x007ea>
  at BioAsia.EbarDenovo.EbarIndexing.Build () <0x00283>
  at BioAsia.EbarDenovo.Program.Main (string[]) <0x0021b>
  at (wrapper runtime-invoke) .runtime_invoke_void_object (object,intptr,intptr,intptr) <0xffffffff>

Native stacktrace:

        mono() [0x495e64]
        /lib64/libpthread.so.0(+0xf500) [0x7fd088d7a500]
        /lib64/libc.so.6(gsignal+0x35) [0x7fd088a0a8e5]
        /lib64/libc.so.6(abort+0x175) [0x7fd088a0c0c5]
        mono() [0x5e8375]
        mono() [0x5e0108]
        mono() [0x5e058d]
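
Since the program seems to need reads of equal length, it is worth checking the input first. A minimal sketch, assuming plain four-line FASTQ records (no wrapped sequence lines); the function name is made up.

```python
def read_lengths(fastq_lines):
    """Return the set of distinct read lengths in four-line FASTQ records."""
    lengths = set()
    for i, line in enumerate(fastq_lines):
        if i % 4 == 1:                    # second line of each record is the sequence
            lengths.add(len(line.strip()))
    return lengths
```

If the returned set has more than one element, the reads differ in length, for example after quality trimming.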

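Installing and using AMOS

AMOS was among the earliest software packages for comparative genome assembly.

Installing AMOS

MUMmer and Qt must be installed before AMOS.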

$ wget http://jaist.dl.sourceforge.net/project/mummer/mummer/3.23/MUMmer3.23.tar.gz
$ tar zxf MUMmer3.23.tar.gz
$ cd MUMmer3.23/
$ make check
$ make install
$ echo PATH=$PWD/:'$PATH' >> ~/.bashrc

$ wget http://download.qt-project.org/official_releases/qt/4.8/4.8.5/qt-everywhere-opensource-src-4.8.5.tar.gz
$ tar zxf qt-everywhere-opensource-src-4.8.5.tar.gz
$ cd qt-everywhere-opensource-src-4.8.5
$ ./configure
$ sudo yum install gstreamer-plugins-base-devel*
$ gmake -j 8
$ sudo gmake install
$ echo 'PATH=/usr/local/Trolltech/Qt-4.8.5/bin:$PATH' >> ~/.bashrc

$ wget http://nchc.dl.sourceforge.net/project/amos/amos/3.1.0/amos-3.1.0.tar.gz
$ tar zxf amos-3.1.0.tar.gz
$ cd amos-3.1.0
$ ./configure --prefix=/opt/biosoft/amos 
$ make -j 8      # fails with an error; carry on with the next two steps (skipping this make seems to cause other problems)
$ sed '1i\#include ' src/Align/find-tandem.cc > tmp
$ mv tmp src/Align/find-tandem.cc
$ make -j 8      # still reports errors; ignore them
$ make -j 8      # this time make succeeds
$ make install