1. Methods reported in the literature
The paper "An efficient approach to finding Siraitia grosvenorii triterpene biosynthetic genes by RNA-seq and digital gene expression analysis" reports:
Prior to mapping reads to the reference database, we filtered all sequences to remove adaptor sequence, low quality sequences (tags with unknown sequences ‘N’), empty tags (sequence with only adaptor sequences but no tags); low complexity, and tags with a copy number of 1 (probably sequencing error). A preprocessed database of all possible CATG+17 nucleotide tag sequences was created using our transcriptome reference database. For annotation, all tags were mapped to the reference sequences and only allowed 1 or fewer nucleotide mismatches. All the tags mapped to reference sequences from multiple genes were filtered and the remaining tags were designed as unambiguous tags. For gene expression analysis, the number of expressed tags was calculated and then normalized to TPM (number of transcripts per million tags); and the differentially expressed tags were used for mapping and annotation.
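The tag filtering described in this excerpt can be sketched in a few lines of Python. This is only an illustration, not the authors' pipeline: the function name `clean_tags` and the `min_copies` parameter are made up here, and adaptor trimming and the low-complexity filter are left out.

```python
from collections import Counter

def clean_tags(tags, min_copies=2):
    """Count tags, then drop empty tags, tags containing 'N',
    and tags seen fewer than min_copies times (hypothetical helper)."""
    counts = Counter(t.upper() for t in tags if t)  # skip empty tags
    return {
        tag: n
        for tag, n in counts.items()
        if "N" not in tag and n >= min_copies
    }

# Toy input: one tag seen twice, one with an unknown base, one singleton, one empty.
raw = ["CATGAAAC", "CATGAAAC", "CATGNAAC", "CATGTTTT", ""]
cleaned = clean_tags(raw)
print(cleaned)
```

Counting first and filtering afterwards keeps the copy number of every surviving tag, which is what the later expression step needs.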
The paper "3′ tag digital gene expression profiling of human brain and universal reference RNA using Illumina Genome Analyzer" reports:
Illumina Pipeline Software version 1.0 was used for off-instrument data processing. Images from every sequencing cycle were converted to signal intensities using Illumina Pipeline’s FireCrest v.1.9.5. Next, Bustard v.1.9.5 was run to perform base calling using the intensity values and calculate quality scores for every base. The 16-base long reads (excluding the 4-base DpnII recognition site) were aligned to DpnII tag tables generated by Stowers Institute http://research.stowers-institute.org/microarray/tag_tables/index.html using megaBLAST with word size of 12 and low-complexity region filtering turned off. Only reads that perfectly matched to tag tables without mis-matches and gaps were considered. From this set, reads that could be aligned to the Stowers’ repeat tag table were excluded (the repeat tag table contains any reads aligned to ≥ 2 locations, unless all locations are from the same gene). The remaining reads were aligned to the combination of canonical (exonic and splice junction tags from protein-coding transcripts), mitochondrial (tags from any mitochondrion-associated transcripts encoded by both genomic and mitochondrial DNA), and ribosomal (tags from rRNA or tRNA) tag tables. Reads mapping on genes with multiple homologous family members were excluded from our analysis. When there were multiple types of tags aligned to different locations of the same gene, the gene expression levels are represented by the summation of all.
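The two rules at the end of this excerpt (discard tags that hit more than one gene, then sum all remaining tags of a gene) can be illustrated with a small sketch. The function `gene_counts` and the toy tag and gene names are hypothetical, and the tag-to-gene table is assumed to come from an earlier alignment step.

```python
from collections import defaultdict

def gene_counts(tag_counts, tag_to_genes):
    """Sum tag counts per gene, using only tags that map to exactly one gene.

    tag_counts:   dict tag -> number of reads (assumed already cleaned)
    tag_to_genes: dict tag -> set of gene IDs the tag aligns to
    """
    totals = defaultdict(int)
    for tag, n in tag_counts.items():
        genes = tag_to_genes.get(tag, set())
        if len(genes) == 1:          # skip unmapped and multi-gene tags
            (gene,) = genes
            totals[gene] += n
    return dict(totals)

tag_counts = {"T1": 5, "T2": 3, "T3": 7, "T4": 2}
tag_to_genes = {
    "T1": {"geneA"},
    "T2": {"geneA"},            # second tag from the same gene: summed
    "T3": {"geneA", "geneB"},   # ambiguous: dropped
}                               # T4 has no hit: dropped
counts = gene_counts(tag_counts, tag_to_genes)
print(counts)
```

A tag that aligns to several locations within the same gene still has a one-element gene set here, so it is kept, which matches the exception noted for the repeat tag table.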
The paper "Digital gene expression analysis of two life cycle stages of the human-infective parasite, Trypanosoma brucei gambiense reveals differentially expressed clusters of co-regulated genes" reports:
All tags were mapped to the in silico generated transcriptome of T. b. brucei TREU 927[35], the most closely related fully annotated genome available to the T. b. gambiense strain, using MAQ program maq-0.6.8_x86_64-linux[65], allowing for a 2 bp mismatch between the tag and the reference transcriptome. The in silico transcriptome did not contain 5′ or 3’UTR sequences as these have not been defined in T. brucei. Tags that were generated with a poor quality sequencing score were removed from the analysis. A mapping quality score of 40, incorporating sequence quality and ability of the tag to map to one unique site in the transcriptome, was used to identify tags that align uniquely to the reference sequence. The aligned tags will be available in TritrypDB[35]. This study was limited to tags that map to open reading frames only and does not show tags that map to mRNA with long 3’UTRs.
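2. Summary of the method
Based on the methods in the papers above, these are the points to watch when analyzing 3′ tag data:
1. Preprocess the tag data. Remove tags that contain adaptor sequence, low-quality tags, and low-copy tags such as those seen only once. What remains is the clean data used for the analysis.
2. Extract the restriction-site tag sequences from the transcriptome and build a database. If a genome with gene structure annotation, or a reference transcriptome, is available, extract the CATG+17-base sequence at the 3′ end of each gene. Note that if the gene structure annotation has no 3′ UTRs, the tags can only be aligned to the ORF regions of the genome. Some papers skip the extraction step and align the tags directly against the full transcriptome sequences; this probably gives poorer results.
The site http://research.stowers-institute.org/microarray/tag_tables/index.html appears to provide Perl scripts for extracting CATG sequences.
3. Use an alignment program to align the clean tags to the database.
4. Determine gene expression levels from the alignment results, expressed as TPM (number of transcripts per million tags).
5. Perform differential gene expression analysis based on the expression levels.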
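Steps 2 and 4 can be sketched as follows. This is a minimal example under stated assumptions, not the procedure of any of the papers: `three_prime_tag` takes the 3′-most CATG site that still has 17 bases downstream, `to_tpm` normalizes by the sum of the counts passed in, and both function names are invented for illustration.

```python
def three_prime_tag(seq, site="CATG", tag_len=17):
    """Return site + tag_len bases for the 3'-most site in seq that has
    tag_len bases downstream, or None if no such site exists."""
    seq = seq.upper()
    # The site must end at or before this index so that the tag still fits.
    end = max(len(seq) - tag_len, 0)
    pos = seq.rfind(site, 0, end)
    if pos == -1:
        return None
    return seq[pos:pos + len(site) + tag_len]

def to_tpm(counts):
    """Scale raw per-gene tag counts to tags per million counted tags."""
    total = sum(counts.values())
    if total == 0:
        return {gene: 0.0 for gene in counts}
    return {gene: n * 1_000_000 / total for gene, n in counts.items()}

# Toy transcript: the last CATG is too close to the 3' end, so the
# upstream site is used instead.
transcript = "GG" + "CATG" + "A" * 17 + "CATG" + "TT"
tag = three_prime_tag(transcript)
tpm = to_tpm({"geneA": 1, "geneB": 3})
print(tag, tpm)
```

Whether to normalize by the counted tags, as here, or by all clean tags including unmapped ones is a choice the papers do not spell out in the excerpts above, so it should be fixed once and applied to every sample being compared.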