Downloading human genome GRCh38 information

Date written: 2015-11-17.

As of this writing, the latest human genome assembly is GRCh38 (GCA_000001405.15). GRCh38 stands for Genome Reference Consortium Human Build 38.

Genome browsers for GRCh38:
UCSC http://genome.ucsc.edu/cgi-bin/hgGateway
Ensembl http://www.ensembl.org/Homo_sapiens/Info/Index
Vega http://vega.sanger.ac.uk/Homo_sapiens/Info/Index
GENCODE http://www.gencodegenes.org/human_biodalliance.html

The Ensembl download page for GRCh38 data: http://www.ensembl.org/info/data/ftp/index.html
I recommend downloading the GRCh38 sequence functional annotations from Ensembl: ftp://ftp.ensembl.org/pub/release-82/genbank/homo_sapiens/.

mkdir sequence_annotation
cd sequence_annotation
lftp -e "mirror -c --parallel=5 /pub/release-82/genbank/homo_sapiens/" ftp://ftp.ensembl.org
cd ..

The GENCODE download page for GRCh38 data: http://www.gencodegenes.org/releases/23.html
I recommend downloading the GRCh38 FASTA and GFF3 files from GENCODE. These 2 files serve as the main FASTA and GFF3 files for most users.

wget ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_23/GRCh38.primary_assembly.genome.fa.gz
wget ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_23/gencode.v23.basic.annotation.gff3.gz
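
After the download, a quick sanity check is to list the sequence names and lengths in the gzip-compressed FASTA file. The following Python sketch does that with the standard library only. The function name fasta_lengths is my own choice for illustration and is not part of any GENCODE tool.

```python
import gzip


def fasta_lengths(path):
    """Return {sequence_name: length} for a gzip-compressed FASTA file."""
    lengths = {}
    name = None
    with gzip.open(path, "rt") as handle:
        for line in handle:
            line = line.rstrip("\n")
            if line.startswith(">"):
                # The name is the first whitespace-delimited word of the header.
                fields = line[1:].split()
                name = fields[0] if fields else ""
                lengths[name] = 0
            elif name is not None:
                # Sequence lines are added to the most recent header.
                lengths[name] += len(line)
    return lengths
```

For example, fasta_lengths("GRCh38.primary_assembly.genome.fa.gz") would return one entry per sequence in the downloaded file. Reading the whole human genome this way takes a while, since every line is decompressed and counted.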

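Changing the origin of the SVG coordinate system

After drawing a tree with FigTree, I set the font size above 14, and the top line of text in the exported image was cut off, which made the figure ugly. The workaround is to export the figure as an SVG file and then shift the origin of the SVG coordinate system so that the whole figure is displayed.

Adding transform="translate(0,20)" at the end of the <svg xmlns… line solves the problem.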
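
The manual edit can also be scripted. The Python sketch below simply automates the change described above by appending the transform attribute to the opening <svg ...> tag. The function name shift_svg and its dx/dy parameters are my own, and the regular expression assumes that no attribute value in that tag contains a ">" character.

```python
import re


def shift_svg(svg_text, dx=0, dy=20):
    """Append transform="translate(dx,dy)" to the opening <svg ...> tag."""
    attribute = ' transform="translate({},{})"'.format(dx, dy)
    # Group 1 is the tag up to its last attribute, group 2 is the closing '>' or '/>'.
    return re.sub(
        r"(<svg\b[^>]*?)(\s*/?>)",
        lambda m: m.group(1) + attribute + m.group(2),
        svg_text,
        count=1,
    )
```

To use it, read the exported file into a string, pass it through shift_svg, and write the result to a new file. A larger dy moves the drawing further down.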

Composition of cellulose, hemicellulose and pectin, and their degrading enzymes

1. Cellulose

Cellulose is a dominant structural polysaccharide in plants composed of β-D-glucose units with β-1,4-linkages.

Cellulose decomposition requires multiple enzymes. In general, cellulose is degraded to cellodextrin or cellobiose by the synergistic action of two cellulases: endoglucanase (EC 3.2.1.4) and cellobiohydrolase (EC 3.2.1.91) (Tomme et al., 1995; Bayer et al., 1998). Degradation of cellodextrin or cellobiose into monomeric glucose units requires another enzyme, β-glucosidase (EC 3.2.1.21), that hydrolyzes non-reducing 1,4-linked-β-glucose (Henrissat et al., 1989).

2. Hemicellulose

Cellulose fibers are cross-linked by other polysaccharides called 'hemicelluloses' to increase the physical strength of the cell wall. Hemicelluloses include xylan (β-D-xylose units with β-1,4-linkages), glucomannan (β-D-mannose units and β-D-glucose units with β-1,4-linkages), xyloglucan (β-D-glucose units with β-1,4-linkages, and β-D-xylose and β-D-glucose units with β-1,6-linkages), 1,3-1,4-β-glucan (β-D-glucose units with β-1,3- and β-1,4-linkages), and a relatively small amount of other polysaccharides composed of β-D-glucose, β-D-xylose, β-D-mannose and other sugar units with various linkages (McNeill et al., 1984).

3. Pectin

The scaffold of cellulose and hemicelluloses is filled with pectin (α-D-galacturonic acid units with mainly α-1,4-linkages), which functions as a cement-like substance in the cell wall.

Reference:
Sakamoto, Kentaro, and Haruhiko Toyohara. “A comparative study of cellulase and hemicellulase activities of brackish water clam Corbicula japonica with those of other marine Veneroida bivalves.” Journal of Experimental Biology 212.17 (2009): 2812-2818.

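Displaying transcriptome data in GBrowse2 via the WIG format

1. Introduction to the WIG format

The WIG format (Wiggle Track Format) can be used to visualize transcriptome data. The bigWig format is the binary form of WIG; wigToBigWig converts a WIG file to bigWig.
An example WIG file: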

track type=wiggle_0 name="sampleA1" description="RNA-Seq read counts of species A"
variableStep chrom=chr01 span=10
10001    13
10011    15
10021    12
fixedStep chrom=chr01 start=100031 step=10 span=10
17
15
20

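Three points to note about the example above:

1. The first line must follow the format shown in the example. Only the values of the name and description parameters can be chosen freely.
2. There are two ways to describe the data: variableStep and fixedStep. Each data line of the former has 2 columns (position and value); each data line of the latter has only 1 column (value).
3. The parameters of these two methods mean the following:
    chrom    the sequence name
    start    the start position of the first locus in fixedStep
    step     the distance between successive loci in fixedStep
    span     the number of bases that one data value covers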
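
To make the difference between the two declarations concrete, here is a small Python sketch that reads WIG lines and yields one (chrom, start, span, value) tuple per data line. It is only an illustration of the rules above, not a full implementation of the format; the function name parse_wig is my own.

```python
def parse_wig(lines):
    """Yield (chrom, start, span, value) for each data line of a WIG file."""
    mode = None
    chrom = None
    span = 1
    position = None
    step = 1
    for raw in lines:
        line = raw.strip()
        # Skip blank lines, comments, and track/browser lines.
        if not line or line.startswith(("track", "#", "browser")):
            continue
        fields = line.split()
        if fields[0] in ("variableStep", "fixedStep"):
            mode = fields[0]
            params = dict(item.split("=", 1) for item in fields[1:])
            chrom = params["chrom"]
            span = int(params.get("span", 1))
            if mode == "fixedStep":
                position = int(params["start"])
                step = int(params.get("step", 1))
        elif mode == "variableStep":
            # Two columns: position and value.
            yield chrom, int(fields[0]), span, float(fields[1])
        elif mode == "fixedStep":
            # One column: the value; the position advances by step.
            yield chrom, position, span, float(fields[0])
            position += step
```

Calling list(parse_wig(open("sampleA1.wig"))) on the example file above gives six tuples: three from the variableStep block with explicit positions, and three from the fixedStep block whose positions are computed from start and step.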

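2. Converting a BAM file to a WIG file and compressing it

Use the bam2wig command to convert the BAM file to a WIG file. The bam2wig command is available from the Augustus software.

$ bam2wig sampleA1.tophat.bam > sampleA1.wig

The span value of this WIG file is 1, so the larger the genome, the larger the WIG file. Setting span to 10 reduces the file size effectively. For example, the following Perl program compresses the WIG file.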

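#!/usr/bin/perl
use strict;
use warnings;

my $usage = <<USAGE;
Usage:
    perl $0 RNA-Seq.wig > RNA-Seq.cutdown.wig
USAGE
if (@ARGV == 0) { die $usage }

open my $in, '<', $ARGV[0] or die "Cannot open $ARGV[0]: $!";

# Copy the track definition line unchanged.
$_ = <$in>;
print;

my $locus = 1;    # start position of the current 10-bp bin
my $count = 0;    # summed depth within the current bin

# Print the current bin as its rounded mean depth over 10 bases.
sub flush_bin {
    my $mean = int($count / 10 + 0.5);
    print "$locus\t$mean\n" if $mean > 0;
    $count = 0;
}

while (<$in>) {
    if (m/^variableStep/) {
        flush_bin();    # finish the last bin of the previous sequence
        s/$/ span=10/;
        print;
        $locus = 1;
    }
    elsif (m/(\d+)\s+(\d+)/) {
        my ($num1, $num2) = ($1, $2);
        if ($num1 >= $locus + 10) {
            flush_bin();
            $locus = $num1;
        }
        $count += $num2;
    }
}
flush_bin();    # finish the last bin of the file
close $in;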

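3. Converting the WIG file to binary WIG files and a GFF3 file

Use the wiggle2gff3.pl command shipped with GBrowse2 to convert the WIG file to binary WIG files and a GFF3 file. Each genome sequence gets one binary WIG file. A GFF3 file that points to all of the binary WIG files is generated at the same time.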

$ mkdir $PWD/gbrowse_track_of_RNA_seq
$ wiggle2gff3.pl --source=sampleA1 --method=RNA_Seq --path=$PWD/gbrowse_track_of_RNA_seq --trackname=track_A1 sampleA1.wig > sampleA1.gff3

4. Loading the GFF3 file into the database and configuring GBrowse

Load the GFF3 file:

$ bp_seqfeature_load.pl -a DBI::mysql -d gbrowse2_species -u train -p 123456 sampleA1.gff3

Configuration file:

[RNA_Seq_sampleA1_xyplot]
feature        = RNA_Seq:sampleA1
glyph          = wiggle_xyplot
graph_type     = boxes
height         = 50
scale          = right
description    = 1
category       = RNA-Seq:sampleA1
key            = Transcriptional Profile

[RNA_Seq_sampleA1_density]
feature        = RNA_Seq:sampleA1
glyph          = wiggle_density
height         = 30
bgcolor        = blue
description    = 1
category       = RNA-Seq:sampleA1
key            = Transcriptional Profile

Installing QIIME-1.9.1 on CentOS 6.5 (By Yue Zheng)

This method was contributed by Yue Zheng.

QIIME consists of native Python 2 code and additionally wraps many external applications. As a consequence of this pipeline architecture, QIIME has a lot of dependencies and can be very challenging to install.

1. Setting up qiime-deploy on CentOS

1.1 sudo vim /etc/yum.repos.d/zeromq.repo

Paste the following into that file:

[home_fengshuo_zeromq]
name=The latest stable of zeromq builds (CentOS_CentOS-6)
type=rpm-md
baseurl=http://download.opensuse.org/repositories/home:/fengshuo:/zeromq/CentOS_CentOS-6/
gpgcheck=1
gpgkey=http://download.opensuse.org/repositories/home:/fengshuo:/zeromq/CentOS_CentOS-6/repodata/repomd.xml.key
enabled=1

Save and exit that file

1.2 Install the qiime-deploy dependencies on your machine

sudo yum groupinstall -y "development tools"
sudo yum install -y ant compat-gcc-34-g77 java-1.6.0-openjdk java-1.6.0-openjdk-devel freetype freetype-devel zlib-devel mpich2 readline-devel zeromq zeromq-devel gsl gsl-devel libxslt libpng libpng-devel libgfortran mysql mysql-devel libXt libXt-devel libX11-devel mpich2 mpich2-devel libxml2 xorg-x11-server-Xorg dejavu* python-devel sqlite-devel tcl-devel tk-devel R R-devel ghc

2. Installing requisite Python and R packages

# Installing sqlite-devel
sudo yum install -y sqlite-devel

# Installing Python 2.7
wget https://www.python.org/ftp/python/2.7.8/Python-2.7.8.tgz
tar xf Python-2.7.8.tgz
cd Python-2.7.8
./configure --prefix=/usr
make && sudo make install

# Install setuptools & pip
# First get the setup script for Setuptools:
wget https://bitbucket.org/pypa/setuptools/raw/bootstrap/ez_setup.py
# Then install it for Python 2.7 :
sudo python2.7 ez_setup.py
# Now install pip using the newly installed setuptools:
sudo easy_install-2.7 pip
# With pip installed you can now do things like this:
pip2.7 install [packagename]

# Install virtualenv for Python 2.7
sudo pip2.7 install virtualenv

# Check the system Python interpreter version
python --version
# This will show Python 2.7.8

# You may find that yum no longer works at this point, because yum depends on Python 2.6. To fix this, edit the yum script so that it uses python2.6.
sudo vim /usr/bin/yum
# Replace "#!/usr/bin/python" with "#!/usr/bin/python2.6"

# Installing R packages
# Run R and execute the following commands
install.packages(c('ape', 'biom', 'optparse', 'RColorBrewer', 'randomForest', 'vegan'))
source('http://bioconductor.org/biocLite.R')
biocLite(c('DESeq2', 'metagenomeSeq'))
q()

3. Installing the latest QIIME release and its base dependencies with pip

sudo pip2.7 install numpy
sudo pip2.7 install qiime -v
# Note for users in China: pip may hang because of network restrictions. For example, if FastTree cannot be downloaded, fetch it over another network connection and place the package at a local address. Then download qiime-1.9.1.tar.gz, change the FastTree entry in setup.py to point to that copy, and place the modified qiime-1.9.1.tar.gz at the local address as well. Finally, run: sudo pip2.7 install qiime -v -i [local address]

# Installing QIIME 1.9.1's dependencies
# Download the zip packages of "qiime-deploy" and "qiime-deploy-conf" from GitHub
cd
unzip qiime-deploy-master.zip
unzip qiime-deploy-conf-master.zip
mkdir ~/qiime_software
cd qiime-deploy-master
sudo python2.7 qiime-deploy.py ~/qiime_software/ -f ~/qiime-deploy-conf-master/qiime/qiime-1.9.1/qiime.conf --force-remove-failed-dirs
# After this step, three lists are displayed: "Packages deployed successfully", "Packages skipped" and "Packages failed to deploy"
source ~/qiime_software/activate.sh
print_qiime_config.py -tf

# If some packages failed to install, install them manually.
# For example, usearch and AmpliconNoise failed to install here.

# Installing usearch manually
# Visit http://www.drive5.com/usearch/download.html and download USEARCH v5.2.236
# Move the binary file into /usr/bin, rename it to usearch, then run chmod 755 on it

# Installing AmpliconNoise manually
# Download AmpliconNoiseV1.27.tar.gz
tar -xvzf AmpliconNoiseV1.27.tar.gz
cd AmpliconNoiseV1.27
make clean
make
make install
echo "export PATH=$HOME/AmpliconNoiseV1.27/Scripts:$HOME/AmpliconNoiseV1.27/bin:$PATH" >> $HOME/.bashrc
echo "export PYRO_LOOKUP_FILE=$HOME/AmpliconNoiseV1.27/Data/LookUp_E123.dat" >> $HOME/.bashrc
echo "export SEQ_LOOKUP_FILE=$HOME/AmpliconNoiseV1.27/Data/Tran.dat" >> $HOME/.bashrc

# PATH Environment Variable
echo "export PATH=$HOME/bin/:$PATH" >> $HOME/.bashrc
source $HOME/.bashrc

# Final verification
source ~/qiime_software/activate.sh
print_qiime_config.py -tf

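Setting up a simple mail server

1. DNS records for the mail server

First, I created the following DNS records at the registrar Wanwang (万网):

Record type    Host record    Record value
A              mail           115.29.105.12
MX             @              mail.chenlianfu.com
TXT            @              v=spf1 a mx -all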

2. Postfix configuration on CentOS

Then edit the Postfix configuration file /etc/postfix/main.cf on the CentOS system, changing the following settings:

myhostname = mail.chenlianfu.com
mydomain = chenlianfu.com
myorigin = $mydomain
inet_interfaces = all
inet_protocols = ipv4
mydestination = $myhostname, localhost.$mydomain, localhost, $mydomain
mynetworks = 127.0.0.0/8, 168.100.189.0/28, hash:/etc/postfix/access
relay_domains = $mydestination
home_mailbox = Maildir/
mail_spool_directory = /var/spool/mail
message_size_limit = 52428800

Then run the following commands to start the Postfix service:

# postmap hash:/etc/postfix/access 
# postalias hash:/etc/aliases
# /etc/init.d/postfix check
# /etc/init.d/postfix restart

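3. Sending mail with the mail command

Options of the mail command:

-s subject
    The subject of the message. Quote the subject if it contains spaces.
-a attachment
    Send the given file as an attachment. To send several attachments, repeat this option.
-c address
    Send carbon copies to a list of addresses, separated by commas. The carbon-copy addresses and the recipient address are visible to every recipient.
-b address
    Send blind carbon copies to a list of addresses, separated by commas. Blind-carbon-copy addresses are not visible to the other recipients. Note that the mail command cannot send the message to each address in a list individually.

Usage examples: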

$ mail -s "a e-mail subject" -a ./test.tar.gz chenllianfu@foxmail.com < mail_content
$ cat mail_content | mail -s "a e-mail subject" -a ./test.tar.gz chenllianfu@foxmail.com
$ echo "mail_content" | mail -s "a e-mail subject" -a ./test.tar.gz chenllianfu@foxmail.com
$ mail -s "a e-mail subject" -a ./test.tar.gz chenllianfu@foxmail.com
input
EOT
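
When mail is sent from a script, it helps to build the argument list in one place. The Python sketch below assembles the command from the options described above (-s, -a, -c, -b); it only builds the list and does not send anything. The function name build_mail_command and the example.com addresses are hypothetical.

```python
def build_mail_command(subject, recipients, attachments=(), cc=(), bcc=()):
    """Return the argument list for the mail command described above."""
    command = ["mail", "-s", subject]
    for path in attachments:
        # One -a option per attachment.
        command += ["-a", path]
    if cc:
        # Carbon-copy addresses are joined with commas.
        command += ["-c", ",".join(cc)]
    if bcc:
        # Blind-carbon-copy addresses are joined with commas.
        command += ["-b", ",".join(bcc)]
    command += list(recipients)
    return command
```

The resulting list can be handed to subprocess.run(command, input=body, text=True), which passes the message body on standard input in the same way as the echo and cat examples above. Because the subject is a single list element, no extra shell quoting is needed for subjects with spaces.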

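Installing FileZilla on CentOS

Compiling or running the newer versions requires recent versions of GCC and wxWidgets, so the newer FileZilla releases are not recommended. The FileZilla website only offers a download link for the latest version; older versions can be downloaded from SourceForge.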

$ wget http://sourceforge.net/projects/filezilla/files/FileZilla_Client/3.5.3/FileZilla_3.5.3_x86_64-linux-gnu.tar.bz2
$ tar jxf FileZilla_3.5.3_x86_64-linux-gnu.tar.bz2 -C /opt/
$ echo 'PATH=$PATH:/opt/FileZilla3/bin/' >> ~/.bashrc
$ source ~/.bashrc 
$ filezilla

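A brief description of the AGP format

The AGP file is a standard format required for NCBI data submission. It describes how small sequences (such as contigs) are assembled into larger sequences (such as scaffolds and chromosomes). For the detailed specification, see: http://www.ncbi.nlm.nih.gov/projects/genome/assembly/agp/AGP_Specification.shtml
An AGP file has 9 columns:

1. Name of the large sequence (object)
2. Start on the large sequence (object_begin)
3. End on the large sequence (object_end)
4. Number of this part within the large sequence (part_number)
    A large sequence usually consists of several small sequences and gaps. This column numbers those small sequences and gaps along the large sequence.
5. Type of this part (component_type)
    The common values are W, N and U. W denotes a WGS contig; N denotes a gap of specified size; U denotes a gap of unknown length, for which 100 bp is normally used.
6. ID of the small sequence, or gap length (component_id or gap_length)
    If column 5 is not N or U, this column is the ID of the small sequence.
    If column 5 is N or U, this column is the gap length. If column 5 is U, this value must be 100.
7. Start of the small sequence, or gap type (component_begin or gap_type)
    If column 5 is N or U, this column gives the gap type. The common value is scaffold, a gap between 2 contigs within a scaffold. Other values are:
    contig             an unspanned gap between 2 contig sequences; because no evidence supports such a gap, the large sequence should be broken at it
    centromere         a gap for the centromere
    short_arm          a gap inserted at the start of an acrocentric chromosome
    heterochromatin    a gap inserted for an especially large region of heterochromatic sequence
    telomere           a gap inserted for the telomere
    repeat             an unresolvable repeat
8. End of the small sequence, or whether the gap is linked (component_end or linkage)
    If column 5 is N or U, this column is usually yes, meaning there is evidence that the 2 adjacent small sequences are linked.
9. Orientation of the small sequence, or linkage evidence for the gap (orientation or linkage_evidence)
    If column 5 is not N or U, this column is the orientation of the small sequence. The common values are +, - and ?.
    If column 5 is N or U, this column gives the type of evidence that links the 2 adjacent small sequences. The common value is paired-ends, meaning that paired reads join the small sequences. Other values are:
    na               used when column 8 is no
    align_genus      linked by alignment to a reference genome of the same genus
    align_xgenus     linked by alignment to a reference genome of another genus
    align_trnscpt    linked by alignment to a transcript sequence from the same species
    within_clone     the sequences on both sides of the gap come from the same clone, but no paired-ends span the gap, so the order and orientation of the flanking small sequences cannot be determined
    clone_contig     linkage is provided by a clone contig in the tiling path (TPF)
    map              linkage determined from a linkage map, an optical map or a similar method
    strobe           linkage derived from PacBio sequences
    unspecified
    If there are several kinds of evidence, list them all, separated by semicolons.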
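
Because columns 6 to 9 change meaning depending on column 5, a parser has to branch on the component type. The Python sketch below shows one way to do that, using the column names listed above as dictionary keys. The function name parse_agp_line is my own, and the scaffold and contig names in the usage note are made-up examples, not taken from a real assembly.

```python
GAP_TYPES = ("N", "U")


def parse_agp_line(line):
    """Split one AGP line into a dict keyed by the column names listed above."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError("expected 9 tab-separated columns, got %d" % len(fields))
    record = {
        "object": fields[0],
        "object_begin": int(fields[1]),
        "object_end": int(fields[2]),
        "part_number": int(fields[3]),
        "component_type": fields[4],
    }
    if fields[4] in GAP_TYPES:
        # Gap line: columns 6-9 describe the gap.
        record.update(
            gap_length=int(fields[5]),
            gap_type=fields[6],
            linkage=fields[7],
            # Several kinds of evidence are separated by semicolons.
            linkage_evidence=fields[8].split(";"),
        )
    else:
        # Component line: columns 6-9 describe the small sequence.
        record.update(
            component_id=fields[5],
            component_begin=int(fields[6]),
            component_end=int(fields[7]),
            orientation=fields[8],
        )
    return record
```

For instance, a hypothetical gap line "scaffold_1", 1001, 1100, 2, "N", 100, "scaffold", "yes", "paired-ends" (tab-separated) yields a record with gap_length 100 and linkage_evidence ["paired-ends"], while a W line yields component_id, component_begin, component_end and orientation instead.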

Examples:
Scaffold from component (WGS)
Chromosome from scaffold (WGS)