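makeblastdb parameters in BLAST+ explained

Posted on August 22, 2012 by chenlianfu

Usually we have a FASTA file that we want to format into a database. The old command was formatdb; in BLAST+ it is makeblastdb.

The typical form is: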

makeblastdb -in input_file -dbtype molecule_type -title database_title -parse_seqids -out database_name -logfile File_Name

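Note: in BLAST+ 2.2.24, do not add -parse_seqids to this command, or it ends up in an infinite loop.

-in the input file: the FASTA sequences to format
-dbtype the sequence type: nucl for nucleotide, prot for protein
-title a title for the database, purely cosmetic (it cannot be used as the -db argument in later searches)
-parse_seqids recommended: it parses the sequence IDs in the FASTA deflines so that sequences can later be retrieved from the database by ID
-out the database name; pick a meaningful one, since this is the value passed to -db in later BLAST+ searches
-logfile the log file; if omitted, messages go to the screen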
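
The parameters above can be put together as in the following minimal sketch. The file name demo.fa, the database name demo_db and the helper build_cmd are made up for illustration; the command is only executed when makeblastdb is found on the PATH, otherwise the printed line serves as a dry run.

```shell
# Hypothetical two-sequence nucleotide FASTA, used only for illustration.
cat > demo.fa <<'EOF'
>seq1 first demo sequence
ACGTACGTACGTACGTACGT
>seq2 second demo sequence
TTGACCATGGCAATCGGATC
EOF

# Assemble the makeblastdb call from the parameters described above.
# $1 = input FASTA, $2 = nucl or prot, $3 = database name (also used as title)
build_cmd() {
    echo "makeblastdb -in $1 -dbtype $2 -title $3 -parse_seqids -out $3 -logfile $3.log"
}

cmd=$(build_cmd demo.fa nucl demo_db)
echo "$cmd"

# Run it only when BLAST+ is installed.
if command -v makeblastdb >/dev/null 2>&1; then
    $cmd || echo "makeblastdb failed, see demo_db.log" >&2
fi
```

On BLAST+ 2.2.24, remove -parse_seqids from build_cmd, as the note above warns.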

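MegaBlast/Discontiguous MegaBlast/BlastN and Blastp/PSI-Blast/PHI-BLAST: differences and how to choose

Posted on August 22, 2012 by chenlianfu

According to the brief help on the blastn page, Highly similar sequences (megablast) is meant for comparing highly similar sequences (above 95% identity) and is fast. More dissimilar sequences (discontiguous megablast) handles sequences somewhat less similar than those megablast targets, is more sensitive, and is mostly used for homology searches between species. Somewhat similar sequences (blastn) is for sequences with low similarity; it can match words as short as 7 bases, so it is the most sensitive, returns the most hits, and is the slowest.

So choose according to the sequence you submit and the purpose of the search. To check whether a sequence is already in the database, use megablast. To annotate sequences from an unannotated species with gene annotations from other species, use discontiguous megablast. To get the most complete set of results, use blastn.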
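
On the command line, the same choice is made with the -task option of blastn (megablast, dc-megablast, or blastn). The sketch below only prints the commands; the goal names, the helper choose_task, query.fa and the nt database are placeholders chosen for illustration.

```shell
# Map a search goal to a blastn -task value, following the guidance above.
choose_task() {
    case "$1" in
        lookup)   echo megablast ;;      # is this sequence already in the database?
        annotate) echo dc-megablast ;;   # cross-species homology search
        *)        echo blastn ;;         # most complete, slowest search
    esac
}

# Print one command line per goal; query.fa and nt are placeholder names.
for goal in lookup annotate exhaustive; do
    echo "blastn -task $(choose_task "$goal") -query query.fa -db nt -out $goal.out"
done
```

Note that the command-line blastn program defaults to the megablast task, so -task blastn has to be given explicitly to get the slower, more sensitive search.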

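That covers blastn; now for blastp. blastp also offers three algorithms to choose from:

blastp (protein-protein BLAST) simply compares protein against protein to find similar protein sequences;

PSI-BLAST (Position-Specific Iterated BLAST) searches the protein database with the query protein repeatedly. All statistically significant protein sequences found in the previous round are combined into a new scoring matrix, and the search is iterated until no new statistically significant proteins are found;

PHI-BLAST (Pattern Hit Initiated BLAST) lets you restrict the search to a protein pattern, and reports only alignments that contain this pattern.

Blastp/PSI-Blast/PHI-BLAST are all BLAST comparisons of protein sequences against protein sequences.

1. Blastp: the standard protein-to-protein comparison. Standard protein BLAST is designed for protein searches. Blastp finds sequences in a protein database that are similar to the query amino acid sequence. Like the other BLAST programs, its aim is to find regions of similarity.

2. PSI-BLAST: a more sensitive protein-to-protein comparison. PSI-BLAST is designed for more sensitive protein-protein similarity searches. Position-Specific Iterated (PSI)-BLAST is a more sensitive version of Blastp that is very effective at finding similar proteins in distantly related species or new members of a protein family. When a standard Blastp search fails, or returns only hypothetical or predicted entries ("hypothetical protein" or "similar to..."), try again with PSI-BLAST.

3. PHI-BLAST: Pattern Hit Initiated BLAST. PHI-BLAST can do a restricted protein pattern search. It searches a protein database with a protein query and reports only those alignments that contain the specific pattern present in the query sequence.

For details of the PHI pattern syntax, see: http://www.ncbi.nlm.nih.gov/blast/html/PHIsyntax.html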
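
In BLAST+, the three protein searches can be sketched as below. PSI-BLAST and PHI-BLAST are both run through the psiblast program, using its -num_iterations and -phi_pattern options. The helper protein_search_cmd, the file names query.fa and pattern.txt, the nr database, and the choice of 3 iterations are assumptions for illustration; the commands are printed rather than executed.

```shell
# Print (not run) one command line per search mode.
# query.fa, nr and pattern.txt are placeholder names.
protein_search_cmd() {
    case "$1" in
        blastp) echo "blastp -query query.fa -db nr -out blastp.out" ;;
        psi)    echo "psiblast -query query.fa -db nr -num_iterations 3 -out psiblast.out" ;;
        phi)    echo "psiblast -query query.fa -db nr -phi_pattern pattern.txt -out phiblast.out" ;;
        *)      echo "unknown mode: $1" >&2; return 1 ;;
    esac
}

for mode in blastp psi phi; do
    protein_search_cmd "$mode"
done
```

Here pattern.txt stands for a file holding the pattern written in the PHI syntax linked above.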

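Running BLAST locally through a web interface

Posted on August 22, 2012 by chenlianfu

1 First, download the latest version of wwwblast from ftp://ftp.ncbi.nih.gov/blast/executables/release/LATEST/ (at the time of writing the file was wwwblast-2.2.26-x64-linux.tar.gz).

2 Then unpack the file; it works without running configure. Map the unpacked directory to an Apache path.

Add the following lines to /etc/httpd/conf/httpd.conf: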

Alias /blast "/home/chenlianfu/programs/wwwblast"
<Directory "/home/chenlianfu/programs/wwwblast">
    Options ExecCGI
    AllowOverride None
    Order allow,deny
    Allow from all
</Directory>

3 wwwblast ships with two test databases, test_na_db and test_aa_db.

4 Use the regular makeblastdb to build your database, and put the database files under db/ in the wwwblast directory.

5. Edit the blast.html file.

<select name = "DATALIB">
    <option VALUE = "nt"> nt <!-- nt is the name of a database in db/ -->
</select>

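6. Edit the blast.rc configuration file.

NumCpuToUse     6	# number of CPUs (threads) to use

blastn nt	# names of the databases the blastn program may use
tblastn nt	# likewise below...
tblastx nt
blastp
blastx

7 With this in place, the nt database can be used from the web page.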

UniProtKB

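Posted on August 22, 2012 by chenlianfu

1. What is UniProtKB?
UniProtKB, the UniProt Knowledgebase, has two main parts: UniProtKB/Swiss-Prot and UniProtKB/TrEMBL. UniProtKB/Swiss-Prot is a manually annotated and reviewed protein database, drawing on experimental results, computed features, and scientific conclusions. Because the carefully curated Swiss-Prot could not grow fast enough to cover all protein sequences, UniProtKB/TrEMBL was added; its proteins are annotated and classified automatically, mainly by computation.
2. Downloading UniProtKB
The UniProt databases are updated every 4 weeks. The download addresses are: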

ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz
ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_trembl.fasta.gz
uniprot_sprot.fasta.gz 56M
uniprot_trembl.fasta.gz 5.3G
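
As a rough sketch, one of these files could be fetched and formatted into a protein BLAST database as follows. Combining the download with makeblastdb is an assumption made here for illustration, as are the helper uniprot_steps and the database names; the function prints the steps instead of running them, so nothing is downloaded unless its output is piped to sh.

```shell
base=ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete

# Print the steps for one UniProtKB section ($1 = sprot or trembl).
uniprot_steps() {
    f=uniprot_$1.fasta
    echo "wget -c $base/$f.gz"
    echo "gzip -d $f.gz"
    echo "makeblastdb -in $f -dbtype prot -title uniprot_$1 -out uniprot_$1"
}

uniprot_steps sprot
```

To actually perform the steps, run uniprot_steps sprot | sh. Given the sizes listed above, the TrEMBL file takes far longer to download and format than Swiss-Prot.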

Installing and using a local blast2go database

Posted on August 22, 2012 by chenlianfu

1. Download all necessary files

$ wget -c -b -N --progress=dot:mega http://archive.geneontology.org/latest-full/go_201310-assocdb-data.gz
$ wget -c -b -N --progress=dot:mega ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene_info.gz
$ wget -c -b -N --progress=dot:mega ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2accession.gz   
$ wget -c -b -N --progress=dot:mega ftp://ftp.pir.georgetown.edu/databases/idmapping/idmapping.tb.gz
$ wget -c -b -N --progress=dot:mega http://blast2go.com/data/blast2go/b2g4pipe_v2.5.zip
$ wget -c -b -N --progress=dot:mega http://blast2go.com/public-data/local_b2g_db.zip

2. Unzip all files

$ gzip -dv go_201310-assocdb-data.gz
$ gzip -dv gene_info.gz
$ gzip -dv gene2accession.gz
$ gzip -dv idmapping.tb.gz
$ unzip b2g4pipe_v2.5.zip
$ unzip local_b2g_db.zip
$ cp  local_b2g_db/* .

4. Execute the file b2gdb.sql to create a database (default name: b2gdb), additional tables and a public user for restricted access (select only).

Create a new file install_blast2goDB.sh with the following content:

#!/bin/sh
godbname=go_201310-assocdb-data
dbname=b2g
dbuser=root
dbpass=passwd     # change this
dbhost=127.0.0.1
path=$PWD         # directory that holds the data files

# Create the Blast2GO tables.
mysql -h"$dbhost" -P 3306 -u"$dbuser" -p"$dbpass" "$dbname" < b2gdb.sql

# Load the GO dump, then the two NCBI gene tables.
mysql -h"$dbhost" -u"$dbuser" -p"$dbpass" "$dbname" < "$godbname"
mysql -h"$dbhost" -u"$dbuser" -p"$dbpass" "$dbname" -e "LOAD DATA LOCAL INFILE '$path/gene2accession' INTO TABLE gene2accession FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n';"
mysql -h"$dbhost" -u"$dbuser" -p"$dbpass" "$dbname" -e "LOAD DATA LOCAL INFILE '$path/gene_info' INTO TABLE gene_info FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n';"

echo "Finished, now use Blast2GO to import the PIR mapping data-file"

Make install_blast2goDB.sh executable and run it. This step imports go_201310-assocdb-data, gene2accession and gene_info into the database (b2g in the script above; b2gdb is the default name).

5. Import the last mapping file from PIR

In the new version, run it like this:

java -cp .:mysql-connector-java-5.0.8-bin.jar: ImportIdMapping \
idmapping.tb localhost b2gdb blast2go blast4it

In the new version, local_b2g_db.zip contains three files: b2gdb.sql, ImportIdMapping.class and mysql-connector-java-5.0.8-bin.jar. The b2gdb.sql file creates the MySQL database and tables: it creates the database b2gdb and a MySQL user blast2go (password blast4it) with full privileges on b2gdb. The command above needs both ImportIdMapping.class and mysql-connector-java-5.0.8-bin.jar to be present.
This step imports idmapping.tb into the gi2uniprot table of the local b2gdb database.

In the old version, run it like this:

/home/chenlianfu/programs/jre1.7.0_04/bin/java -cp /home/chenlianfu/programs/blast2go/b2g4pipe/blast2go.jar:/home/chenlianfu/programs/blast2go/b2g4pipe/ext/mysql-connector-java-3.0.11-stable-bin.jar es.blast2go.prog.util.ImportPIR idmapping.tb localhost b2g root passwd TRUE

With the old version, be sure to type the command out by hand. If you copy and paste it, you will usually get the following error:

Problem connecting to database b2g on localhost as root with password starts with *********: com.mysql.jdbc.Driver

A successful import instead prints:

Open database connection to database b2g on 127.0.0.1 as root with password starts with *********
Open database connection to database b2g on 127.0.0.1 as root with password starts with *********

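The pasted command and the hand-typed one look identical, but some character may in fact differ in a way the machine does not accept. Watch out for this, or you will never get past step 5.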
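
One way to test the suspicion above is to scan the pasted command for bytes outside printable ASCII, such as a typographic dash or curly quote picked up from a web page. The sketch below is only one possible check; the helper find_odd_chars and the shortened example commands are made up for illustration. An error that names com.mysql.jdbc.Driver may also mean that the connector jar was not found on the classpath, so the jar path is worth checking as well.

```shell
# Print every line of the given text that contains a byte outside printable ASCII.
find_odd_chars() {
    printf '%s\n' "$1" | LC_ALL=C grep -a -n '[^ -~]'
}

# Hand-typed version: plain ASCII hyphen before cp.
typed='java -cp blast2go.jar es.blast2go.prog.util.ImportPIR'

# Pasted version: the bytes \342\200\223 are the UTF-8 encoding of a typographic dash.
pasted=$(printf 'java \342\200\223cp blast2go.jar es.blast2go.prog.util.ImportPIR')

find_odd_chars "$typed"  || echo "typed: clean"
find_odd_chars "$pasted" && echo "pasted: suspicious character found"
```

The function returns success only when grep finds at least one offending line, so it can be used directly in a condition before running a pasted command.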

You can check with a MySQL query how many rows were imported into the table:

mysql> SELECT COUNT(*) FROM gi2uniprot;
+----------+
| COUNT(*) |
+----------+
| 40344363 |
+----------+
1 row in set (0.00 sec)

6. Run blast2go

1. Running with the GUI

If you hit a problem you cannot solve, try running blast2go as a brand-new user.
To run offline using b2g4pipe:

$ java -Xmx1000m -cp '*:ext/*:' es.blast2go.Blast2GO

To run online without b2g4pipe:

Formerly:
$ javaws 'http://www.blast2go.com/webstart/makeJnlp.php?mem=1000'
Now:
$ javaws http://www.blast2go.com/webstart/blast2go1000.jnlp

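2. Running from the command line

Command-line runs use b2g4pipe. First edit the database parameters in b2gPipe.properties, then run. The blast and interproscan results must be available before the command-line run. There are two ways to run it:

$ java -Xmx1000m -cp '*:ext/*:' es.blast2go.prog.B2GAnnotPipe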
$ java -Xmx1000m -jar blast2go.jar


Chen Lianfu's Bioinformatics Blog
