Mining SNPs and INDELs de novo by combining GATK and SAMtools

The problem

We start from the following materials:
1. The genome FASTA file of a species: genome.fasta
2. Illumina HiSeq 2000 paired-end sequencing reads (300 bp insert library, ~20x genome coverage) from a genomic DNA sample of that species named sample: reads1.fq, reads2.fq
The question: how do we call SNPs and INDELs from these data?

Much of this article is drawn from Nowind's blog post: An introduction to GATK, Part 2/2.
The workflow follows the GATK Best Practices: http://www.broadinstitute.org/gatk/guide/topic?name=best-practices

The concrete answer and steps are given below. The directory containing the three files above is taken as the working directory, the home directory of each piece of software is written as $software, and multithreaded programs are run with 8 threads.

1. Processing the raw reads

Filter the raw reads with the NGS QC Toolkit, using its default settings.

$ perl $NGSQCToolkitHome/QC/IlluQC_PRLL.pl \
  -pe reads1.fq reads2.fq 2 5 -c 8 \
  -o QC

2. Map the filtered reads to the genome

$ bowtie2-build genome.fasta genome
$ bowtie2 --rg-id sample --rg "PL:ILLUMINA" --rg "SM:sample" \
  -x genome -1 ./QC/reads1.fq_filtered -2 ./QC/reads2.fq_filtered \
  -p 8 -S sample.sam

3. Convert the SAM file to BAM and mark PCR duplicates

$ samtools view -bS sample.sam > sample.bam
$ java -Xmx10g -jar $picardHome/SortSam.jar \
  INPUT=sample.bam OUTPUT=sample.sort.bam \
  SORT_ORDER=coordinate
$ java -Xmx10g -jar $picardHome/MarkDuplicates.jar \
  INPUT=sample.sort.bam OUTPUT=sample.dd.bam \
  METRICS_FILE=sample.dd.metrics \
  MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=1000

4. Build the index files

$ samtools faidx genome.fasta
$ java -Xmx10g -jar $picardHome/CreateSequenceDictionary.jar \
  R=genome.fasta O=genome.dict
$ samtools index sample.dd.bam

5. Realignment around INDELs

$ java -Xmx10g -jar $GATKHome/GenomeAnalysisTK.jar \
  -R genome.fasta -T RealignerTargetCreator \
  -I sample.dd.bam -o sample.realn.intervals
$ java -Xmx10g -jar $GATKHome/GenomeAnalysisTK.jar \
  -R genome.fasta -T IndelRealigner \
  -targetIntervals sample.realn.intervals \
  -I sample.dd.bam -o sample.realn.bam

6. First round of variant calling

6.1 SNP and INDEL calling with GATK and SAMtools

$ java -Xmx10g -jar $GATKHome/GenomeAnalysisTK.jar \
  -R genome.fasta -T UnifiedGenotyper \
  -I sample.realn.bam -o sample.gatk.raw1.vcf \
  --read_filter BadCigar -glm BOTH \
  -stand_call_conf 30.0 -stand_emit_conf 0
$ samtools mpileup -DSugf genome.fasta sample.realn.bam | \
  bcftools view -Ncvg - > sample.samtools.raw1.vcf

6.2 Combine and filter the variant calls

$ java -Xmx10g -jar $GATKHome/GenomeAnalysisTK.jar \
  -R genome.fasta -T SelectVariants \
  --variant sample.gatk.raw1.vcf \
  --concordance sample.samtools.raw1.vcf \
  -o sample.concordance.raw1.vcf
$ MEANQUAL=`awk '{ if ($1 !~ /#/) { total += $6; count++ } } \
  END { print total/count }' sample.concordance.raw1.vcf`
$ java -Xmx10g -jar $GATKHome/GenomeAnalysisTK.jar \
  -R genome.fasta -T VariantFiltration \
  --filterExpression "QD < 20.0 || ReadPosRankSum < -8.0 || \
  FS > 10.0 || QUAL < $MEANQUAL" \
  --filterName LowQualFilter --variant sample.concordance.raw1.vcf \
  --missingValuesInExpressionsShouldEvaluateAsFailing \
  --logging_level ERROR -o sample.concordance.flt1.vcf
$ grep -v "Filter" sample.concordance.flt1.vcf > \
  sample.concordance.filter1.vcf

7. Base quality score recalibration of the realigned BAM from step 5

$ java -Xmx10g -jar $GATKHome/GenomeAnalysisTK.jar \
  -R genome.fasta -T BaseRecalibrator \
  -I sample.realn.bam -o sample.recal_data.grp \
  -knownSites sample.concordance.filter1.vcf
$ java -Xmx10g -jar $GATKHome/GenomeAnalysisTK.jar \
  -R genome.fasta -T PrintReads \
  -I sample.realn.bam -o sample.recal.bam \
  -BQSR sample.recal_data.grp

8. Second round of variant calling

$ java -Xmx10g -jar $GATKHome/GenomeAnalysisTK.jar \
  -R genome.fasta -T UnifiedGenotyper \
  -I sample.recal.bam -o sample.gatk.raw2.vcf \
  --read_filter BadCigar -glm BOTH \
  -stand_call_conf 30.0 -stand_emit_conf 0
$ samtools mpileup -DSugf genome.fasta sample.recal.bam | \
  bcftools view -Ncvg - > sample.samtools.raw2.vcf
$ java -Xmx10g -jar $GATKHome/GenomeAnalysisTK.jar \
  -R genome.fasta -T SelectVariants \
  --variant sample.gatk.raw2.vcf \
  --concordance sample.samtools.raw2.vcf \
  -o sample.concordance.raw2.vcf
$ java -Xmx10g -jar $GATKHome/GenomeAnalysisTK.jar \
  -R genome.fasta -T VariantFiltration \
  --filterExpression "QD < 10.0 || ReadPosRankSum < -8.0 || \
  FS > 10.0 || QUAL < 30" \
  --filterName LowQualFilter --variant sample.concordance.raw2.vcf \
  --missingValuesInExpressionsShouldEvaluateAsFailing \
  --logging_level ERROR -o sample.concordance.flt2.vcf
$ grep -v "Filter" sample.concordance.flt2.vcf > \
  sample.concordance.filter2.vcf
$ cp sample.concordance.filter2.vcf sample.final.vcf

The final result file is sample.final.vcf. For a detailed explanation of the VCF format, see: http://www.hzaumycology.com/chenlianfu_blog/?p=1514
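
If separate SNP and INDEL files are wanted downstream, the combined callset can be split with GATK SelectVariants. A minimal sketch, assuming the GATK build in use supports the -selectType (--selectTypeToInclude) option; the output names are only illustrative:

# keep only SNP records
$ java -Xmx10g -jar $GATKHome/GenomeAnalysisTK.jar \
  -R genome.fasta -T SelectVariants \
  --variant sample.final.vcf \
  -selectType SNP -o sample.final.snp.vcf
# keep only INDEL records
$ java -Xmx10g -jar $GATKHome/GenomeAnalysisTK.jar \
  -R genome.fasta -T SelectVariants \
  --variant sample.final.vcf \
  -selectType INDEL -o sample.final.indel.vcf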

GATK UnifiedGenotyper for variant calling

For the impatient:

Parameters marked with * are the most commonly used.

$ java -Xmx20g -jar GenomeAnalysisTK.jar \  # run the GATK main program
*    -R ref.fasta \  # reference sequence
*    -T UnifiedGenotyper \  # the GATK tool to run
*    -I sample1.bam [-I sample2.bam] [-I ...] \  # input BAM alignment file(s)
     --dbsnp dbSNP.vcf \  # add variant IDs from this file to the output
     --genotyping_mode GENOTYPE_GIVEN_ALLELES --alleles known.vcf \  # only genotype the known variants
*    --genotype_likelihoods_model [SNP/INDEL/BOTH] \  # call SNPs, INDELs, or both at once
     --min_base_quality_score 17 \  # minimum base quality for variant calling; ignored for INDEL calling, where low-quality 3' read ends are instead clipped down to Q20 (see the parameter descriptions below)
     --min_indel_count_for_genotyping 5 \  # at least this many reads must show an indel at a site
     --min_indel_fraction_per_sample 0.25 \  # at least this fraction of reads at a site must show the indel
*    --out gatk.vcf \  # output file; standard output if omitted
*    --output_mode [EMIT_VARIANTS_ONLY/EMIT_ALL_CONFIDENT_SITES/EMIT_ALL_SITES] \  # output mode; by default only variant sites are reported
     --sample_ploidy 2 \  # ploidy of the sample
*    --standard_min_confidence_threshold_for_calling 30.0 \  # confidence threshold for calling a variant; sites below it are flagged as low quality
*    --standard_min_confidence_threshold_for_emitting 30.0 \  # sites below this value are not reported in the VCF output

1. Overview of UnifiedGenotyper

UnifiedGenotyper is one of the main tools in GATK (Genome Analysis Toolkit) and is used for variant calling. The GATK website describes it as: A variant caller which unifies the approaches of several disparate callers — Works for single-sample and multi-sample data.

UnifiedGenotyper can call SNPs and INDELs on single or multiple samples. It uses a Bayesian genotype likelihood model to assign genotypes and estimate allele frequencies across the N samples.

2. Basic usage and input requirements

An example of using UnifiedGenotyper to call SNPs on multiple samples:

$ java -jar GenomeAnalysisTK.jar \
    -R resources/Homo_sapiens_assembly18.fasta \
    -T UnifiedGenotyper \
    -I sample1.bam [-I sample2.bam] \
    --dbsnp dbSNP.vcf \
    -o snps.raw.vcf \
    -stand_call_conf [50.0] \
    -stand_emit_conf 10.0 \
    -dcov [50 for 4x, 200 for >30x WGS or Whole exome] \
    [-L targets.interval_list]

The command above takes two input files: the reference FASTA and a BAM alignment. These two alone are not enough, however: the FASTA also needs a .dict file and a .fai index, and the BAM needs a .bai index. They are generated with Picard and SAMtools as follows:

$ java -jar $picardHome/CreateSequenceDictionary.jar R=./resources/Homo_sapiens_assembly18.fasta O=./resources/Homo_sapiens_assembly18.dict
$ samtools faidx ./resources/Homo_sapiens_assembly18.fasta
$ samtools index sample1.bam

3. Additional information

3.1 Read filters

When UnifiedGenotyper runs, reads are automatically filtered. The read filters applied are listed below (an example of requesting an extra filter follows the list):
DuplicateReadFilter
FailsVendorQualityCheckFilter
NotPrimaryAlignmentFilter
MalformedReadFilter
BadMateFilter
MappingQualityUnavailableFilter
UnmappedReadFilter
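
Additional filters can be requested on top of these with the --read_filter (-rf) engine argument, as the pipeline above already does for BadCigar. A minimal sketch (the output name is only illustrative):

$ java -Xmx10g -jar $GATKHome/GenomeAnalysisTK.jar \
  -R genome.fasta -T UnifiedGenotyper \
  -I sample.realn.bam -o sample.vcf \
  --read_filter BadCigar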

3.2 Parallel execution

The tool can run multithreaded using the following options:
NanoSchedulable (-nct)
Number of CPU threads per data thread, i.e. how many CPUs may be used by the computation on a single chunk of data.
TreeReducible (-nt)
Number of data threads. The data are split into this many parts, each processed by its own thread (maximum 24). Merging the results at the end may take extra time, and memory use increases proportionally.
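
As a sketch, the following spreads one UnifiedGenotyper run over 4 data threads with 2 CPU threads each, i.e. 8 CPUs in total (the output name is only illustrative):

# -nt: number of data threads; -nct: CPU threads per data thread
$ java -Xmx10g -jar $GATKHome/GenomeAnalysisTK.jar \
  -R genome.fasta -T UnifiedGenotyper \
  -I sample.realn.bam -o sample.vcf \
  -nt 4 -nct 2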

3.3 Downsampling settings

This tool overrides the engine’s default downsampling settings.

Mode: BY_SAMPLE
To coverage: 250
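
The cap can be changed with the engine's -dcov (--downsample_to_coverage) argument, as in the multi-sample example above. A sketch that raises the per-sample cap for deeper data (the value 500 is only illustrative):

$ java -Xmx10g -jar $GATKHome/GenomeAnalysisTK.jar \
  -R genome.fasta -T UnifiedGenotyper \
  -I sample.realn.bam -o sample.vcf \
  -dcov 500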

3.4 Window size

This tool uses a sliding window on the reference.

Window start: -200 bp before the locus
Window stop: 200 bp after the locus

4. UnifiedGenotyper parameters

4.1 GATK engine arguments

The tool uses the arguments shared by all GATK tools: CommandLineGATK.

4.2 Arguments specific to UnifiedGenotyper

  • --alleles / -alleles RodBinding[VariantContext] default: none

The set of alleles at which to genotype when --genotyping_mode is GENOTYPE_GIVEN_ALLELES. When the UnifiedGenotyper is put into GENOTYPE_GIVEN_ALLELES mode it will genotype the samples using only the alleles provided in this rod binding. --alleles binds reference ordered data. This argument supports ROD files of the following types: BCF2, VCF, VCF3

  • --annotateNDA / -nda default: false

If provided, we will annotate records with the number of alternate alleles that were discovered (but not necessarily genotyped) at a given site. Depending on the value of the --max_alternate_alleles argument, we may genotype only a fraction of the alleles being sent on for genotyping. Using this argument instructs the genotyper to annotate (in the INFO field) the number of alternate alleles that were originally discovered at the site.

  • --annotation / -A List[String] default: none

One or more specific annotations to apply to variant calls. Which annotations to add to the output VCF file. See the VariantAnnotator -list argument to view available annotations.

  • --comp / -comp List[RodBinding[VariantContext]] default: none

Comparison VCF file. If a call overlaps with a record from the provided comp track, the INFO field will be annotated as such in the output with the track name (e.g. -comp:FOO will have 'FOO' in the INFO field). Records that are filtered in the comp track will be ignored. Note that 'dbSNP' has been special-cased (see the --dbsnp argument). --comp binds reference ordered data. This argument supports ROD files of the following types: BCF2, VCF, VCF3

  • --computeSLOD / -slod boolean default: false

If provided, we will calculate the SLOD (SB annotation). Note that calculating the SLOD increases the runtime by an appreciable amount.

  • -contamination / --contamination_fraction_to_filter double default: 0.05

Fraction of contamination in sequencing data (for all samples) to aggressively remove. If this fraction is greater than zero, the caller will aggressively attempt to remove contamination through biased down-sampling of reads. Basically, it will ignore the contamination fraction of reads for each alternate allele. So if the pileup contains N total bases, then we will try to remove (N * contamination fraction) bases for each alternate allele.

  • -contaminationFile / --contamination_fraction_per_sample_file File

Tab-separated file containing the fraction of contamination in sequencing data (per sample) to aggressively remove. The format is one "<SampleID><TAB><Contamination>" pair per line (Contamination is a double), with no header. This argument specifies a file with two columns, "sample" and "contamination", giving the contamination level for those samples. Samples that do not appear in this file will be processed with CONTAMINATION_FRACTION.

  • --dbsnp / -D RodBinding[VariantContext] default: none

dbSNP file. rsIDs from this file are used to populate the ID column of the output. Also, the DB INFO flag will be set when appropriate. dbSNP is not used in any way for the calculations themselves. --dbsnp binds reference ordered data. This argument supports ROD files of the following types: BCF2, VCF, VCF3

  • --excludeAnnotation / -XA List[String] default: none

One or more specific annotations to exclude. Which annotations to exclude from output in the VCF file. Note that this argument has higher priority than the -A or -G arguments, so annotations will be excluded even if they are explicitly included with the other options.

  • --genotype_likelihoods_model / -glm Model default: SNP

Genotype likelihoods calculation model to employ — SNP is the default option, while INDEL is also available for calling indels and BOTH is available for calling both together.
The --genotype_likelihoods_model argument is an enumerated type (Model), which can have one of the following values:
SNP
INDEL
GENERALPLOIDYSNP
GENERALPLOIDYINDEL
BOTH

  • --genotyping_mode / -gt_mode GENOTYPING_MODE default: DISCOVERY

Specifies how to determine the alternate alleles to use for genotyping.
The --genotyping_mode argument is an enumerated type (GENOTYPING_MODE), which can have one of the following values:
DISCOVERY
the Unified Genotyper will choose the most likely alternate allele
GENOTYPE_GIVEN_ALLELES
only the alleles passed in from a VCF rod bound to the -alleles argument will be used for genotyping
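
A minimal sketch of genotyping a sample only at previously known sites (known.vcf is a placeholder for a VCF of known variants; EMIT_ALL_SITES is commonly paired with this mode, as noted in the --output_mode description below):

$ java -Xmx10g -jar $GATKHome/GenomeAnalysisTK.jar \
  -R genome.fasta -T UnifiedGenotyper \
  -I sample.realn.bam -o sample.gga.vcf \
  --genotyping_mode GENOTYPE_GIVEN_ALLELES \
  --alleles known.vcf \
  --output_mode EMIT_ALL_SITES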

  • --group / -G String[] default: [Standard]

One or more classes/groups of annotations to apply to variant calls. If specified, all available annotations in the group will be applied. See the VariantAnnotator -list argument to view available groups. Keep in mind that RODRequiringAnnotations are not intended to be used as a group, because they require specific ROD inputs.

  • --heterozygosity / -hets Double default: 0.0010

Heterozygosity value used to compute prior likelihoods for any locus. The expected heterozygosity value used to compute prior likelihoods for any locus. The default priors are: het = 1e-3, P(hom-ref genotype) = 1 - 3 * het / 2, P(het genotype) = het, P(hom-var genotype) = het / 2

  • --indel_heterozygosity / -indelHeterozygosity double default: 1.25E-4

Heterozygosity for indel calling. This argument informs the prior probability of having an indel at a site.

  • --indelGapContinuationPenalty / -indelGCP byte default: 10

Indel gap continuation penalty, as Phred-scaled probability. I.e., 30 => 10^-30/10.

  • --indelGapOpenPenalty / -indelGOP byte default: 45

Indel gap open penalty, as Phred-scaled probability. I.e., 30 => 10^-30/10.

  • --input_prior / -inputPrior List[Double] default: none

Input prior for calls. By default, the prior specified with the argument --heterozygosity/-hets is used for variant discovery at a particular locus, using an infinite sites model; see e.g. Waterson (1975) or Tajima (1996). This model asserts that the probability of having a population of k variant sites in N chromosomes is proportional to theta/k, for k = 1:N. There are instances where using this prior might not be desirable, e.g. for population studies where the prior might not be appropriate, for example when the ancestral status of the reference allele is not known. With this argument the user can manually specify the priors to be used for calling, as a vector of doubles, with the following restrictions: a) the user must specify 2N values, where N is the number of samples; b) only diploid calls are supported; c) probability values are specified in double format, in linear space; d) no negative values are allowed; e) the values will be added and Pr(AC=0) will be 1-sum, so that they sum up to one; f) if the user-defined values add up to more than one, an error will be produced. If the user wants completely flat priors, the same value (=1/(2*N+1)) should be specified 2*N times, e.g. -inputPrior 0.33 -inputPrior 0.33 for the single-sample diploid case.

  • --max_alternate_alleles / -maxAltAlleles int default: 6

Maximum number of alternate alleles to genotype. If there are more than this number of alternate alleles presented to the genotyper (either through discovery or GENOTYPE_GIVEN_ALLELES), then only this many alleles will be used. Note that genotyping sites with many alternate alleles is both CPU and memory intensive and it scales exponentially based on the number of alternate alleles. Unless there is a good reason to change the default value, we highly recommend that you not play around with this parameter. As of GATK 2.2 the genotyper can handle a very large number of events, so the default maximum has been increased to 6.

  • --max_deletion_fraction / -deletions Double default: 0.05

Maximum fraction of reads with deletions spanning this locus for it to be callable [to disable, set to < 0 or > 1; default:0.05].

  • --min_base_quality_score / -mbq int default: 17

Minimum base quality required to consider a base for calling. The minimum confidence needed in a given base for it to be used in variant calling. Note that the base quality of a base is capped by the mapping quality so that bases on reads with low mapping quality may get filtered out depending on this value. Note too that this argument is ignored in indel calling. In indel calling, low-quality ends of reads are clipped off (with fixed threshold of Q20).

  • -minIndelCnt / --min_indel_count_for_genotyping int default: 5

Minimum number of consensus indels required to trigger genotyping run. A candidate indel is genotyped (and potentially called) if there are this number of reads with a consensus indel at a site. Decreasing this value will increase sensitivity but at the cost of larger calling time and a larger number of false positives.

  • -minIndelFrac / --min_indel_fraction_per_sample double default: 0.25

Minimum fraction of all reads at a locus that must contain an indel (of any allele) for that sample to contribute to the indel count for alleles. Complementary argument to minIndelCnt. Only samples with at least this fraction of indel-containing reads will contribute to counting and overcoming the threshold minIndelCnt. This parameter ensures that in deep data you don't end up summing lots of super rare errors to overcome the 5-read default threshold. It should work equally well for low-coverage and high-coverage samples, as low-coverage samples with any indel-containing reads should easily overcome this threshold.

  • --out / -o VariantContextWriter default: stdout

File to which variants should be written. A raw, unfiltered, highly sensitive callset in VCF format.

  • --output_mode / -out_mode OUTPUT_MODE default: EMIT_VARIANTS_ONLY

Specifies which type of calls we should output.
The --output_mode argument is an enumerated type (OUTPUT_MODE), which can have one of the following values:
EMIT_VARIANTS_ONLY
produces calls only at variant sites
EMIT_ALL_CONFIDENT_SITES
produces calls at variant sites and confident reference sites
EMIT_ALL_SITES
produces calls at any callable site regardless of confidence; this argument is intended only for point mutations (SNPs) in DISCOVERY mode or generally when running in GENOTYPE_GIVEN_ALLELES mode; it will by no means produce a comprehensive set of indels in DISCOVERY mode

  • --pair_hmm_implementation / -pairHMM HMM_IMPLEMENTATION default: ORIGINAL

The PairHMM implementation to use for -glm INDEL genotype likelihood calculations. The various implementations balance a tradeoff of accuracy and runtime.
The --pair_hmm_implementation argument is an enumerated type (HMM_IMPLEMENTATION), which can have one of the following values:
EXACT
ORIGINAL
LOGLESS_CACHING

  • --pcr_error_rate / -pcr_error Double default: 1.0E-4

The PCR error rate to be used for computing fragment-based likelihoods. The PCR error rate is independent of the sequencing error rate, which is necessary because we cannot necessarily distinguish between PCR errors vs. sequencing errors. The practical implication for this value is that it effectively acts as a cap on the base qualities.

  • --sample_ploidy / -ploidy int default: 2

Ploidy (number of chromosomes) per sample. For pooled data, set to (Number of samples in each pool * Sample Ploidy).
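
For example, for a pool of 10 diploid individuals sequenced together, the ploidy would be 10 * 2 = 20. A hypothetical sketch (pool.bam and the output name are placeholders):

# call variants on a pool of 10 diploid individuals (ploidy = 10 * 2 = 20)
$ java -Xmx10g -jar $GATKHome/GenomeAnalysisTK.jar \
  -R genome.fasta -T UnifiedGenotyper \
  -I pool.bam -o pool.vcf \
  --sample_ploidy 20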

  • -stand_call_conf / --standard_min_confidence_threshold_for_calling double default: 30.0

The minimum phred-scaled confidence threshold at which variants should be called. The minimum phred-scaled Qscore threshold to separate high confidence from low confidence calls. Only genotypes with confidence >= this threshold are emitted as called sites. A reasonable threshold is 30 for high-pass calling (this is the default).

  • -stand_emit_conf / --standard_min_confidence_threshold_for_emitting double default: 30.0

The minimum phred-scaled confidence threshold at which variants should be emitted (and filtered with LowQual if less than the calling threshold). This argument allows you to emit low quality calls as filtered records.

The VCF format in detail

VCF stands for Variant Call Format. For a detailed description, see the GATK FAQ: How should I interpret VCF files produced by the GATK?

1. What is VCF?

VCF is a plain-text format used to describe SNP, INDEL and SV calls. It is best supported by GATK; SAMtools also produces VCF output, which differs slightly from GATK's VCF.

2. The overall structure of a VCF file

Here is an example VCF file:

##fileformat=VCFv4.0
##FILTER=<ID=LowQual,Description="QUAL < 50.0">
##FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth (only filtered reads used for calling)">
##FORMAT=<ID=GQ,Number=1,Type=Float,Description="Genotype Quality">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=PL,Number=3,Type=Float,Description="Normalized, Phred-scaled likelihoods for AA,AB,BB genotypes where A=ref and B=alt; not applicable if site is not biallelic">
##INFO=<ID=AC,Number=.,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">
##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency, for each ALT allele, in the same order as listed">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP Membership">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=DS,Number=0,Type=Flag,Description="Were any of the samples downsampled?">
##INFO=<ID=Dels,Number=1,Type=Float,Description="Fraction of Reads Containing Spanning Deletions">
##INFO=<ID=HRun,Number=1,Type=Integer,Description="Largest Contiguous Homopolymer Run of Variant Allele In Either Direction">
##INFO=<ID=HaplotypeScore,Number=1,Type=Float,Description="Consistency of the site with two (and only two) segregating haplotypes">
##INFO=<ID=MQ,Number=1,Type=Float,Description="RMS Mapping Quality">
##INFO=<ID=MQ0,Number=1,Type=Integer,Description="Total Mapping Quality Zero Reads">
##INFO=<ID=QD,Number=1,Type=Float,Description="Variant Confidence/Quality by Depth">
##INFO=<ID=SB,Number=1,Type=Float,Description="Strand Bias">
##INFO=<ID=VQSLOD,Number=1,Type=Float,Description="log10-scaled probability of variant being true under the trained gaussian mixture model">
##UnifiedGenotyperV2="analysis_type=UnifiedGenotyperV2 input_file=[TEXT CLIPPED FOR CLARITY]"
#CHROM  POS ID      REF ALT QUAL    FILTER  INFO    FORMAT  NA12878
chr1    873762  .       T   G   5231.78 PASS    AC=1;AF=0.50;AN=2;DP=315;Dels=0.00;HRun=2;HaplotypeScore=15.11;MQ=91.05;MQ0=15;QD=16.61;SB=-1533.02;VQSLOD=-1.5473 GT:AD:DP:GQ:PL   0/1:173,141:282:99:255,0,255
chr1    877664  rs3828047   A   G   3931.66 PASS    AC=2;AF=1.00;AN=2;DB;DP=105;Dels=0.00;HRun=1;HaplotypeScore=1.59;MQ=92.52;MQ0=4;QD=37.44;SB=-1152.13;VQSLOD= 0.1185 GT:AD:DP:GQ:PL  1/1:0,105:94:99:255,255,0
chr1    899282  rs28548431  C   T   71.77   PASS    AC=1;AF=0.50;AN=2;DB;DP=4;Dels=0.00;HRun=0;HaplotypeScore=0.00;MQ=99.00;MQ0=0;QD=17.94;SB=-46.55;VQSLOD=-1.9148 GT:AD:DP:GQ:PL  0/1:1,3:4:25.92:103,0,26
chr1    974165  rs9442391   T   C   29.84   LowQual AC=1;AF=0.50;AN=2;DB;DP=18;Dels=0.00;HRun=1;HaplotypeScore=0.16;MQ=95.26;MQ0=0;QD=1.66;SB=-0.98 GT:AD:DP:GQ:PL  0/1:14,4:14:60.91:61,0,255

As the example shows, a VCF file has two parts: a header made of lines beginning with "#", and a body made of lines that do not begin with "#".

Note that the header contains a lot of information describing the VCF itself; in fact, you do not really need this article, since reading the header alone is enough to understand what every row and column means. Let us start with the structure of the body of the VCF file, shown below:

[HEADER LINES]
#CHROM  POS ID      REF ALT QUAL    FILTER  INFO          FORMAT          NA12878
chr1    873762  .       T   G   5231.78 PASS    [ANNOTATIONS] GT:AD:DP:GQ:PL  0/1:173,141:282:99:255,0,255
chr1    877664  rs3828047   A   G   3931.66 PASS    [ANNOTATIONS] GT:AD:DP:GQ:PL  1/1:0,105:94:99:255,255,0
chr1    899282  rs28548431  C   T   71.77   PASS    [ANNOTATIONS] GT:AD:DP:GQ:PL  0/1:1,3:4:25.92:103,0,26
chr1    974165  rs9442391   T   C   29.84   LowQual [ANNOTATIONS] GT:AD:DP:GQ:PL  0/1:14,4:14:60.91:61,0,255

Above, the header comment lines have been removed, leaving only the column header line that explains what each field means. Each line of the body describes one variant.

3. How to interpret a variant record

CHROM and POS: the reference sequence name and the position of the variant; for an INDEL, the position is that of the first base of the INDEL.

ID: the variant's ID. If the SNP has an ID in dbSNP, for example, it is given here; otherwise '.' indicates a novel variant.

REF and ALT: the reference base(s) and the variant base(s).

QUAL: a Phred-scaled quality value giving the probability that a variant exists at this site; the higher the value, the more likely the variant is real. It is computed as Phred = -10 * log(1 - p), where p is the probability that the variant exists; so a value of 10 corresponds to an error probability of 0.1, i.e. a 90% probability that the site is a variant.

FILTER: filtering on the QUAL value alone is not sufficient, so GATK can apply additional filters. If the variant passes them, this field is "PASS"; if the variant is not reliable, the field holds a value other than "PASS", or ".".

INFO: detailed annotations for the variant; there is a lot here, and it is described in detail further below.

FORMAT and NA12878: together these two columns give the genotype information for the sample "NA12878". "NA12878" is the sample name, determined by the SM tag of the @RG line in the BAM file.

4. Genotype information

chr1    873762  .       T   G   [CLIPPED] GT:AD:DP:GQ:PL    0/1:173,141:282:99:255,0,255
chr1    877664  rs3828047   A   G   [CLIPPED] GT:AD:DP:GQ:PL    1/1:0,105:94:99:255,255,0
chr1    899282  rs28548431  C   T   [CLIPPED] GT:AD:DP:GQ:PL    0/1:1,3:4:25.92:103,0,26

Look at the last two columns above: they correspond to each other, the former giving the format and the latter the values in that format.

GT: the sample's genotype. Two numbers separated by '/' give the genotype of a diploid sample. 0 means the sample carries the REF allele; 1 means it carries the variant (ALT) allele; 2 denotes a second variant allele. Thus 0/0 means the site is homozygous and identical to the reference; 0/1 means it is heterozygous, with one reference and one variant allele; 1/1 means it is homozygous for the variant.

AD and DP: AD (Allele Depth) is the read depth supporting each allele in the sample; for a diploid it is two comma-separated values, the first for the REF allele and the second for the variant allele. DP (Depth) is the read depth at the site for the sample.

GQ: the genotype quality, a Phred-scaled value giving the probability that this genotype is present at the site; the higher the value, the more likely the genotype. It is computed as Phred = -10 * log(1 - p), where p is the probability of the genotype.

PL: Phred-scaled likelihoods of the three specified genotypes (0/0, 0/1, 1/1), whose probabilities sum to 1. Unlike the values above, the larger the value, the less likely that genotype is: Phred = -10 * log(p), where p is the probability of the genotype.
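
For example, in the record at chr1:899282 above, the sample field 0/1:1,3:4:25.92:103,0,26 has PL = 103,0,26: the value for 0/1 is 0 (the best), 26 for 1/1 and 103 for 0/0, which is consistent with the genotype being called as 0/1.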

5. The INFO field (column 8 of the VCF)

This column carries the most information, all in the form of "TAG=Value" pairs separated by ";". Many of these annotations are already explained in the header of the VCF file. The tags are explained below.

AC, AF and AN: AC (Allele Count) is the number of variant alleles called; AF (Allele Frequency) is the variant allele frequency; AN (Allele Number) is the total number of alleles. For a single diploid sample: genotype 0/1 means the sample is heterozygous, with AC = 1 (only one of the diploid sample's two alleles at this site carries the variant), AF = 0.5 (50% of the alleles at this site are variant) and AN = 2; genotype 1/1 means the sample is homozygous for the variant, with AC = 2, AF = 1 and AN = 2.

DP: the read depth at the site, counted after some reads have been filtered out.

Dels: Fraction of Reads Containing Spanning Deletions. In output from joint SNP and INDEL calling, a record that carries this tag with a value of 0 is a SNP; a record without it is an INDEL.

FS: the Phred-scaled p-value from Fisher's exact test for strand bias. The smaller the value, the better; when filtering, a threshold of FS < 10 to 20 is commonly used.

HaplotypeScore:Consistency of the site with at most two segregating haplotypes

InbreedingCoeff:Inbreeding coefficient as estimated from the genotype likelihoods per-sample when compared against the Hardy-Weinberg expectation

MLEAC:Maximum likelihood expectation (MLE) for the allele counts (not necessarily the same as the AC), for each ALT allele, in the same order as listed

MLEAF:Maximum likelihood expectation (MLE) for the allele frequency (not necessarily the same as the AF), for each ALT allele, in the same order as listed

MQ:RMS Mapping Quality

MQ0:Total Mapping Quality Zero Reads

MQRankSum:Z-score From Wilcoxon rank sum test of Alt vs. Ref read mapping qualities

QD:Variant Confidence/Quality by Depth

RPA:Number of times tandem repeat unit is repeated, for each allele (including reference)

RU:Tandem repeat unit (bases)

ReadPosRankSum:Z-score from Wilcoxon rank sum test of Alt vs. Ref read position bias

STR:Variant is a short tandem repeat