Hard filtering has been very important for variant calling.
Apply hard filters to a variant callset that is too small for VQSR or for which truth/training sets are not available.
Prerequisites
- TBD
Steps
- Extract the SNPs from the call set
- Determine parameters for filtering SNPs
- Apply the filter to the SNP call set
- Extract the Indels from the call set
- Determine parameters for filtering indels
- Apply the filter to the Indel call set
1. Extract the SNPs from the call set
Action
Run the following GATK command:
java -jar GenomeAnalysisTK.jar \
-T SelectVariants \
-R reference.fa \
-V raw_variants.vcf \
-L 20 \
-selectType SNP \
-o raw_snps.vcf
Expected Result
This creates a VCF file called raw_snps.vcf, containing just the SNPs from the original file of raw variants.
2. Determine parameters for filtering SNPs
SNPs matching any of these conditions will be considered bad and filtered out, i.e. marked FILTER in the output VCF file. The program will specify which parameter was chiefly responsible for the exclusion of the SNP using the culprit annotation. SNPs that do not match any of these conditions will be considered good and marked PASS in the output VCF file.
- QualByDepth (QD) 2.0
This is the variant confidence (from the QUAL field) divided by the unfiltered depth of non-reference samples.
- FisherStrand (FS) 60.0
Phred-scaled p-value using Fisher’s Exact Test to detect strand bias (the variation being seen on only the forward or only the reverse strand) in the reads. More bias is indicative of false positive calls.
- RMSMappingQuality (MQ) 40.0
This is the Root Mean Square of the mapping quality of the reads across all samples.
- HaplotypeScore 13.0
This is the consistency of the site with two (and only two) segregating haplotypes. Note that this is not applicable for calls made using the UnifiedGenotyper on non-diploid organisms.
- MappingQualityRankSumTest (MQRankSum) 12.5
This is the u-based z-approximation from the Mann-Whitney Rank Sum Test for mapping qualities (reads with ref bases vs. those with the alternate allele). Note that the mapping quality rank sum test cannot be calculated for sites without a mixture of reads showing both the reference and alternate alleles, i.e. this will only be applied to heterozygous calls.
- ReadPosRankSumTest (ReadPosRankSum) 8.0
This is the u-based z-approximation from the Mann-Whitney Rank Sum Test for the distance from the end of the read for reads with the alternate allele. If the alternate allele is only seen near the ends of reads, this is indicative of error. Note that the read position rank sum test cannot be calculated for sites without a mixture of reads showing both the reference and alternate alleles, i.e. this will only be applied to heterozygous calls.
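To make the "matching any of these conditions" logic concrete, here is a minimal shell sketch that reads the INFO column of a VCF and reports the first threshold each SNP record fails. The function name flag_snps is hypothetical, and skipping annotations that are absent from a record is an assumption of this sketch rather than a statement about how GATK handles missing values. It is only a way to preview the thresholds, not a replacement for VariantFiltration.

```shell
# Hypothetical helper (not part of GATK): print CHROM, POS and the first
# failed threshold for each SNP record, using the thresholds listed above.
# Assumption: an annotation missing from a record is simply not checked.
flag_snps() {
  awk -F'\t' '
    /^#/ { next }                      # skip header lines
    {
      split("", a)                     # clear the annotation map
      n = split($8, kv, ";")           # INFO is the 8th column
      for (i = 1; i <= n; i++) {
        eq = index(kv[i], "=")
        if (eq > 0) a[substr(kv[i], 1, eq - 1)] = substr(kv[i], eq + 1)
      }
      reason = ""
      if (("QD" in a) && a["QD"] + 0 < 2.0) reason = "QD"
      else if (("FS" in a) && a["FS"] + 0 > 60.0) reason = "FS"
      else if (("MQ" in a) && a["MQ"] + 0 < 40.0) reason = "MQ"
      else if (("HaplotypeScore" in a) && a["HaplotypeScore"] + 0 > 13.0) reason = "HaplotypeScore"
      else if (("MQRankSum" in a) && a["MQRankSum"] + 0 < -12.5) reason = "MQRankSum"
      else if (("ReadPosRankSum" in a) && a["ReadPosRankSum"] + 0 < -8.0) reason = "ReadPosRankSum"
      if (reason != "") print $1 "\t" $2 "\t" reason
    }
  ' "$1"
}
```

Calling flag_snps raw_snps.vcf lists one line per SNP that would fail, which can help judge how aggressive the thresholds are before applying them.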
3. Apply the filter to the SNP call set
Action
Run the following GATK command:
java -jar GenomeAnalysisTK.jar \
-T VariantFiltration \
-R reference.fa \
-V raw_snps.vcf \
--filterExpression "QD < 2.0 || FS > 60.0 || MQ < 40.0 || HaplotypeScore > 13.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0" \
--filterName "my_snp_filter" \
-o filtered_snps.vcf
Expected Result
This creates a VCF file called filtered_snps.vcf, containing all the original SNPs from the raw_snps.vcf file, but now the SNPs are annotated with either PASS or FILTER depending on whether or not they passed the filters.
For SNPs that failed the filter, the variant annotation also includes the name of the filter. That way, if you apply several different filters (simultaneously or sequentially), you can keep track of which filter(s) each SNP failed, and later you can retrieve specific subsets of your calls using the SelectVariants tool. To learn more about composing different types of filtering expressions and retrieving subsets of variants using SelectVariants, please see the online GATK documentation.
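A quick sanity check after this step is to tally the FILTER column of the output. The sketch below assumes that passing records carry PASS and failing records carry the filter name (my_snp_filter here) in the seventh column; count_filters is a hypothetical helper, not a GATK tool.

```shell
# Hypothetical helper: count records per FILTER value (7th VCF column).
count_filters() {
  awk -F'\t' '
    !/^#/ { n[$7]++ }                          # tally non-header records
    END   { for (f in n) print f "\t" n[f] }
  ' "$1" | LC_ALL=C sort                       # fixed locale for stable order
}
```

Running count_filters filtered_snps.vcf prints one line per distinct FILTER value with its record count, which shows at a glance how many SNPs the filter removed.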
4. Extract the Indels from the call set
Action
Run the following GATK command:
java -jar GenomeAnalysisTK.jar \
-T SelectVariants \
-R reference.fa \
-V raw_variants.vcf \
-L 20 \
-selectType INDEL \
-o raw_indels.vcf
Expected Result
This creates a VCF file called raw_indels.vcf, containing just the Indels from the original file of raw variants.
5. Determine parameters for filtering Indels
Indels matching any of these conditions will be considered bad and filtered out, i.e. marked FILTER in the output VCF file. The program will specify which parameter was chiefly responsible for the exclusion of the indel using the culprit annotation. Indels that do not match any of these conditions will be considered good and marked PASS in the output VCF file.
- QualByDepth (QD) 2.0
This is the variant confidence (from the QUAL field) divided by the unfiltered depth of non-reference samples.
- FisherStrand (FS) 200.0
Phred-scaled p-value using Fisher’s Exact Test to detect strand bias (the variation being seen on only the forward or only the reverse strand) in the reads. More bias is indicative of false positive calls.
- ReadPosRankSumTest (ReadPosRankSum) 20.0
This is the u-based z-approximation from the Mann-Whitney Rank Sum Test for the distance from the end of the read for reads with the alternate allele. If the alternate allele is only seen near the ends of reads, this is indicative of error. Note that the read position rank sum test cannot be calculated for sites without a mixture of reads showing both the reference and alternate alleles, i.e. this will only be applied to heterozygous calls.
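When deciding whether these thresholds suit a particular callset, it can help to look at the values an annotation actually takes. The following sketch pulls one INFO annotation out of a VCF and sorts the values numerically; annot_values is a hypothetical helper, and records lacking the annotation are silently skipped.

```shell
# Hypothetical helper: print the values of one INFO annotation (e.g. FS),
# one per line, sorted numerically. Records without the key are skipped.
annot_values() {
  awk -F'\t' -v key="$2" '
    /^#/ { next }
    {
      n = split($8, kv, ";")           # INFO is the 8th column
      for (i = 1; i <= n; i++) {
        eq = index(kv[i], "=")
        if (eq > 0 && substr(kv[i], 1, eq - 1) == key)
          print substr(kv[i], eq + 1)
      }
    }
  ' "$1" | LC_ALL=C sort -n
}
```

For example, annot_values raw_indels.vcf FS | tail shows the largest FisherStrand values, i.e. the records closest to or beyond the 200.0 cutoff.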
6. Apply the filter to the Indel call set
Action
Run the following GATK command:
java -jar GenomeAnalysisTK.jar \
-T VariantFiltration \
-R reference.fa \
-V raw_indels.vcf \
--filterExpression "QD < 2.0 || FS > 200.0 || ReadPosRankSum < -20.0" \
--filterName "my_indel_filter" \
-o filtered_indels.vcf
Expected Result
This creates a VCF file called filtered_indels.vcf, containing all the original Indels from the raw_indels.vcf file, but now the Indels are annotated with either PASS or FILTER depending on whether or not they passed the filters.
For Indels that failed the filter, the variant annotation also includes the name of the filter. That way, if you apply several different filters (simultaneously or sequentially), you can keep track of which filter(s) each Indel failed, and later you can retrieve specific subsets of your calls using the SelectVariants tool. To learn more about composing different types of filtering expressions and retrieving subsets of variants using SelectVariants, please see the online GATK documentation.
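If only the passing calls are needed downstream, a plain-text alternative to SelectVariants is to keep the header lines plus the records whose FILTER column is PASS. This is a minimal sketch with a hypothetical function name, keep_pass, and it assumes an uncompressed VCF.

```shell
# Hypothetical helper: keep header lines and records whose FILTER
# column (7th field) is exactly PASS. Assumes an uncompressed VCF.
keep_pass() {
  awk -F'\t' '/^#/ || $7 == "PASS"' "$1"
}
```

A typical use would be keep_pass filtered_indels.vcf > pass_indels.vcf, where pass_indels.vcf is an illustrative output name.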
- As indicated in the document on hard filtering, we don't normally recommend filtering on depth of coverage (see our default recommendations in the document linked above), but if you do, you should keep those caveats in mind.