Wednesday, 11 February 2015

Differences between WES and WGS is limiting the HC to an interval

Limiting your calls with HC to an interval, e.g.

-L a_bed_file_containing_intervals_of_the_exome.bed

exome captures are not 100% efficient. so i would guess that you are getting calls in regions of non-specifc hybridization as well as regions that are slightly outside the exome (due to probe design and/or sequencing chemistry/strategy). 

Apply hard filters to a call set

The following information comes from:

This has been very important for variant calling.

Apply hard filters to a variant callset that is too small for VQSR or for which truth/training sets are not available.

Prerequisites

  • TBD

Steps

  1. Extract the SNPs from the call set
  2. Determine parameters for filtering SNPs
  3. Apply the filter to the SNP call set
  4. Extract the Indels from the call set
  5. Determine parameters for filtering indels
  6. Apply the filter to the Indel call set

1. Extract the SNPs from the call set

Action

Run the following GATK command:
java -jar GenomeAnalysisTK.jar \ 
    -T SelectVariants \ 
    -R reference.fa \ 
    -V raw_variants.vcf \ 
    -L 20 \ 
    -selectType SNP \ 
    -o raw_snps.vcf 

Expected Result

This creates a VCF file called raw_snps.vcf, containing just the SNPs from the original file of raw variants.

2. Determine parameters for filtering SNPs

SNPs matching any of these conditions will be considered bad and filtered out, i.e.marked FILTER in the output VCF file. The program will specify which parameter was chiefly responsible for the exclusion of the SNP using the culprit annotation. SNPs that do not match any of these conditions will be considered good and marked PASS in the output VCF file.
  • QualByDepth (QD) 2.0
This is the variant confidence (from the QUAL field) divided by the unfiltered depth of non-reference samples.
  • FisherStrand (FS) 60.0
Phred-scaled p-value using Fisher’s Exact Test to detect strand bias (the variation being seen on only the forward or only the reverse strand) in the reads. More bias is indicative of false positive calls.
  • RMSMappingQuality (MQ) 40.0
This is the Root Mean Square of the mapping quality of the reads across all samples.
  • HaplotypeScore 13.0
This is the consistency of the site with two (and only two) segregating haplotypes. Note that this is not applicable for calls made using the UnifiedGenotyper on non-diploid organisms.
  • MappingQualityRankSumTest (MQRankSum) 12.5
This is the u-based z-approximation from the Mann-Whitney Rank Sum Test for mapping qualities (reads with ref bases vs. those with the alternate allele). Note that the mapping quality rank sum test can not be calculated for sites without a mixture of reads showing both the reference and alternate alleles, i.e. this will only be applied to heterozygous calls.
  • ReadPosRankSumTest (ReadPosRankSum) 8.0
This is the u-based z-approximation from the Mann-Whitney Rank Sum Test for the distance from the end of the read for reads with the alternate allele. If the alternate allele is only seen near the ends of reads, this is indicative of error. Note that the read position rank sum test can not be calculated for sites without a mixture of reads showing both the reference and alternate alleles, i.e. this will only be applied to heterozygous calls.

3. Apply the filter to the SNP call set

Action

Run the following GATK command:
java -jar GenomeAnalysisTK.jar \ 
    -T VariantFiltration \ 
    -R reference.fa \ 
    -V raw_snps.vcf \ 
    --filterExpression "QD < 2.0 || FS > 60.0 || MQ < 40.0 || HaplotypeScore > 13.0 || MappingQualityRankSum < -12.5 || ReadPosRankSum < -8.0" \ 
    --filterName "my_snp_filter" \ 
    -o filtered_snps.vcf 

Expected Result

This creates a VCF file called filtered_snps.vcf, containing all the original SNPs from the raw_snps.vcf file, but now the SNPs are annotated with either PASS or FILTERdepending on whether or not they passed the filters.
For SNPs that failed the filter, the variant annotation also includes the name of the filter. That way, if you apply several different filters (simultaneously or sequentially), you can keep track of which filter(s) each SNP failed, and later you can retrieve specific subsets of your calls using the SelectVariants tool. To learn more about composing different types of filtering expressions and retrieving subsets of variants using SelectVariants, please see the online GATK documentation.

4. Extract the Indels from the call set

Action

Run the following GATK command:
java -jar GenomeAnalysisTK.jar \ 
    -T SelectVariants \ 
    -R reference.fa \ 
    -V raw_HC_variants.vcf \ 
    -L 20 \ 
    -selectType INDEL \ 
    -o raw_indels.vcf 

Expected Result

This creates a VCF file called raw_indels.vcf, containing just the Indels from the original file of raw variants.

5. Determine parameters for filtering Indels.

Indels matching any of these conditions will be considered bad and filtered out, i.e.marked FILTER in the output VCF file. The program will specify which parameter was chiefly responsible for the exclusion of the indel using the culprit annotation. Indels that do not match any of these conditions will be considered good and marked PASS in the output VCF file.
  • QualByDepth (QD) 2.0
This is the variant confidence (from the QUAL field) divided by the unfiltered depth of non-reference samples.
  • FisherStrand (FS) 200.0
Phred-scaled p-value using Fisher’s Exact Test to detect strand bias (the variation being seen on only the forward or only the reverse strand) in the reads. More bias is indicative of false positive calls.
  • ReadPosRankSumTest (ReadPosRankSum) 20.0
This is the u-based z-approximation from the Mann-Whitney Rank Sum Test for the distance from the end of the read for reads with the alternate allele. If the alternate allele is only seen near the ends of reads, this is indicative of error. Note that the read position rank sum test can not be calculated for sites without a mixture of reads showing both the reference and alternate alleles, i.e. this will only be applied to heterozygous calls.

6. Apply the filter to the Indel call set

Action

Run the following GATK command:
java -jar GenomeAnalysisTK.jar \ 
    -T VariantFiltration \ 
    -R reference.fa \ 
    -V raw_indels.vcf \ 
    --filterExpression "QD < 2.0 || FS > 200.0 || ReadPosRankSum < -20.0" \ 
    --filterName "my_indel_filter" \ 
    -o filtered_indels.vcf 

Expected Result

This creates a VCF file called filtered_indels.vcf, containing all the original Indels from the raw_indels.vcf file, but now the Indels are annotated with either PASS or FILTER depending on whether or not they passed the filters.
For Indels that failed the filter, the variant annotation also includes the name of the filter. That way, if you apply several different filters (simultaneously or sequentially), you can keep track of which filter(s) each Indel failed, and later you can retrieve specific subsets of your calls using the SelectVariants tool. To learn more about composing different types of filtering expressions and retrieving subsets of variants using SelectVariants, please see the online GATK documentation.

  • As indicated in the document on hard filtering. We don't normally recommend filtering on depth of coverage (see our default recommendations in the document linked above), but if you do you should keep those caveats in mind.

Tuesday, 3 February 2015

我会为了你成就一番了不起的事业!

宝宝谢谢你对我的鼓励和信任.
我会为了你成就一番了不起的事业!

Sunday, 1 February 2015

美国投资者不好惹 阿里大麻烦在后面

美国的股市不比中国,美国的证监会也不是中国的证监会,美国的投资者当然更不是打掉牙往肚子咽的中国的投资者,更何况还有那么多无理搅三分的律师,那么多如秃鹫般的做空机构;所以,阿里的麻烦还在后面,不是一般的麻烦而是相当大的麻烦,水能载舟,亦能覆舟。
关于阿里巴巴和国家工商总局的撕逼大战,水皮想问的只有一句话,那就是,你们把投资者放在了哪里?或者说,你们心目中,我们投资者是什么东西?
有没有人回答都不重要。
市场会作出回答,那就是暴跌!
4.36%,这是阿里巴巴在美国纽交所挂牌以来的第二大跌幅,1月28日当天,阿里跳空低开整整2美元,收盘下跌4.49美元,不但再次跌破100美元大关,而且蒸发的市值高达111亿美元;与此同时,170多家在美挂牌的中概股有将近100家下跌,阿里巴巴会不会因此成为新一轮做空中概股的契机成为大家最担心的话题。
这不是一个小麻烦,对于工商总局的网监司司长刘红亮来讲是这样;这是一个大麻烦,对于阿里巴巴董事局主席马云来讲是这样。
问题不在于假货,尽管99%的讨论都会盯住这个问题不放,问题在于工商总局网监局在《中国工商报》,注意是《中国工商报》,而不是工商总局自己的网站上挂出的那份白皮书,这份东西能不能用白皮书的名义不重要,重要的是这份会议纪要显示在2014年7月有这么一个行政指导会议的存在,工商总局网监司联合北京、江苏、山东、广东、福建工商局网监机构负责人组成行政指导工作小组是奉总局之命,目的是促使阿里巴巴正视和解决阿里网络交易平台长期大量存在的违法经营问题。7月16日,这个行政指导小组联合浙江工商局和杭州工商局在浙的工商局召集座谈会,阿里相关高管到会接受行政指导。据白皮书称,为了不影响阿里的上市进程,座谈会以内部封闭形式举行,鉴于目前监管情势,为廊清种种认知,才公布真相。
2014年7月的这么一个重要的会议真相在半年之后才披露,而且是因为阿里店小二任意逼宫才无可奈何,不管原因是恩将仇报还是自相残杀,你让我们这些投资者情何以堪?工商局你是国家利益的看门狗,你有什么义务为一个外资公司的上市保驾护航?查处阿里平台的假货天经地义的事为什么搞得偷偷摸摸见不得人?难道真如工商小编所称“你们和阿里是队友”?监管不对消费者公开,监管的目的意义何在?阿里店小二可以质疑到司长情绪执法,我们是不是可以质疑工商选择性执法养痈遗患?这份迟到的白皮书算不算重大信息,如果算,现在才公布是不是构成隐瞒?如果不是,那么为什么现在公布且在公布之后又作大量删改替换?阿里是不是认为类似问题习以为常不需向投资者交待,这种闭门会议的合法性经得起推敲吗?如此,阿里的上市是否涉嫌欺诈呢?这恐怕才是事件的要害!
无独有偶的是,不早不晚,阿里的四季度财报公布,虽然营收有 40%的增长,但是利润则同比下滑28%,这样的财报对于已经被吊足胃口的华尔街来讲是个晴天霹雳,阿里的股价毫无悬念地再次暴跌,这一次的跌幅是创纪录的8.64%,盘中最大的跌幅一度达到13.66%,收盘价已经在90美元之后,再次跌破9月22日挂牌时的开盘价,而当天道琼斯大盘则上涨了1.3%,阿里的市值在一天之中又蒸发了200亿美元,而如果由最高的125美元计则更是达到惊人的600多亿美元。这些数字,在首富看来不过是纸上财富,涨涨跌跌、大大小小无关痛痒,但是对于在市场上追涨杀跌的投资者而言可能就是身家性命,搞好了一夜暴富,搞不好倾家荡产,所以,不可能不认真,不可能不较真,不可能让你伤害了我还一笑而过,或者阿里,或者工商,总有一个要成为被告,成为索赔的对象,成为撕逼大战的买单者。