Thursday 18 September 2014

Whome Exome Sequencing


PREPROCESS:
  • Index human genome (Picard), we used HG18 from UCSC.
  • Convert Illumina reads to Fastq format
  • Convert Illumina 1.6 read quality scores to standard Sanger scores
FOR EACH SAMPLE:
  1. Align samples to genome (BWA), generates SAI files.
  2. Convert SAI to SAM (BWA)
  3. Convert SAM to BAM binary format (SAM Tools)
  4. Sort BAM (SAM Tools)
  5. Index BAM (SAM Tools)
  6. Identify target regions for realignment (Genome Analysis Toolkit)
  7. Realign BAM to get better Indel calling (Genome Analysis Toolkit)
  8. Reindex the realigned BAM (SAM Tools)
  9. Call Indels (Genome Analysis Toolkit)
  10. Call SNPs (Genome Analysis Toolkit)
  11. View aligned reads in BAM/BAI (Integrated Genome Viewer)
SAMPLE SCRIPT
/bin/BWA/bwa aln -f /output/FOO.sai -t 3 /seq/REFERENCE/human_18.fasta /seq/FQ/FOO.sanger.fq
/bin/BWA/bwa samse -f /output/FOO.sam /seq/REFERENCE/human_18.fasta /output/FOO.sai /seq/FQ/FOO.sanger.fq
/bin/SAMTOOLS/samtools import /seq/REFERENCE/human_18.fasta /output/FOO.sam /output/FOO.bam
/bin/SAMTOOLS/samtools sort /output/FOO.bam /output/FOO.sorted
/bin/SAMTOOLS/samtools index /output/FOO.sorted.bam /output/FOO.sorted.bam.bai
java -jar /bin/GTK/GenomeAnalysisTK.jar -T RealignerTargetCreator -R /seq/REFERENCE/human_18.fasta -I /output/FOO.sorted.bam  -o /output/FOO.intervals
java -jar /bin/GTK/GenomeAnalysisTK.jar -T IndelRealigner -R /seq/REFERENCE/human_18.fasta -I /output/FOO.sorted.bam -targetIntervals /output/FOO.intervals --output /output/FOO.sorted.realigned.bam
/bin/SAMTOOLS/samtools index /output/FOO.sorted.realigned.bam /output/FOO.sorted.realigned.bam.bai
java -jar /bin/GTK/GenomeAnalysisTK.jar -T IndelGenotyperV2 -R /seq/REFERENCE/human_18.fasta -I /output/FOO.sorted.realigned.bam -O /output/FOO_indel.txt --verbose -o /output/FOO_indel_statistics.txt
java -jar /bin/GTK/GenomeAnalysisTK.jar -T UnifiedGenotyper -R /seq/REFERENCE/human_18.fasta -I /output/FOO.sorted.realigned.bam -varout /output/FOO.geli.calls -vf GELI -stand_call_conf 30.0 -stand_emit_conf 10.0 -pl SOLEXA 
 
 Between step 7 and 8, it is recommended to add an additional step:
samtools calmd -Abr FOO.sorted.bam human_18.fasta > FOO.baq.bam
This will substantially improve SNP specificity.
A few other comments to David Quigley's pipeline are:
  1. It is recommended to use Dindel for indel calling. It is the few indel callers that properly model indels (most not, including IndelGenotyperV2). Broad is reimplementing the Dindel model in GATK. The latest samtools also models indels properly, but it is not evaluated as widely as Dindel.
  2. For the human genome build 36, it is recommended to use the version used by the 1000 genomes project. You will lose genes in the pseudoautosomal regions if you use hg18. Hg19/b37 is fine.
  3. The samtools and GATK SNP callers are converging in algorithms and in results. GATK is arguably better than MAQ, which has been confirmed by several independent studies.
 

No comments:

Post a Comment

Note: only a member of this blog may post a comment.