Exome sequencing is performed on all HipSci iPS cell lines that are selected for banking after passing QC. Sequencing and primary analysis are performed at the Wellcome Trust Sanger Institute.
HipSci’s exome-seq analysis pipeline comprises these steps:
- Map sequence reads to the human GRCh37 reference with BWA
- Post-alignment improvements:
  - Realign around known indels using GATK
  - Recalibrate base quality scores using GATK
- Merge sequencing runs from the same cell line
- Calculate bq score with samtools
- Call variants using samtools mpileup and bcftools
- Phase haplotypes using SHAPEIT
- Impute genotypes using IMPUTE2
The following filters are applied to called variants. A variant is filtered out if it meets any of these conditions:
- minimum read depth: DP<=4
- maximum read depth: DP>2000
- minimum mapping quality: MQ<=25
- minimum quality for SNPs: QUAL<=30
- minimum quality for indels: QUAL<=60
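The thresholds above can be sketched as a small check. This is only an illustration, not HipSci's actual filtering code: it assumes each listed condition describes variants that are excluded, and the `Variant` class, its field names, and the filter labels are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Variant:
    """Hypothetical container for the values the filters use."""
    dp: int          # read depth (DP)
    mq: float        # mapping quality (MQ)
    qual: float      # variant quality (QUAL)
    is_indel: bool   # True for indels, False for SNPs


def failed_filters(v: Variant) -> List[str]:
    """Return labels of the filters a variant fails (empty list = passes)."""
    failed = []
    if v.dp <= 4:
        failed.append("min_depth")
    if v.dp > 2000:
        failed.append("max_depth")
    if v.mq <= 25:
        failed.append("min_mapping_quality")
    # SNPs and indels use different quality thresholds (30 and 60).
    if v.is_indel:
        if v.qual <= 60:
            failed.append("min_indel_quality")
    elif v.qual <= 30:
        failed.append("min_snp_quality")
    return failed
```

Under this reading, a call with QUAL 45 passes as a SNP but fails as an indel, because the indel threshold is stricter.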
Getting the data
- Raw sequencing reads – Distributed in the CRAM file format. A cell line can have multiple associated CRAM files; each corresponds to a single lane of sequencing.
- BWA alignment – Distributed in the BAM file format. These are the files produced by the post-alignment improvements and used as input for variant calling. We distribute one BAM file per cell line.
- Mpileup variant calls – Distributed in the VCF file format. These are the genotypes called directly from the aligned sequence, before phasing or imputation. We distribute a single-sample VCF file for each cell line.
- Imputed and phased genotypes – Distributed in the VCF file format. These contain the output from SHAPEIT and IMPUTE2. We distribute a single-sample VCF file for each cell line, containing genotypes imputed to the 1000 Genomes and UK10K reference panels.
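Because the VCF files are single-sample, a quick sanity check after download is to read the sample column from the header. The sketch below relies only on the standard VCF layout, in which the `#CHROM` header line has nine fixed columns followed by one column per sample. The function name is hypothetical.

```python
def vcf_sample_names(lines):
    """Return the sample names from the #CHROM header line of a VCF."""
    for line in lines:
        if line.startswith("#CHROM"):
            columns = line.rstrip("\n").split("\t")
            # Columns 0-8 are CHROM..FORMAT; the rest are sample names.
            return columns[9:]
        if not line.startswith("#"):
            # Reached the data records without seeing the header line.
            break
    raise ValueError("no #CHROM header line found")
```

For an uncompressed file, pass an open text handle, for example `vcf_sample_names(open(path))`. For a gzip-compressed file, `gzip.open(path, "rt")` gives an equivalent handle. A single-sample file should return a list of length one.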
HipSci’s FTP site contains: