Exome sequencing is performed on all HipSci iPS cell lines that are selected for banking
after passing QC. Sequencing and primary analysis are performed at the
Wellcome Trust Sanger Institute.
HipSci’s exome-seq analysis pipeline comprises these steps:
- Map sequence reads to the human GRCh37 reference with BWA
- Post-alignment improvements:
- Realign around known indels using GATK
- Recalibrate base quality scores using GATK
- Merge sequencing runs from the same cell line.
- Calculate bq score with samtools
- Call variants using samtools mpileup and bcftools
- Phase haplotypes using SHAPEIT
- Impute genotypes using IMPUTE2
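The merge step in the list above combines all sequencing runs that belong to one cell line before variants are called. A minimal Python sketch of the grouping logic follows. The function name, the cell line labels and the file paths are hypothetical, and the sketch only decides which files belong together; it does not perform the merge itself.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def group_runs_by_cell_line(runs: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group lane-level alignment files by the cell line they came from.

    `runs` is an iterable of (cell_line, file_path) pairs. The returned
    dict maps each cell line to its files, in the order they were seen.
    """
    grouped: Dict[str, List[str]] = defaultdict(list)
    for cell_line, path in runs:
        grouped[cell_line].append(path)
    return dict(grouped)


# Hypothetical input: two lanes for line_A and one lane for line_B.
example_runs = [
    ("line_A", "line_A.lane1.bam"),
    ("line_B", "line_B.lane1.bam"),
    ("line_A", "line_A.lane2.bam"),
]
for cell_line, paths in group_runs_by_cell_line(example_runs).items():
    print(cell_line, paths)
```

Each resulting group would then be passed to a merging tool to produce one alignment file per cell line.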
The following filters are applied to called variants. Each expression is the condition under which a variant fails the filter:
- minimum read depth DP<=4
- maximum read depth DP>2000
- minimum mapping quality MQ<=25
- minimum quality for SNPs QUAL<=30
- minimum quality for indels QUAL<=60
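The thresholds above can be sketched as a small Python check. This is an illustration only, not the pipeline's actual code. It assumes each listed expression is the condition under which a variant fails, and the `Variant` class, its field names and the filter labels are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Variant:
    depth: int              # read depth (DP)
    mapping_quality: float  # mapping quality (MQ)
    qual: float             # variant quality (QUAL)
    is_indel: bool          # True for indels, False for SNPs


def failed_filters(v: Variant) -> List[str]:
    """Return the labels of every filter the variant fails (empty if it passes)."""
    failed = []
    if v.depth <= 4:
        failed.append("min_depth")
    if v.depth > 2000:
        failed.append("max_depth")
    if v.mapping_quality <= 25:
        failed.append("min_mapping_quality")
    # Indels use a stricter quality threshold than SNPs.
    if v.is_indel and v.qual <= 60:
        failed.append("min_qual_indel")
    if not v.is_indel and v.qual <= 30:
        failed.append("min_qual_snp")
    return failed


# A call with QUAL 45 passes as a SNP but fails as an indel.
print(failed_filters(Variant(depth=30, mapping_quality=50.0, qual=45.0, is_indel=False)))
print(failed_filters(Variant(depth=30, mapping_quality=50.0, qual=45.0, is_indel=True)))
```

Note that the comparisons are inclusive at the lower bounds: a variant with DP of exactly 4, MQ of exactly 25, or QUAL of exactly 30 (SNP) or 60 (indel) fails under this reading.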
Getting the data
Complete lists of exome-seq data can be found under the files tab of the cell lines and data browser or in the dataset indexes on the FTP site.
- Raw sequencing reads
– Distributed in the cram file format. Any cell line
can have multiple associated cram files; each corresponds to a single lane of sequencing.
- BWA alignment
– Distributed in the bam file format. These are the input files used for variant calling, after the post-alignment improvements.
We distribute one bam file per cell line.
- Mpileup variant calls
– Distributed in vcf file format. These are the genotypes called directly from the aligned sequence, before phasing or imputation.
We distribute a single-sample vcf file for each cell line.
- Imputed and phased genotypes
– Distributed in vcf file format. These contain the output
from SHAPEIT and IMPUTE2. We distribute a single-sample vcf file for each cell line, containing
genotypes imputed to the 1000 Genomes and UK10K reference panels.
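Because every vcf file described above is single-sample, the genotype for a cell line is always in the tenth tab-separated column of a data line. The Python sketch below reads one such line using only the standard VCF column layout. The function name and the example line are hypothetical, and for real files a dedicated VCF library is likely a better choice.

```python
from typing import Dict, List, Optional, Union


def sample_genotype(vcf_line: str) -> Dict[str, Union[str, int, bool, List[Optional[str]]]]:
    """Extract the called alleles from one data line of a single-sample VCF."""
    fields = vcf_line.rstrip("\n").split("\t")
    if len(fields) < 10:
        raise ValueError("expected 8 fixed columns, FORMAT and one sample column")
    chrom, pos, _id, ref, alt = fields[:5]

    # FORMAT (column 9) names the colon-separated values in the sample column.
    sample = dict(zip(fields[8].split(":"), fields[9].split(":")))
    gt = sample.get("GT")
    if gt is None:
        raise ValueError("no GT field in the sample column")

    # In VCF, '|' separates phased alleles and '/' separates unphased ones.
    phased = "|" in gt
    alleles = [ref] + alt.split(",")
    called = [
        None if index == "." else alleles[int(index)]
        for index in gt.replace("|", "/").split("/")
    ]
    return {"chrom": chrom, "pos": int(pos), "alleles": called, "phased": phased}


# Hypothetical heterozygous call, as it might appear before phasing.
line = "1\t12345\t.\tA\tG\t50\tPASS\tDP=30\tGT:PL\t0/1:50,0,60"
print(sample_genotype(line))
```

The `phased` flag reflects the difference between the two vcf products: calls made directly from the alignments normally use the unphased `/` separator, while phased output uses `|`.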
For managed access cell lines, exome-seq files are archived in the EGA. The data browser contains links to the relevant EGA dataset page, from where researchers can request access to the data.
For open access cell lines, exome-seq files are archived in ENA. Data are openly available to anybody, and the data browser contains direct links to the files on the ENA FTP server.
HipSci’s FTP site contains: