Exome sequencing

Exome sequencing is performed on all HipSci iPS cell lines that are selected for banking after passing QC. Sequencing and primary analysis are performed at the Wellcome Trust Sanger Institute.

Primary analysis

HipSci’s exome-seq analysis pipeline comprises these steps:

  1. Map sequence reads to the human GRCh37 reference with BWA
  2. Post-alignment improvements:
    • Realign around known indels using GATK
    • Recalibrate base quality scores using GATK
    • Merge sequencing runs from the same cell line
    • Calculate base alignment quality (BAQ, stored in the BQ tag) with samtools
  3. Call variants using samtools mpileup and bcftools
  4. Phase haplotypes using SHAPEIT
  5. Impute genotypes using IMPUTE2

The following filters are applied to called variants. Each expression gives the condition under which a variant is filtered out:

  • minimum read depth DP<=4
  • maximum read depth DP>2000
  • minimum mapping quality MQ<=25
  • minimum quality for SNPs QUAL<=30
  • minimum quality for indels QUAL<=60
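
As an illustration only, the thresholds above can be applied to a single VCF data line as in the Python sketch below. This is not the HipSci pipeline code. It assumes that DP and MQ are present in the INFO column, that QUAL is numeric, and that an indel is a record carrying the INDEL flag or an ALT allele whose length differs from REF. The function names and the filter labels (MinDP, MaxDP, MinMQ, MinSNPQ, MinIndelQ) are hypothetical.

```python
def parse_info(info):
    """Turn a VCF INFO string such as 'INDEL;DP=12;MQ=40' into a dict."""
    fields = {}
    for item in info.split(";"):
        key, _, value = item.partition("=")
        # Flag entries (no '=') are stored as True.
        fields[key] = value if value else True
    return fields


def failed_filters(vcf_line):
    """Return the (hypothetical) labels of the filters a VCF data line fails."""
    cols = vcf_line.rstrip("\n").split("\t")
    ref, alts, qual = cols[3], cols[4].split(","), float(cols[5])
    info = parse_info(cols[7])
    depth = int(info["DP"])   # assumption: DP is in the INFO column
    mapq = float(info["MQ"])  # assumption: MQ is in the INFO column

    # Assumption: INDEL flag, or any non-symbolic ALT allele whose length
    # differs from REF, marks the record as an indel.
    is_indel = "INDEL" in info or any(
        len(alt) != len(ref)
        for alt in alts
        if alt != "." and not alt.startswith("<")
    )

    failed = []
    if depth <= 4:
        failed.append("MinDP")
    if depth > 2000:
        failed.append("MaxDP")
    if mapq <= 25:
        failed.append("MinMQ")
    if is_indel and qual <= 60:
        failed.append("MinIndelQ")
    if not is_indel and qual <= 30:
        failed.append("MinSNPQ")
    return failed


if __name__ == "__main__":
    example = "1\t200\t.\tA\tAT\t50\t.\tINDEL;DP=3;MQ=20"
    print(failed_filters(example))
```

A record that returns an empty list passes every filter. Note that the SNP and indel quality thresholds are mutually exclusive in this sketch: an indel is only checked against the QUAL<=60 threshold.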

Getting the data

Complete lists of exome-seq data can be found under the files tab of the cell lines and data browser or in the dataset indexes on the FTP site.

  • Raw sequencing reads – Distributed in CRAM format. A cell line can have multiple associated CRAM files; each corresponds to a single lane of sequencing.
  • BWA alignment – Distributed in BAM format. These are the files produced by the post-alignment improvements and used as input for variant calling. We distribute one BAM file per cell line.
  • Mpileup variant calls – Distributed in VCF format. These are the genotypes called directly from the aligned sequence, before phasing or imputation. We distribute a single-sample VCF file for each cell line.
  • Imputed and phased genotypes – Distributed in VCF format. These contain the output from SHAPEIT and IMPUTE2. We distribute a single-sample VCF file for each cell line, containing genotypes imputed to the 1000 Genomes and UK10K reference panels.

For managed access cell lines, exome-seq files are archived in the EGA. The data browser contains links to the relevant EGA dataset page, from where researchers can request access to the data.

For open access cell lines, exome-seq files are archived in ENA. Data are openly available to anybody, and the data browser contains direct links to the files on the ENA FTP server.

Resources

HipSci’s FTP site contains: