Coverage maps of SNP chips and their coverage variation across the genome

Genome-wide association (GWA) studies rely on commercial SNP genotyping panels, for which a common evaluation criterion has been the global coverage of the genome. However, the level of variation in coverage is also important for evaluation of SNP chips. Here, we provide a detailed coverage map for currently available SNP chips.

Coverage map

This coverage map (Supplementary Figure 1 of our paper, gzipped version, 6.05MB) contains a detailed, high-resolution graph of the local coverage rate of the following commercial SNP chips: Affymetrix SNP Array 5.0 (in black) and 6.0 (blue), and Illumina HumanHap300 (red), HumanHap550 (green), HumanHap650Y (cyan), and Human1M (purple). The red bars at the top and bottom indicate the transcribed regions of known protein-coding genes (based on the knownGene table obtained from the UCSC human genome release hg17).

The variation of coverage for the five SNP chips (global coverage shown as dotted lines):

Raw data for local coverage: CEU, CHB, JPT, YRI. The explanations of the columns are here.

Coverage of known genes

The variation of coverage for known genes with ≥5 HapMap common SNPs between the transcriptional start and end positions:

Supplementary Table 1 of our paper for gene coverage, in which gene regions are defined between transcriptional start and end points (inclusive): CEU, CHB, JPT, YRI. The explanations of the columns are here.

Gene coverage results when 5 kb is added to both ends of a gene: CEU.pm5k, CHB.pm5k, JPT.pm5k, YRI.pm5k.

Gene coverage results when 10 kb is added to both ends of a gene: CEU.pm10k, CHB.pm10k, JPT.pm10k, YRI.pm10k.

Note: For short genes, the results may change dramatically across the above three definitions of gene regions.
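The effect of the three region definitions can be illustrated with a small Python sketch. It is not the code used for the paper: the function name, the SNP positions, and the gene coordinates are hypothetical, and it only counts SNPs that fall inside a gene region (inclusive of both end points) with 0, 5 kb, or 10 kb of padding.

```python
from bisect import bisect_left, bisect_right


def snps_in_region(sorted_positions, start, end, pad=0):
    """Count SNPs with start - pad <= position <= end + pad.

    sorted_positions must be sorted in ascending order.
    The padded start is clamped at 0 (an assumption of this sketch).
    """
    lo = bisect_left(sorted_positions, max(start - pad, 0))
    hi = bisect_right(sorted_positions, end + pad)
    return hi - lo


# Hypothetical common-SNP positions on one chromosome (base pairs).
positions = [2000, 6000, 10500, 11000, 16000, 21000, 30000]

# Hypothetical short gene spanning 10,000 to 12,000 bp.
gene_start, gene_end = 10000, 12000

counts = [
    snps_in_region(positions, gene_start, gene_end, pad)
    for pad in (0, 5000, 10000)
]
print(counts)  # [2, 4, 6]
```

In this made-up example the 2 kb gene holds two SNPs by itself, four with 5 kb of padding, and six with 10 kb, so most SNPs in the padded regions lie outside the gene, which is one way the coverage estimate for a short gene can shift between definitions.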

Methodology

For each region, we use the formula of Barrett and Cardon (Nat Genet 2006;38: 659-662) to estimate coverage rate: [L / (R – T) × (G – T) + T] / G, where
  • R: The number of common SNPs in the HapMap
  • T: The number of common SNPs on the SNP chip
  • L: The number of common SNPs that are not on the SNP chip but are tagged at r² ≥ 0.8 by at least one SNP on the chip within 250 kb
  • G: The total number of common SNPs in the region, including those that have already been discovered and those that have yet to be discovered. For a 1 Mb region, the average number of common SNPs is estimated to be about 2,631, based on the estimated numbers of common SNPs (7.5×10^6) [Barrett and Cardon 2006] and euchromatic base pairs (2.85×10^9) in the human genome.
  • Note: We recognize that different estimates of G may lead to different values of local coverage rate. However, the above formula can be rewritten as L / (R – T) + [1 – L / (R – T)] × T / G, which indicates that the value of G has little effect on the final estimate as long as the fraction of common SNPs included in the SNP panel, T / G, is small, which is true for the five SNP chips we evaluated.
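The formula and its rewritten form can be checked numerically. The following is a minimal Python sketch, not the code used for the paper. The function names and the example counts (L = 630, R = 1000, T = 100) are hypothetical; only the formula and the genome-wide figures 7.5×10^6 and 2.85×10^9 come from the text above. Raising an error when R equals T is an assumption of the sketch, since the text does not say how that case is handled.

```python
def coverage(L, R, T, G):
    """Coverage rate [L / (R - T) * (G - T) + T] / G for one region."""
    if R <= T:
        # Assumption: no untyped HapMap common SNPs means the tagged
        # fraction L / (R - T) is undefined, so refuse to compute it.
        raise ValueError("R must be greater than T")
    tagged_fraction = L / (R - T)
    return (tagged_fraction * (G - T) + T) / G


def coverage_rewritten(L, R, T, G):
    """Equivalent form: L / (R - T) + [1 - L / (R - T)] * T / G."""
    tagged_fraction = L / (R - T)
    return tagged_fraction + (1 - tagged_fraction) * T / G


def expected_common_snps(region_bp, genome_common_snps=7.5e6,
                         euchromatic_bp=2.85e9):
    """Average number of common SNPs expected in a region of region_bp."""
    return genome_common_snps / euchromatic_bp * region_bp


# G for a 1 Mb region, from the genome-wide estimates.
G = expected_common_snps(1_000_000)
print(round(G, 2))  # 2631.58

# Hypothetical counts for one 1 Mb region.
L, R, T = 630, 1000, 100
print(coverage(L, R, T, G), coverage_rewritten(L, R, T, G))

# Sensitivity to G: the estimate moves only through the T / G term.
for g in (2000, 2631, 4000):
    print(g, round(coverage(L, R, T, g), 4))
```

With these made-up counts the tagged fraction L / (R – T) is 0.7, and changing G from 2,000 to 4,000 moves the estimate only from 0.715 to 0.7075, in line with the note that G matters little when T / G is small.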

SNP Array 6.0 and Human1M: About 10% of the SNPs on each chip are not in the HapMap. According to Affymetrix, the SNP Array 6.0 has 934,968 SNPs, of which 99,854 (10.7%) are not in the HapMap, including 72,379 common SNPs for CEU, 76,016 for CHB, 70,356 for JPT, and 83,412 for YRI. According to Illumina, the Human1M has 1,072,820 SNPs, of which 125,688 (11.7%) are not in the HapMap, including 70,995 common SNPs for CEU, 67,453 for CHB/JPT, and 77,729 for YRI. Because of this, the genomic coverage of these two chips may be underestimated if only the HapMap SNPs are considered in the coverage calculation.

To address this problem, we calculated an alternative coverage estimate as follows, using the SNP Array 6.0 as an example. Suppose there is an “updated HapMap dataset” that consists of the current HapMap SNPs plus the SNPs on the SNP Array 6.0. From this “updated data”, we can obtain the number of common SNPs, denoted R1, and the number of common SNPs on the chip, denoted T1; for example, for Caucasians (CEU), R1 = R + 72,379 and T1 = T + 72,379. Because LD information between the “new” SNPs and the other HapMap SNPs is lacking, we do not know how many HapMap SNPs are tagged by these “new” SNPs, so L1 cannot be estimated directly. If we assume, however, that the number of tagged common SNPs not on the chip increases in proportion to the number of common SNPs on the chip, that is, T1 / T = L1 / L, then L1 can be estimated as (T1 / T) × L. Based on the “updated HapMap data”, the genomic coverage of the SNP Array 6.0 is then calculated as [L1 / (R1 – T1) × (G – T1) + T1] / G.
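The adjustment described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the code used for the paper: the function name `adjusted_coverage`, the parameter `new_common` (the number of common chip SNPs absent from the HapMap), and the example counts are all hypothetical.

```python
def adjusted_coverage(L, R, T, G, new_common):
    """Coverage based on the "updated HapMap data".

    new_common: common SNPs on the chip that are not in the HapMap
    (hypothetical parameter name).
    """
    R1 = R + new_common   # common SNPs in the updated dataset
    T1 = T + new_common   # common SNPs on the chip in the updated dataset
    L1 = (T1 / T) * L     # proportionality assumption: T1 / T = L1 / L
    # R1 - T1 equals R - T, because the same SNPs are added to both.
    return (L1 / (R1 - T1) * (G - T1) + T1) / G


# Hypothetical counts for one region; G as for a 1 Mb region.
L, R, T, G = 630, 1000, 100, 2631

unadjusted = adjusted_coverage(L, R, T, G, new_common=0)
adjusted = adjusted_coverage(L, R, T, G, new_common=10)
print(unadjusted, adjusted)
```

With `new_common=0` the function reduces to the unadjusted formula. Because R1 – T1 equals R – T while L1 grows with T1, the scaled L1 could in principle exceed the number of untyped SNPs in a region; the sketch does not guard against that case.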

Associated paper

Li M, Li C, Guan W (2008) Evaluation of coverage variation of SNP chips for genome-wide association studies. European Journal of Human Genetics (in press)

Other papers of relevance

Please contact Chun Li (chun.li@vanderbilt.edu) if you have any questions.

Topic revision: r25 - 18 Jan 2008, ChunLi
 
