What are the different types of SNPs?


When I search for this online I get answers such as substitutions, deletions, insertions, etc. But I mean in the sense that I have been reading different terms in front of the word SNP, such as: lead SNP, tag SNP, sentinel SNP, indirect SNP, direct SNP, secondary SNP, imputed SNP, etc.

From my understanding, lead/tag/sentinel SNPs are all the same thing (the SNP with the lowest P value in a locus, representing the variance in that section). An indirect SNP is also a lead/tag/sentinel SNP, and is indirect because it has not been genotyped but is in LD with a genotyped SNP, while still having the lowest P value. Also, a secondary SNP is an independently correlating SNP, and is specifically independent of the tag SNP? Is there a resource defining these sorts of categories? Apologies if this is misinformed; I have been trying to learn about GWAS from different studies and would appreciate any help clarifying this.

  1. Restriction Fragment Length Polymorphism (RFLP)
  2. Amplified Fragment Length Polymorphism (AFLP)
  3. Random Amplified Polymorphic DNA (RAPD)
  4. Cleaved Amplified Polymorphic Sequences (CAPS)
  5. Simple Sequence Repeat (SSR) Length Polymorphism
  6. Single Strand Conformational Polymorphism (SSCP)
  7. Heteroduplex Analysis (HA)
  8. Single Nucleotide Polymorphism (SNP)
  9. Expressed Sequence Tags (EST)
  10. Sequence Tagged Sites (STS)

Restriction Fragment Length Polymorphisms (RFLPs):

RFLPs refer to variations found within a species in the lengths of DNA fragments generated by specific endonucleases. RFLPs were the first type of DNA marker developed to distinguish individuals at the DNA level. The RFLP technique was developed before the discovery of the Polymerase Chain Reaction (PCR).

The advantages, disadvantages and uses of this technique are presented below:

The RFLP technique has several advantages. It is a cheaper and simpler technique than DNA sequencing and does not require special instrumentation. The majority of RFLP markers are co-dominant and highly locus specific. They are powerful tools for comparative and synteny mapping.

It is useful in developing other markers such as CAPS and INDEL. Several samples can be screened simultaneously by this technique using different probes. RFLP genotypes for single copy or low copy number genes can be easily scored and interpreted.

Developing sets of RFLP probes and markers is labour intensive. The technique requires a large amount of high-quality DNA. The multiplex ratio is low, typically one marker per gel, so genotyping throughput is low. It involves the use of radioactive chemicals. RFLP fingerprints for multi-gene families are often complex and difficult to score. RFLP probes cannot be shared between laboratories.

They can be used in determining paternity cases. In criminal cases, they can be used in determining source of DNA sample. They can be used to determine the disease status of an individual. They are useful in gene mapping, germplasm characterization and marker assisted selection. They are useful in detection of pathogen in plants even if it is in latent stage.
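
The RFLP principle described above can be illustrated with a short sketch: a single-base change that destroys a restriction site changes the fragment lengths produced by digestion. The sequences, the EcoRI site and the cut offset below are purely illustrative, not from any real assay.

```python
# Hypothetical illustration of RFLP: a SNP that destroys a restriction
# site (EcoRI, GAATTC) changes the digestion fragment lengths.

def digest(seq, site="GAATTC", cut_offset=1):
    """Return fragment lengths after cutting seq at every occurrence of site.

    cut_offset is the position within the recognition site where the
    enzyme cleaves (EcoRI cuts between G and AATTC).
    """
    cuts, start = [], 0
    while True:
        i = seq.find(site, start)
        if i == -1:
            break
        cuts.append(i + cut_offset)
        start = i + 1
    bounds = [0] + cuts + [len(seq)]
    return [b - a for a, b in zip(bounds, bounds[1:])]

allele_a = "ATCG" * 10 + "GAATTC" + "ATCG" * 20   # site present
allele_b = "ATCG" * 10 + "GACTTC" + "ATCG" * 20   # A->C change destroys the site

print(digest(allele_a))  # two fragments
print(digest(allele_b))  # one uncut fragment
```

On a gel, the two-fragment and one-fragment patterns would distinguish the two alleles, which is the polymorphism RFLP detects.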

Amplified Fragment Length Polymorphism (AFLP):

AFLPs are differences in restriction fragment lengths caused by SNPs or INDELs that create or abolish restriction endonuclease recognition sites. AFLP assays are performed by selectively amplifying a pool of restriction fragments using PCR. The AFLP technique was originally known as selective restriction fragment amplification.

It provides a very high multiplex ratio and genotyping throughput. AFLP markers are highly reproducible across laboratories. No marker development work is needed; however, AFLP primer screening is often necessary to identify optimal primer specificities and combinations.

No special instrumentation is needed for performing AFLP assays; however, special instrumentation is needed for co-dominant scoring.

Start-up costs are moderately low. AFLP assays can be performed using very small DNA samples (typically 0.2 to 2.5 pg per individual). The technology can be applied to virtually any organism with minimal initial development.

The maximum polymorphic information content for any bi-allelic marker is 0.5. High quality DNA is needed to ensure complete restriction enzyme digestion. DNA quality may or may not be a weakness depending on the species. Rapid methods for isolating DNA may not produce sufficiently clean template DNA for AFLP analysis.

Proprietary technology is needed to score heterozygotes and ++ homozygotes. Otherwise, AFLPs must be dominantly scored. Dominance may or may not be a weakness depending on the application.

The homology of a restriction fragment cannot be unequivocally ascertained across genotypes or mapping populations. Developing locus specific markers from individual fragments can be difficult and does not seem to be widely done.

The switch to non-radioactive assays has not been rapid. Chemiluminescent AFLP fingerprinting methods have been developed and seem to work well.

The fingerprints produced by fluorescent AFLP assay methods are often difficult to interpret and score and thus do not seem to be widely used. AFLP markers often densely cluster in centromeric regions in species with large genomes, e.g., barley (Hordeum vulgare L.) and sunflower (Helianthus annuus L.).

This technique has been widely used in the construction of genetic maps containing high densities of DNA marker. In plant breeding and genetics, AFLP markers are used in varietal identification, germplasm characterization, gene tagging and marker assisted selection.

Random Amplified Polymorphic DNA (RAPDs):

RAPD refers to polymorphism found within a species in DNA segments amplified at random by PCR; no restriction digestion is involved. RAPDs are PCR-based DNA markers. RAPD marker assays are performed using a single DNA primer of arbitrary sequence.

RAPD primers are readily available, being universal. They provide moderately high genotyping throughput. The technique is a simple PCR assay (no blotting, no radioactivity) and requires no special equipment; only PCR is needed. The start-up cost is low.

RAPD marker assays can be performed using very small DNA samples (5 to 25 ng per sample). RAPD primers are universal and can be commercially purchased. RAPD markers can be easily shared between laboratories. Locus-specific, co-dominant PCR-based markers can be developed from RAPD markers. It provides more polymorphism than RFLPs.

The detection of polymorphism is limited, and the maximum polymorphic information content for any bi-allelic marker is 0.5. RAPD markers are dominantly scored. The reproducibility of RAPD assays across laboratories is often low. The homology of fragments across genotypes cannot be ascertained without mapping. It is not applicable in marker-assisted breeding programmes.

This technique can be used in various ways such as for varietal identification, DNA fingerprinting, gene tagging and construction of linkage maps. It can also be used to study phylogenetic relationship among species and sub-species and assessment of variability in breeding populations.

Cleaved Amplified Polymorphic Sequences (CAPS):

CAPS polymorphisms are differences in restriction fragment lengths caused by SNPs or INDELs that create or abolish restriction endonuclease recognition sites in PCR amplicons produced by locus-specific oligonucleotide primers.

CAPS assays are performed by digesting locus-specific PCR amplicons with one or more restriction enzymes and separating the digested DNA on agarose or polyacrylamide gels.

CAPS analysis is versatile and can be combined with single strand conformational polymorphism (SSCP), sequence-characterized amplified region (SCAR), or random amplified polymorphic DNA (RAPD) analysis to increase the chance of finding a DNA polymorphism.

Michaels and Amasino (1998) proposed a variant of the CAPS method called dCAPS based on SNPs.

The genotyping throughput is moderately high. It is a simple PCR assay. Markers are developed from the DNA sequences of previously mapped RFLP markers. Most CAPS markers are co-dominant and locus specific. No special equipment is needed to perform manual CAPS marker assays.

CAPS marker assays can be performed using semi-automated methods, e.g., fluorescent assays on a DNA sequencer (e.g., ABI377). Start-up costs are low for manual assay methods. CAPS assays can be performed using very small DNA samples (typically 50 to 100 ng per individual). Most CAPS genotypes are easily scored and interpreted. CAPS markers are easily shared between laboratories.

Typically, a battery of restriction enzymes must be tested to find polymorphisms. Although CAPS markers still have great utility and should not be overlooked, other methods have emerged as tools for screening locus-specific DNA fragments for polymorphisms, e.g., SNP assays. The development of easily scored and interpreted assays may be difficult for some genes, especially those belonging to multi-gene families.

This is straightforward way to develop PCR-based markers from the DNA sequences of previously mapped RFLP markers. It is a simple method that builds on the investment of an RFLP map and eliminates the need for DNA blotting.

Simple Sequence Repeats (SSRs):

Simple sequence repeats (SSRs) or microsatellites are tandemly repeated mono-, di-, tri-, tetra-, penta-, and hexanucleotide motifs. SSR length polymorphisms are caused by differences in the number of repeats. SSR loci are individually amplified by PCR using pairs of oligonucleotide primers specific to unique DNA sequences flanking the SSR sequence.

Jeffreys (1985) showed that some restriction fragment length polymorphisms are caused by variable number tandem repeats (VNTRs). The name “minisatellite” was coined because of the similarity of VNTRs to the larger satellite DNA repeats.

SSR markers tend to be highly polymorphic. The genotyping throughput is high. This is a simple PCR assay. Many SSR markers are multi-allelic and highly polymorphic. SSR markers can be multiplexed, either functionally by pooling independent PCR products or by true multiplex- PCR. Semi-automated SSR genotyping methods have been developed. Most SSRs are co-dominant and locus specific.

No special equipment is needed for performing SSR assays; however, special equipment is needed for some assay methods, e.g., semi-automated fluorescent assays performed on a DNA sequencer. Start-up costs are low for manual assay methods (once the markers are developed). SSR assays can be performed using very small DNA samples (typically 100 ng per individual). SSR markers are easily shared between laboratories.

The development of SSRs is labor intensive. SSR marker development costs are very high. SSR markers are taxa specific. Start-up costs are high for automated SSR assay methods. Developing PCR multiplexes is difficult and expensive. Some markers may not multiplex.

SSR markers are used for mapping of genes in eukaryotes.
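
The repeat-number logic behind SSR length polymorphism can be sketched as follows; the flanking length, repeat motif and repeat counts are invented for illustration.

```python
# Illustrative sketch of SSR length polymorphism: amplicon size is the
# flanking sequence plus the repeat unit times the repeat number, so
# alleles differing in repeat count separate by size on a gel.

FLANK_BP = 120        # combined flanking length amplified by the two primers
UNIT = "GA"           # dinucleotide repeat motif

def amplicon_size(n_repeats):
    return FLANK_BP + len(UNIT) * n_repeats

def genotype(allele1, allele2):
    """Co-dominant scoring: a heterozygote shows two bands."""
    return sorted({amplicon_size(allele1), amplicon_size(allele2)})

print(genotype(12, 12))  # homozygote: one band, [144]
print(genotype(12, 15))  # heterozygote: two bands, [144, 150]
```

Because both alleles are visible, the marker is co-dominant, matching the description above.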

Single Strand Conformational Polymorphisms (SSCPs):

SSCPs refer to DNA polymorphisms produced by differential folding of single-stranded DNA harboring mutations. The conformation of the folded DNA molecule is produced by intra-molecular interactions and is thus a function of the DNA sequence.

SSCP marker assays are performed using heat-denatured DNA on non-denaturing DNA sequencing gels. Special gels (e.g., mutation detection enhancement gels) have been developed to enhance the discovery of single-strand conformational polymorphisms caused by INDELs, SNPs, or SSRs.

It is a simple PCR assay. Many SSCP markers are multi-allelic and highly polymorphic. Most SSCPs are co-dominant and locus specific. No special equipment is needed. Start-up costs are low. SSCP marker assays can be performed using very small DNA samples (typically 10 to 50 ng per individual).

SSCP markers are easily shared between laboratories. SSCP gels can be silver stained (no radioactivity). The complexity of PCR products can be assessed and individual fragments can be isolated and sequenced.

The development of SSCP markers is labor intensive. SSCP marker development costs can be high. SSCP marker analysis cannot be automated.

SSCPs have been widely used in human genetics to screen disease genes for DNA polymorphisms. Although SSCP analysis does not uncover every DNA sequence polymorphism, the methodology is straightforward and a significant number of polymorphisms can be discovered. SSCP analysis can be a powerful tool for assessing the complexity of PCR products.

Heteroduplex Analysis (HA):

It refers to DNA polymorphisms detected by separating homoduplex from heteroduplex DNA using non-denaturing gel electrophoresis or partially denaturing high performance liquid chromatography.

Single-base mismatches between genotypes produce heteroduplexes; thus, the presence of heteroduplexes signals the presence of DNA polymorphisms. Heteroduplex analyses can be rapidly and efficiently performed on numerous genotypes before specific alleles are sequenced, thereby greatly reducing sequencing costs in SNP discovery and SNP marker development.

It is a powerful method for SNP discovery. Automated HA can be performed using HPLC. Most heteroduplex markers are co-dominant and locus specific. HA can be performed using very small DNA samples (typically 10 to 50 ng per individual). HA markers are easily shared between laboratories.

Requires special equipment. One protocol may not be sufficient for heteroduplex analyses of different targets via HPLC.

Heteroduplex analysis has been used mostly in human genetics to screen disease genes for DNA polymorphisms. In plant breeding, it is used to detect pathogens in the latent stage and is thus useful in selecting disease-free plants. It is also useful in the discovery of single nucleotide polymorphisms.

Single Nucleotide Polymorphism (SNP):

The variations found at a single nucleotide position are known as single nucleotide polymorphisms, or SNPs. Such variation results from substitution, deletion or insertion. This type of polymorphism has two alleles and is also called a biallelic locus. It is the most common class of DNA polymorphism, found both in natural lines and after induced mutagenesis. The main features of SNP markers are given below.

1. SNP markers are highly polymorphic and mostly biallelic.

2. The genotyping throughput is very high.

3. SNP markers are locus specific.

4. Such variation results from substitution, deletion or insertion.

5. SNP markers are an excellent long-term investment.

6. SNP markers can be used to pinpoint functional polymorphism.

7. This technique requires small amount of DNA.

SNP markers are useful in gene mapping. SNPs help in detection of mutations at molecular level. SNP markers are useful in positional cloning of a mutant locus. SNP markers are useful in detection of disease causing genes.

Most SNPs are biallelic and therefore less informative than SSRs. Multiplexing is not possible for all loci. Some SNP assay techniques are costly, and the development of SNP markers is labour intensive. About three times as many SNPs as SSR markers are required to prepare genetic maps.

SNPs are useful in preparing genetic maps and have been used to prepare human genetic maps. In plant breeding, SNPs have been used to a lesser extent.
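
Why biallelic SNPs are individually less informative than multi-allelic SSRs can be shown with gene diversity (1 − Σp²), one common measure of marker informativeness; the allele frequencies below are hypothetical.

```python
# Minimal sketch of why biallelic SNPs are less informative than
# multi-allelic SSRs, using gene diversity (1 - sum of squared allele
# frequencies) as the informativeness measure.

def gene_diversity(freqs):
    assert abs(sum(freqs) - 1.0) < 1e-9, "allele frequencies must sum to 1"
    return 1.0 - sum(p * p for p in freqs)

# A biallelic SNP at its most informative: two alleles at 0.5 each.
print(gene_diversity([0.5, 0.5]))           # 0.5 -- the maximum for two alleles

# A hypothetical SSR with eight equally frequent alleles.
print(round(gene_diversity([1/8] * 8), 3))  # 0.875 -- far more informative
```

This is why roughly three times as many SNPs as SSRs are needed when building genetic maps, as noted above.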

Expressed Sequence Tags (EST):

Expressed Sequence Tags (ESTs) are small pieces of DNA whose location and sequence on the chromosome are known. The term Expressed Sequence Tag (EST) was first used by Venter and his colleagues in 1991. The main features of EST markers are given below.

1. ESTs are short DNA sequences (200-500 nucleotide long).

2. They are a type of sequence tagged sites (STS).

3. ESTs consist of exons only.

It is a rapid and inexpensive technique for locating a gene. ESTs are useful in discovering new genes related to genetic diseases. They can also be used to study tissue-specific gene expression.

ESTs lack primer specificity. It is a time-consuming and labour-intensive technique, and its precision is lower than that of other techniques. It is difficult to obtain large (>6 kb) transcripts, and multiplexing is not possible for all loci.

ESTs are commonly used to map genes of known function. They are also used for phylogenetic studies and generating DNA arrays.

Sequence Tagged Sites (STS):

In genomics, a sequence tagged site (STS) is a short DNA sequence that has a single copy in a genome and whose location and base sequence are known. Main features of STS markers are given below.

1. STSs are short DNA sequences (200-500 nucleotide long).

2. STSs occur only once in the genome.

3. STSs are detected by PCR in the presence of all other genomic sequences.

4. STSs are derived from cDNAs.

STSs are useful in the physical mapping of genes. The technique permits sharing of data across laboratories. It is rapid and more specific than DNA hybridization techniques, has a high degree of accuracy, and can be automated.

Development of STSs is a difficult task. It is a time-consuming and labour-intensive technique that requires high technical skill.

STS mapping is the most powerful physical mapping technique. It can be used to identify any locus on a chromosome. STSs are used as standard markers to find genes in any region of the genome and for constructing detailed maps of large genomes.

Unboxing mutations: Connecting mutation types with evolutionary consequences

Emma L. Berdan, Department of Ecology, Environment and Plant Sciences, Science for Life Laboratory, Stockholm University, Stockholm SE-10691, Sweden.

Inês Fragata, cE3c – Centre for Ecology, Evolution and Environmental Changes, Faculdade de Ciências, Universidade de Lisboa, Lisboa, Portugal.

Laboratory of Genetics, University of Wisconsin-Madison, Madison, WI, USA

Department of Ecology, Environment and Plant Sciences, Science for Life Laboratory, Stockholm University, Stockholm, Sweden

School of Biological Sciences – Organisms and the Environment, University of East Anglia, Norwich, UK

Department of Organismal Biology – Systematic Biology, Science for Life Laboratory, Uppsala University, Uppsala, Sweden

IST Austria, Klosterneuburg, Austria

Faculty of Biosciences and Aquaculture, Nord University, Bodø, Norway

cE3c – Centre for Ecology, Evolution and Environmental Changes, Faculdade de Ciências, Universidade de Lisboa, Lisboa, Portugal


A key step in understanding the genetic basis of different evolutionary outcomes (e.g., adaptation) is to determine the roles played by different mutation types (e.g., SNPs, translocations and inversions). To do this we must simultaneously consider different mutation types in an evolutionary framework. Here, we propose a research framework that directly utilizes the most important characteristics of mutations, their population genetic effects, to determine their relative evolutionary significance in a given scenario. We review known population genetic effects of different mutation types and show how these may be connected to different evolutionary outcomes. We provide examples of how to implement this framework and pinpoint areas where more data, theory and synthesis are needed. Linking experimental and theoretical approaches to examine different mutation types simultaneously is a critical step towards understanding their evolutionary significance.

Materials and methods

Sources of WGS benchmarking dataset acquisition

NA12878 (HG001) WGS data

The NIST reference material NA12878 (HG001) was sequenced at NIST, Gaithersburg, MD for the PrecisionFDA Truth Challenge. WGS library preparation was conducted using Illumina TruSeq (LT) DNA PCR-free Sample Prep kits (FC-121-3001), and paired-end reads (insert size: 550 bp) were generated on the HiSeq 2500 platform in rapid run mode (2 flow cells per genome). Raw paired-end fastq files (HG001-NA12878-50x_1.fastq.gz and HG001-NA12878-50x_2.fastq.gz) were obtained from the PrecisionFDA Truth Challenge site. In addition, another set of NA12878 raw WGS data, sequenced by Supernat et al. [24], was downloaded from the NCBI SRA repository (accession number: SRR6794144) using the SRA Toolkit.

“Synthetic-diploid” WGS data

Paired-end raw fastq files of the “synthetic-diploid” WGS data were obtained from the European Nucleotide Archive (accession number: SAMEA3911976). The reference material, a 1:1 mixture of the CHM1 (SAMN02743421) and CHM13 (SAMN03255769) cell lines, was sequenced on the HiSeq X10 platform using a PCR-free library protocol (Kapa Biosystems reagents) [27]. Two independently replicated runs, ERR1341793 (raw reads ERR1341793_1.fastq.gz and ERR1341793_2.fastq.gz) and ERR1341796 (raw reads ERR1341796_1.fastq.gz and ERR1341796_2.fastq.gz), were used for the benchmarking exercises.

Simulated WGS data

In addition to real WGS data, reads were synthesized in silico using the tool Neat-GenReads v2.0 [35]. Briefly, two independent sets of simulated paired-end reads in fastq format, together with true-positive variant datasets in VCF format, were generated from a random mutation profile (average mutation rate: 0.002) and a user-defined mutation profile (using the golden truth callset assembled from the CHM1 and CHM13 haploid cell lines), respectively. The simulation was based on the human reference genome build GRCh37 with decoy sequences, with a read length of 150 bp, an average coverage of 40X, and a median insert size of 350 ± 70 bp.

Implementation of variant calling pipelines

Germline variant calling was performed using three pipelines: (1) GATK v4.1.0.0 [36], (2) DRAGEN v3.3.11 and (3) DeepVariant v0.7.2 (see flowchart in Fig. 1) [23].

Flowchart of the benchmarking analysis of different variant calling pipeline combinations (GATK, DRAGEN and DeepVariant).

The GATK pipeline was applied following the GATK best-practices workflow. The raw paired-end reads were mapped to the GRCh37.37d5 reference genome by BWA-MEM v0.7.15 [37]. Aligned reads were converted to BAM files and sorted by genome position after marking duplicates using Picard modules. The raw BAM files were refined by Base Quality Score Recalibration (BQSR) using default parameters. Variant calling (SNPs and indels) was performed with the HaplotypeCaller module. To improve efficiency, the whole genome was split into 14 fractions run in parallel, followed by merging of all runs into a final VCF file. Additionally, we used Variant Quality Score Recalibration (VQSR) to filter the original VCF files, following GATK recommendations for parameter settings: HapMap 3.3, Omni 2.5, dbSNP 138 and 1000 Genomes phase I as SNP training sets, and Mills and 1000 Genomes phase I data for indels.

The DRAGEN pipeline followed a similar procedure to that described for GATK best practices, including mapping and alignment, sorting, duplicate marking, haplotype calling and VQSR filtering.

The DeepVariant pipeline was run via a Singularity framework in accordance with the online instructions. In general, this consisted of three steps: (1) the make_examples module consumes reads and the reference genome to create TensorFlow examples for evaluation with deep learning models; (2) the call_variants module consumes TFRecord files created by the make_examples module and evaluates the model on each example in the input TFRecord; (3) the postprocess_variants module reads the output TFRecord files from the call_variants module, combines multi-allelic records and writes out a VCF file. DeepVariant uses only transformed aligned sequencing reads for variant calling, so the processed BAM file from the GATK or DRAGEN pipeline was fed as input.

Six VCF files were finally generated for each WGS dataset; these represent different parameter settings and processing combinations of the pipelines, as depicted in Fig. 1 (i.e. DV_gatk4: GATK for the BAM file and DeepVariant for variant calling; DV_dragen3: DRAGEN for the BAM file and DeepVariant for variant calling; GATK4_raw: GATK for both the BAM file and variant calling; GATK4_vqsr: the callset from GATK4_raw filtered with VQSR; Dragen3_raw: DRAGEN for both the BAM file and variant calling; and Dragen3_vqsr: the callset from Dragen3_raw filtered with VQSR). In addition, a merged VCF file was generated by combining the variants called by DV_gatk4, DV_dragen3, GATK4_raw and Dragen3_raw using bcftools v1.10.2 [38]; only variants called with the support of at least two pipelines were kept.
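
The consensus step can be sketched as below: keep only variants supported by at least two of the four callsets. The variant keys and the in-memory representation are invented for illustration; the actual merging was done with bcftools on real VCF files.

```python
# Toy sketch of a >=2-pipeline consensus filter over variant callsets.
from collections import Counter

def consensus(callsets, min_support=2):
    """callsets: list of sets of (chrom, pos, ref, alt) variant keys."""
    support = Counter(v for cs in callsets for v in cs)
    return {v for v, n in support.items() if n >= min_support}

# Hypothetical callsets from the four pipelines.
dv_gatk4    = {("1", 1000, "A", "G"), ("1", 2000, "C", "T")}
dv_dragen3  = {("1", 1000, "A", "G")}
gatk4_raw   = {("1", 2000, "C", "T"), ("2", 500, "G", "A")}
dragen3_raw = {("2", 500, "G", "A")}

kept = consensus([dv_gatk4, dv_dragen3, gatk4_raw, dragen3_raw])
print(sorted(kept))  # every variant here happens to have two supporters
```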

Computing environment and resources

Variant calling processes were run both on a high-performance computing (HPC) cluster and on a local virtual machine (VM) within the sensitive data platform (TSD) at the University of Oslo. Each node in the HPC cluster has 64 AMD CPU cores, 512 GB of physical memory, a CentOS 7 operating system and a BeeGFS network file system. The FPGA hardware infrastructure for the DRAGEN pipeline was installed on one dedicated node. The local VM had 40 CPU cores, 1.5 TiB of physical memory, a 2 TiB local disk in ext4 format and CentOS 7.

Benchmark consensus of VCF files

The gold standard truth callset and high confidence genomic intervals (NIST v3.3.2) for the NA12878 (HG001) dataset were obtained from the Genome in a Bottle release site. To calculate the performance metrics, we used the vcfeval comparison engine (version 0.3.8) for comparison of diploid genotypes at the haplotype level. The variant calls from the WGS data of the CHM1/CHM13 mixture were compared to the “synthetic-diploid” benchmark truth callset and high-confidence regions (i.e. full.37d5.vcf.gz and full.37d5.bed.gz, which are included in the CHM-eval kit, version 20180222) using the vcfeval comparison engine [27]. For benchmarking variants identified in simulated WGS data, we performed a consensus evaluation against their true-positive callsets, both with and without high-confidence regions (i.e. HG001_GRCh37_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X-SOLID_CHROM1-X_v.3.3.2_highconf_nosomaticdel.bed). The definitions of true positive (TP), false positive (FP) and false negative (FN) were based on two variant matching stringencies: “genotype match” (most strict; truth and query are counted as true positives when their unphased genotypes and alleles can be phased to produce a matching pair of haplotype sequences for a diploid genome) and “local match” (less strict; truth and query variants are counted as true positives if their reference span intervals are closer than a pre-defined local matching distance) [39]. Precision, Recall and F1-score were calculated as TP/(TP + FP), TP/(TP + FN) and 2*TP/(2*TP + FN + FP), respectively.
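
The metric definitions above translate directly into code; the TP/FP/FN counts below are invented for illustration, not results from the paper.

```python
# Precision, Recall and F1 exactly as defined in the Methods:
# Precision = TP/(TP+FP), Recall = TP/(TP+FN), F1 = 2*TP/(2*TP+FN+FP).

def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def f1_score(tp, fp, fn):
    return 2 * tp / (2 * tp + fn + fp)

# Invented counts on the scale of a whole-genome SNP callset.
tp, fp, fn = 3_800_000, 40_000, 60_000
print(round(precision(tp, fp), 4))  # 0.9896
print(round(recall(tp, fn), 4))     # 0.9845
print(round(f1_score(tp, fp, fn), 4))
```

Note that F1 is the harmonic mean of precision and recall, so it penalizes a pipeline that trades one metric heavily for the other.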

Definition of genome features for stratification analysis

Different types of genome contexts and biological features were applied in the stratification analysis [33]: (1) low-complexity regions: ‘*_merged_slop5.bed.gz’, defined by the Global Alliance for Genomics and Health (GA4GH) Benchmarking Team; (2) GC-content intervals: ‘*_slop50.bed.gz’, defined by the GA4GH Benchmarking Team; (3) coding/conserved regions: ‘refseq_uion_cds.sort.bed.gz’, defined by the GA4GH Benchmarking Team, was used for the simulated data analysis, while ‘func.37m.bed.gz’, as defined in the CHM-eval kit, was used for the ‘synthetic-diploid’ data analysis; (4) B allele frequency: calculated using the AD fields in the VCF file, which record the read coverage for the reference and alternative alleles. In addition, we down-sampled raw reads in the real (NA12878_PrecisionFDA and NA12878_SRR6794144) and simulated data using the tool seqtk v1.3 [40], generating read files at 10× and 20× sequencing depth for benchmarking comparisons.
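
The B allele frequency computation from the AD field can be sketched as follows; the FORMAT string and sample values below are invented for illustration.

```python
# Sketch of B allele frequency (BAF) from a VCF AD field: AD records the
# read depths for the reference and alternative alleles, and BAF is the
# fraction of reads supporting the alternative allele.

def b_allele_frequency(ad_field):
    """ad_field is the AD value, e.g. '18,22' (ref depth, alt depth)."""
    ref_depth, alt_depth = (int(x) for x in ad_field.split(",")[:2])
    total = ref_depth + alt_depth
    return alt_depth / total if total else float("nan")

# Minimal parsing of one VCF sample column with FORMAT 'GT:AD:DP'.
fmt, sample = "GT:AD:DP", "0/1:18,22:40"
fields = dict(zip(fmt.split(":"), sample.split(":")))
print(b_allele_frequency(fields["AD"]))  # 0.55
```

For a well-behaved heterozygous diploid call, BAF should sit near 0.5, which is why it is a useful stratification feature.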


Phenotype-SNP associations were extracted from GWAS data in the NHGRI GWAS Catalog (15), which contains manually curated entries of published GWAS in which SNPs were associated with diseases, phenotypes and traits. Unless otherwise stated, we used the version downloaded on 9 September 2015. Gene symbols were taken from Genenames (16), while the genomic locations of SNPs and genes were taken from the UCSC genome browser (17). Biological pathways and their associated genes were taken from the KEGG pathway database (release 53) (18) and from ConsensusPathDB (CPDB) (19); from CPDB we took only KEGG pathways. Genomic indels were taken from DGV (20), a database of genomic structural variants (SVs). These were used to analyse whether more indels fall in phenotype-associated SNP-gene regions. For the analysis of indels around SG regions (see below) we used the GWAS Catalog version downloaded in November 2011, with merged GWAS entries of the same phenotype as in (14).

Association of phenotypes to pathways

We define a SNP as a ‘phenotype-associated SNP’ if it is associated with a phenotype in the NHGRI GWAS Catalog (15). To determine whether a pathway is significantly associated with a phenotype, we assess whether the genes of that phenotype fall within that pathway significantly more often than expected by chance. The next paragraph describes the background model used to determine the number of genes expected to cluster into a pathway by chance.

Assessing significance of phenotype-pathway associations

Assessment of the significance of the association between a phenotype and a pathway at a given distance cutoff x (e.g. 10 Kbps, 200 Kbps), hereby referred to as the ‘cutoff’, is done as in (14), with a slight variation regarding the background model. Briefly, for each phenotype with s SNPs associated with it according to GWAS, the number of phenotype-associated genes, denoted g, was recorded. For a distance cutoff of x, g is the number of genes that are less than x bps from any of the s SNPs. We also recorded how many of these g genes fall into the same pathway. SNPs from GWAS have more genes in their vicinity than SNPs in general (Figure 1); to account for that, the expected number by chance was assessed by repeatedly picking s random SNPs from GWAS, mapping them to the genes that are less than x bps away, and recording how many of these genes fall into the same pathway (note that in (14) genes, not SNPs, were chosen randomly). For each phenotype-pathway pair, this was repeated 1000 times. A phenotype is said to be significantly associated with a pathway with P-value <0.001 if fewer than a fraction of 0.001 of these random resamplings resulted in an equal or greater number of genes clustering into the pathway.
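
The resampling scheme can be sketched as below, assuming for simplicity that all SNP and gene coordinates lie on a single chromosome (real data would need per-chromosome handling); all names and data structures are hypothetical.

```python
# Hedged sketch of the permutation test described above: draw s random SNPs
# from the GWAS catalog, map each draw to genes within the cutoff, and count
# how often the random draws place at least as many genes in the pathway as
# the observed phenotype SNPs do.
import random

def genes_near(snps, gene_positions, cutoff):
    """Genes lying within `cutoff` bp of any SNP position in `snps`."""
    return {g for g, pos in gene_positions.items()
            if any(abs(pos - s) <= cutoff for s in snps)}

def permutation_p(observed_snps, all_gwas_snps, gene_positions,
                  pathway_genes, cutoff, n_iter=1000, seed=0):
    rng = random.Random(seed)
    observed = len(genes_near(observed_snps, gene_positions, cutoff)
                   & pathway_genes)
    hits = 0
    for _ in range(n_iter):
        draw = rng.sample(all_gwas_snps, len(observed_snps))
        rand = len(genes_near(draw, gene_positions, cutoff) & pathway_genes)
        if rand >= observed:
            hits += 1
    return hits / n_iter   # empirical P-value
```

With 1000 iterations the smallest resolvable P-value is 0.001, matching the significance threshold quoted in the text.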

Distribution of SNPs according to their proximity to the nearest gene. Bars depict the percentage of SNPs that have a gene within a certain distance from them. Blue bars represent all known SNPs; red bars represent only SNPs that were found by GWAS to be associated with phenotypes. The X-axis represents distance from the SNPs; the Y-axis represents the percentage of SNPs that have a gene within that distance.


The random model should test whether genes that are close to phenotype-associated SNPs cluster into pathways more than expected by chance. However, it should take into account that neighboring genes on the chromosome might cluster into the same pathway regardless of the SNPs. We define a ‘segment’ as a stretch of contiguous base pairs encompassing one or more SNPs and the DNA around them up to a given distance cutoff. For example, consider a phenotype with three associated SNPs at chromosomal locations 9000, 35,000 and 40,000 on chromosome 3. Using a distance cutoff of 10 Kbps, we extend a segment 10 Kbps in each direction around each of these SNPs. In practice, we end up with two segments, the first at 0–19,000 and the second at 25,000–50,000. Note that given the proximity of the first SNP to the start of the chromosome, the effective size of the segment around it is 19 Kbps rather than 20 Kbps. Given the proximity of the other two SNPs to each other, their segments partially overlap to yield one joint segment of 25 Kbps rather than two separate segments of 20 Kbps each. Thus, these three SNPs highlight two chromosomal segments, one of 19 Kbps and one of 25 Kbps. To generate a random model we now select two segments of the same sizes as the two segments around the SNPs. To avoid biases, we restrict our selection to segments that surround reported SNPs. In particular, we first randomly choose two segments that are centered on a SNP, one 19 Kbps long and one 25 Kbps long. Next, to account for the original number of SNPs, the second segment, which originally contained two SNPs, is assigned two arbitrary ‘SNPs’ distributed equally along the segment such that, when the cutoff is applied, their combined segment spans 25 Kbps. As described in ( 14), the framework accounts for multiple testing.
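The segment construction described above — extend each SNP by the cutoff in both directions, clip at chromosome boundaries, and merge overlapping windows — is standard interval merging. A minimal sketch, with function and argument names invented for illustration:

```python
def snp_segments(positions, cutoff, chrom_start=0, chrom_end=None):
    """Merge the +/- cutoff windows around SNP positions into segments,
    clipping at the chromosome boundaries (chrom_end is optional)."""
    intervals = []
    for pos in sorted(positions):
        lo = max(chrom_start, pos - cutoff)
        hi = pos + cutoff if chrom_end is None else min(chrom_end, pos + cutoff)
        if intervals and lo <= intervals[-1][1]:
            # window overlaps the previous segment: extend it
            intervals[-1][1] = max(intervals[-1][1], hi)
        else:
            intervals.append([lo, hi])
    return [tuple(iv) for iv in intervals]
```

Running it on the worked example reproduces the two segments: `snp_segments([9000, 35000, 40000], 10000)` yields `[(0, 19000), (25000, 50000)]`.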
Briefly, let n be the number of phenotypes with associated SNPs that were found, using the resampling procedure above, to be significantly associated with pathways. Since the P-value for each phenotype was evaluated separately, one needs to assess the P-value of the overall result across all phenotypes. To this end, for each of the n phenotypes, a pseudo-phenotype was created by randomly picking segments, as described above, matching the original phenotype segments in number and length (note that in ( 14) pseudo-phenotypes were made by randomly choosing genes, not segments). The resampling procedure above is then repeated for each of these n sets to determine whether the pseudo-phenotype passes the significance assessment described above. The number of ‘significant’ phenotype-pathway associations for the pseudo-phenotypes is recorded. This is repeated 100 times, yielding a P-value for obtaining a certain number of significant phenotype-pathway associations across all phenotypes. The red bars/line in Figure 2 represent the median of these resampling procedures; error bars on the red bars/line represent standard deviations.

Associations based on mapping a SNP to genes at different intervals. (A) Number of pathways significantly associated with phenotypes if we map a SNP to all genes within a certain distance, in non-cumulative distance cutoffs (e.g. genes that are between 0–100 Kbps are not considered for the 100–200 Kbps interval, etc.). Red bars represent the number of associations expected by chance (median of 100 random resampling repetitions, see Methods). (B) Number of pathways significantly associated with phenotypes when for each distance cutoff genes are considered cumulatively (all genes between the SNPs and the distance cutoff are considered). Red line represents the number expected by chance, as above.


Assessing relationships between SNPs and insertions/deletions

Defining SNP-gene regions

We define a SNP-gene (SG) region as the chromosome area between a gene and a SNP to which it is assigned. We use this definition to explore whether more indels tend to occur inside phenotype-associated SG regions than non-associated SG regions.

Mapping indels to SG regions

To assess whether there is a relationship between SNPs and indels, we tested whether indels extracted from the DGV database reside in regions that lie between phenotype-associated SNPs and linked genes (i.e. genes that contribute to a significant phenotype-pathway association). We defined two types of genomic regions that lie between a SNP and a gene (SG regions). A linked SG region lies between a phenotype-associated SNP and a gene that falls within a pathway significantly associated with that phenotype. In a non-linked SG region, the gene does not fall within a pathway significantly associated with the phenotype. Finally, non-SG regions are all regions that are not between a SNP and a gene. We compared the number of indels found in these three types of genomic regions.

Note that we cross-referenced the locations of all indels with the linked SG group, as well as with the two other groups, in order to calculate the number of indels per group. Since the groups vary in size, i.e. the number of regions and their lengths differ between groups, we normalized to the number of indels per nucleotide. For example, when considering SG regions of 0.5–1 Mbps, we took all the linked SNP-gene pairs that are more than 0.5 Mbps but less than 1 Mbps apart. We then summed the length of all these regions in nucleotides. Next, we took all the known indels from DGV that fall within any of these regions and summed their cumulative length. Finally, we divided the total length of the indels by the total length of the regions. The resulting number is the average number of distinct indels in which each nucleotide in the region appears. This was repeated for all region sizes and for all region types. Note that, currently, each position in the human genome appears, on average, in roughly two known indels.
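The normalization amounts to dividing the summed length of the overlapping indels by the summed length of the regions. A toy sketch, assuming half-open `(start, end)` intervals and ignoring the chromosome bookkeeping a real implementation would need:

```python
def indels_per_nucleotide(regions, indels):
    """Average number of known indels covering each nucleotide of the
    regions: total overlapping indel length / total region length.
    regions, indels: lists of (start, end) half-open intervals."""
    region_len = sum(e - s for s, e in regions)
    covered = 0
    for rs, re in regions:
        for is_, ie in indels:
            overlap = min(re, ie) - max(rs, is_)
            if overlap > 0:
                covered += overlap   # count only the part inside the region
    return covered / region_len
```

For instance, a 100-bp region fully covered by one indel and half covered by another yields 1.5 indels per nucleotide.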

To calculate the significance of the number of indels in the group of linked SG regions, we employed random testing against each of our control groups. That is, we merged all the regions of a certain size, regardless of whether they came from linked SG regions or from a control. In each random run, two sets of 100 regions were randomly selected from the merged group, the number of indels per nucleotide was calculated for each set, and the difference between the two sets was recorded. This was done 1000 times for each control group.
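The random test can be sketched as below. As a simplification of the pooled per-nucleotide calculation described above, the sketch assumes a per-region indel-per-nucleotide rate has already been computed for every region; the function then builds the null distribution of differences between two random sets.

```python
import random

def null_differences(pooled_rates, set_size=100, n_runs=1000, seed=0):
    """Null distribution of the difference in indels per nucleotide
    between two random sets of regions drawn from the pooled group.
    pooled_rates: per-region indel-per-nucleotide values (precomputed)."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_runs):
        picked = rng.sample(pooled_rates, 2 * set_size)  # two disjoint sets
        a = sum(picked[:set_size]) / set_size
        b = sum(picked[set_size:]) / set_size
        diffs.append(a - b)
    return diffs
```

The observed linked-vs-control difference can then be compared against this distribution to obtain an empirical P-value.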

Genetic marker


Genetic marker, any alteration in a sequence of nucleic acids or other genetic trait that can be readily detected and used to identify individuals, populations, or species or to identify genes involved in inherited disease. Genetic markers consist primarily of polymorphisms, which are discontinuous genetic variations that divide individuals of a population into distinct forms (e.g., AB versus ABO blood type or blond hair versus red hair). Genetic markers play a key role in genetic mapping, specifically in identifying the positions of different alleles that are located close to one another on the same chromosome and tend to be inherited together. Such linkage groups can be used to identify unknown genes that influence disease risk. Technological advances, especially in DNA sequencing, have greatly increased the catalogue of variable sites in the human genome.

Multiple types of polymorphisms serve as genetic markers, including single nucleotide polymorphisms (SNPs), simple sequence length polymorphisms (SSLPs), and restriction fragment length polymorphisms (RFLPs). SSLPs include repeat-sequence variations known as minisatellites (variable number of tandem repeats, or VNTRs) and microsatellites (short tandem repeats, or STRs). Insertions/deletions (indels) are another example of a genetic marker.

In the human genome, the most common types of markers are SNPs, STRs, and indels. SNPs affect only one of the basic building blocks—adenine (A), guanine (G), thymine (T), or cytosine (C)—in a DNA segment. For example, at a genomic location with the sequence ACCTGA in most individuals, some persons may carry ACGTGA instead. The third position in this example would be considered a SNP, since either a C or a G allele can occur at the variable position. Because every individual inherits one copy of DNA from each parent, every person has two homologous copies of DNA. As a result, in the above example, three genotypes are possible: homozygous CC (two copies of the C allele at the variable position), heterozygous CG (one C and one G allele), and homozygous GG (two G alleles). The three genotype groups can be used as “exposure” categories to assess associations with an outcome of interest in a genetic epidemiology setting. Should such an association be identified, researchers may investigate the marked genomic region further to identify the particular DNA sequence in that region that has a direct biological effect on the outcome of interest.

A comparative assessment of SNP and microsatellite markers for assigning parentage in a socially monogamous bird

Single-nucleotide polymorphisms (SNPs) are preferred over microsatellite markers in many evolutionary studies, but have only recently been applied to studies of parentage. Evaluations of SNPs and microsatellites for assigning parentage have mostly focused on special cases that require a relatively large number of heterozygous loci, such as species with low genetic diversity or with complex social structures. We developed 120 SNP markers from a transcriptome assembled using RNA-sequencing of a songbird with the most common avian mating system—social monogamy. We compared the effectiveness of 97 novel SNPs and six previously described microsatellites for assigning paternity in the black-throated blue warbler, Setophaga caerulescens. We show that the full panel of 97 SNPs (mean Ho = 0.19) was as powerful for assigning paternity as the panel of multiallelic microsatellites (mean Ho = 0.86). Paternity assignments using the two marker types were in agreement for 92% of the offspring. Filtering individual samples by a 50% call rate and SNPs by a 75% call rate maximized the number of offspring assigned with 95% confidence using SNPs. We also found that the 40 most heterozygous SNPs (mean Ho = 0.37) had similar power to assign paternity as the full panel of 97 SNPs. These findings demonstrate that a relatively small number of variable SNPs can be effective for parentage analyses in a socially monogamous species. We suggest that the development of SNP markers is advantageous for studies that require high-throughput genotyping or that plan to address a range of ecological and evolutionary questions.

Table S1 List of forward, reverse, and extend primer sequences for 97 SNPs.

Table S2 Characteristics of six microsatellite markers used for parentage analyses.

Table S3 Input parameters for parentage analyses.

Table S4 Number of observed and expected Cervus paternity assignments for all microsatellite and SNP panels.



We have demonstrated how our approach can be consistently applied in different contexts, with timing information, with spatial data in the case of British Columbia, and with resistance data in the case of Moldova. This is an advance on what is possible with the fixed SNP-threshold approach, where there is no general way to adjust thresholds to take this context-specific information into account.

A fixed number of SNPs can arise from different numbers of transmissions depending on other factors, including the timing of transmission, selection for resistance, the substitution process, location, and factors we have not explicitly modeled (social contacts, host risk factors, pathogen factors). We have seen that sampled cases which are relatively close in genetic distance can nevertheless be separated by large distances in time. In this scenario, a simple SNP cut-off may place samples too close together for outbreak clustering purposes. In contrast, our new method is robust with respect to outlying cases which have been sampled at very different times compared with the majority of cases. These cases can make inference of timed phylogenetic trees challenging because the low genetic variation is hard to reconcile with the large time distance. Furthermore, true transmission clusters need not be clades in phylogenetic trees, because one cluster could descend from another but be separated by a long time or a large genetic distance (due to sampling effects). Accordingly, the clusters obtained by our method do not necessarily correspond to phylogenetic clades. We briefly discuss the application of our method to timed phylogenetic trees in the supplementary data, Supplementary Material online, with an example cluster which is not a clade.

Our probabilistic transmission method has certain advantages. It is relatively simple, requiring only the implementation of fast-running algorithms to estimate the time distributions; the heavy machinery of large simulation methodologies (like MCMC) is not required. The amount of information required for the model is limited, consisting of as little as the SNP distances, the timing data, and knowledge of the substitution and transmission processes. Nevertheless, it has the flexibility to handle SNPs under selection, SNPs with a different substitution process, and variability in the substitution and transmission processes, and it has scope for extensions to include more epidemiological data. Even in data sets where there is not much timing information to work with, we have seen that integrating information on resistance-conferring sites can be used within our framework to fine-tune the clustering. Using two distinct processes—transmission, and the accumulation of measurable genetic variation—to define clusters carries the advantage that these processes may be estimable from data. This enables transmission clusters to be formed based on focused discussion and estimation of measurable processes rather than on fixed cut-offs, and it allows ready adaptation to new pipelines that detect variation.
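To make the flavor of such a method concrete, here is a deliberately simplified sketch: a Poisson clock for SNP accumulation and Poisson transmission events, with a uniform grid prior over the unknown elapsed time standing in for the paper's timing model. All rates, priors, and names here are assumptions for illustration, not the published model.

```python
import math

def poisson_pmf(k, mean):
    """P(N = k) for N ~ Poisson(mean)."""
    return math.exp(-mean) * mean ** k / math.factorial(k)

def p_at_most_k_transmissions(snp_distance, clock_rate, beta, k_max, t_grid):
    """Probability of <= k_max transmission events between two sampled
    cases, averaging over the unknown total elapsed time t.  The SNP
    distance enters through the Poisson clock likelihood, which weights
    each candidate t on a uniform grid (a stand-in for a real prior)."""
    # weight each candidate time by the likelihood of the SNP distance
    weights = [poisson_pmf(snp_distance, clock_rate * t) for t in t_grid]
    total = sum(weights)
    post = [w / total for w in weights]
    # average the Poisson CDF of transmission counts over those weights
    prob = 0.0
    for p_t, t in zip(post, t_grid):
        cdf = sum(poisson_pmf(j, beta * t) for j in range(k_max + 1))
        prob += p_t * cdf
    return prob
```

Thresholding this probability (rather than the raw SNP count) is what allows timing and rate information to shift clustering decisions for the same genetic distance.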

There are some limitations. Prior knowledge of the substitution and transmission processes is required, and there is some uncertainty in choosing appropriate values. However, the model is typically robust with respect to changes in these variables; in particular, varying the transmission rate does not have a material impact on the clustering because a rescaling of the cut-off will compensate. The choice of a time-varying transmission function β ( t ) is, however, likely to have an impact on results. In particular, we would expect a low probability of very quick transmission—as the pathogen numbers are building up in a new host—to have a significant impact compared with the use of a constant transmission rate, as would a rate that is fast early on and diminishes to a much lower rate later. Note also that the parameter t in our model represents the total time from infection to both sample dates: we are not modeling the variation of transmission rates in calendar time.

In some diseases, such as TB, there is considerable variation in the latency period, during which the mutation rate may be lower than during active disease. This variability can be incorporated into the negative binomial model as expressed in equation (14). We do not explicitly model within-host diversity, though this is relevant to identifying direct transmission events ( Didelot et al. 2014, 2017; Worby et al. 2014; Hall et al. 2015, 2016). Cases of direct transmission will be clustered together with high probability in our method, despite slight inaccuracy in the timing due to both branches of the pair’s two-case tree spending time in the same host. Pairs of cases for which the clustering decision is ambiguous are likely to have several intermediate cases between them, with a larger tree height, and so the contribution of in-host diversity in either sampled case will be small. In-host diversity in unsampled cases would not affect our estimates unless it contributed to changes in the molecular clock rate.

WGS data has been noted to be helpful in ruling out transmission but insufficient, on its own, to resolve transmission events ( Casali et al. 2016; Campbell et al. 2018). If the primary use of WGS data is only to refute transmission, one might ask why clustering matters. We would argue that the transmissions that are not refuted by WGS are then presumably considered to be possible recent, direct, or clustered transmissions. Even if the primary use of WGS data is to refute direct transmission, there is a trade-off between the strength of that refutation and the possibility of mistakenly refuting genuine recent transmission events. This is more likely, using SNP cut-offs, where selection (say for antibiotic resistance) has led to higher SNP differences than expected. In addition, in practice WGS data are not only used to refute direct transmission, but to produce clusters that inform onward analyses, reports on the extent of recent transmission, outbreak analysis and reconstruction, and even public health policy; see Guthrie et al. (2018) for one example.

We have accommodated the possibility of low substitution rates in latency with a non-Poisson model for the clock process, λ, in equation (5) (though we have not implemented this), and to some extent with the option of a nonconstant transmission rate. However, we have not modeled the possibility of a direct relationship between low SNP accumulation and low-probability transmission. If this relationship exists—for example, if latent cases both do not transmit and do not accumulate SNPs ( Colangeli et al. 2014)—then low SNP differences could correspond to fewer intermediate hosts despite long elapsed times. This is an implicit assumption of a SNP-only method; although it may be correct, it is a strong assumption, and there is evidence that mutation rates in latency are not reduced compared with active disease ( Ford et al. 2011; Lillebaek et al. 2016).

We have not used the probability of sampling in forming our clusters, in contrast to other tools including the vimes package ( Jombart and Cori 2017). For example, if it is known that surveillance is strong, then it would be less likely for 10 intermediate cases to be unsampled than for 5 intermediate cases, and this could be built into a clustering method. Our rationale for not taking this into account is to provide a clustering approach that is as parallel as possible to the SNP cut-offs currently in widespread use while taking additional information on timing, molecular evolution, and transmission into account. It is often the case that the true sampling rate is not known and may change over time, and—particularly for TB in high-resource settings—cases can be missed because they are hard to identify (perhaps being at higher risk of TB due to homelessness or other factors, as in Casali et al. [2016]). In many settings the sampling probability may be uncertain. We have taken the approach of defining the clusters themselves without explicit reference to the sampling probability, with the view that the clusters are central inputs to other analyses which will take sampling into account (as is done, for example, in TransPhylo [ Didelot et al. 2017]). However, in our approach, changes in the sampling probability would likely be apparent in changes in the temporal and genetic distance between cases over time.

We have also not modeled changes in the transmission process over time in a community (e.g. due to depletion of susceptible individuals, improved infection control, etc.). As with sampling, this may best be done in a more nuanced analysis after the initial clustering rather than as part of the clustering itself, but in principle, changes to the transmission function over calendar time could be incorporated into the mathematics behind equation (8). However, this would raise interpretation challenges, because our transmission process reflects the rate of the pathogen moving between hosts where it is known that there is an infected host at the “end” of the chain (since each pair consists of two sampled hosts, whose pathogen was sequenced and who were therefore certainly infected). We do not model the number of contacts over which transmission could have occurred.

The choice of a particular SNP cut-off also takes no account of the inevitable uncertainties involved in the gathering and processing of raw read data, and does not allow for the modeling of this uncertainty. Different bioinformatics pipelines—and different parameters used within those pipelines—can have a substantial effect on the number of SNP differences reported between cases. It is usual for SNP differences to be taken as given and, although details are sometimes provided—see for example Katz et al. (2013)—it is important to recognize that there can be considerable variation between SNPs reported using different pipelines and parameters. For example, the quality-score and read-depth cut-offs used will generally have a high impact, as will the precise way in which hypervariable sites and repeat regions are handled (or excluded). As technology improves we may begin to capture variation in repeat regions, or types of variation (e.g. insertions/deletions) that are currently masked, and under such a new pipeline 12 SNPs may not carry the interpretation it does today. The model could easily incorporate more genomic information, resulting in a more sophisticated version of the distance function. In particular, large-scale genomic features can readily help to establish that cases belong to separate, and therefore distantly related, lineages. As variation-calling pipelines evolve, our method could be used to relate each pipeline to a number of transmissions or to an estimated divergence time; this would form an approach to compare bioinformatics pipelines and data sources, and to curate their use in defining distances between isolates.

TB has distinct phylogeographic lineages which have been reported to have different mutation rates, with lineage 2 (the East Asian and Beijing lineage) having higher mutation rates than lineage 4 (Euro-American) ( Ford et al. 2013). Our approach could unify clustering despite such differences, as the same transmission and probability settings could be used under different SNP accumulation rates. This would provide a consistent approach to clustering in areas where multiple lineages cocirculate, and allow comparison of TB clustering patterns in different settings. The same would be true for adapting to differing natural histories across different pathogen lineages or subpopulations: the choice of β could reflect transmission differences while the other settings remained the same.

The long-term aim of changing how cases are assigned to clusters is to improve the way that WGS and epidemiological data are used and to best capture clusters that correspond to transmissions of an infectious disease. We have found that basing clusters on the number of transmission events, with a probabilistic cut-off, is feasible, can integrate timing and other data, and compares favorably to clustering based on SNP cut-offs.

Scientists find single letter of genetic code that makes African Salmonella so dangerous

Scientists at the University of Liverpool have identified a single genetic change in Salmonella that is playing a key role in the devastating epidemic of bloodstream infections currently killing around 400,000 people each year in sub-Saharan Africa.

Invasive non-typhoidal Salmonellosis (iNTS) occurs when Salmonella bacteria, which normally cause gastrointestinal illness, enter the bloodstream and spread through the human body. The African iNTS epidemic is caused by a variant of Salmonella Typhimurium (ST313) that is resistant to antibiotics and generally affects individuals with immune systems weakened by malaria or HIV.

In a new study published in PNAS, a team of researchers led by Professor Jay Hinton at the University of Liverpool have identified a specific genetic change, or single-nucleotide polymorphism (SNP), that helps the African Salmonella to survive in the human bloodstream.

Professor Hinton explained: "Pinpointing this single letter of DNA is an exciting breakthrough in our understanding of why African Salmonella causes such a devastating disease, and helps to explain how this dangerous type of Salmonella evolved."

SNPs represent a change of just one letter in the DNA sequence and there are thousands of SNP differences between different types of Salmonella. Until now, it has been hard to link an individual SNP to the ability of bacteria to cause disease.

Using a type of RNA analysis called transcriptomics, the scientists identified SNPs that affected the level of expression of important Salmonella genes. After studying 1000 different SNPs, they found a single nucleotide difference that is unique to the African ST313 strain and causes high expression of a virulence factor called PgtE that prevents Salmonella being killed in the bloodstream.

The scientists then used an advanced genetic technique to switch the SNP found in the African strain to the version found in the type of Salmonella that causes food poisoning and gastroenteritis globally. Finally, they used an animal infection model to show that the bacteria with the altered SNP had lost their ability to cause disease.

Professor Hinton added: "We've developed a new investigative approach to understand bacterial infection, which is the culmination of six years of work. This combination of genomics and transcriptomics could bring new insights to other important pathogens, and prepare us for future epidemics."

Professor Melita Gordon, a University of Liverpool clinician-scientist working in Malawi, who was involved in the project, said: "The ability of iNTS Salmonella strains to cause such serious disease leads to devastating and frequently fatal consequences for very young children, and for adults who may be the chief breadwinners in their homes and communities. We see iNTS disease placing an enormous burden on thinly-stretched local health facilities and hospitals in Malawi, particularly because diagnosis is difficult, and treatment options are limited. It is now urgent that a vaccine is developed to combat this dangerous infection."

The study received funding support from the Wellcome Trust and was carried out in collaboration with the Liverpool School of Tropical Medicine and the University of Birmingham.



MethylToSNP overview

MethylToSNP predicts the location of SNPs affecting Illumina methylation array data. The program takes methylation array data for multiple samples (at least 50 samples recommended) as input and generates a list containing the locations of all potential SNPs in the data set. After a three-tier pattern is identified, postprocessing can be performed with annotation of probes and SNPs (mainly based on the dbSNP database [18]) available in Bioconductor. For instance, sites can be filtered according to their location within the probe or directly on the CpG site, or probes can be stratified as known or potentially novel SNPs. MethylToSNP was created in the R programming language [24] as part of the R Bioconductor ecosystem. The typical workflow is illustrated in Fig. 2a, where the input data may originate from a remote (e.g., GEO) or local source in the format of raw array signal or already preprocessed methylation values. MethylToSNP will accept user input in the format of beta-values or, preferably, in the format generated by the Bioconductor package minfi. The latter is preferred because the minfi data format incorporates genomic mapping and SNP annotation of array probes.

Three-tier pattern with gaps

To detect a position where methylation values are affected by a SNP either at the target CpG or at a neighboring position [5], the methylation data must be separated into discrete tiers by two gaps of similar width, where these gaps contribute the majority of the total data range (Fig. 3). The algorithm clusters methylation data into three clusters, favoring clusters located farther away from each other, optionally disregards outliers, and then evaluates the gaps between clusters.

Because clustering of beta-values is a one-dimensional problem and the number of clusters is low, it can be solved optimally with a dynamic-programming k-means implementation rather than with a randomly initialized k-means algorithm, which is not guaranteed to converge to the optimum. We relied on the implementation in the R package Ckmeans.1d.dp [25].

Larger clusters will naturally have higher weight than clusters consisting of only a few data points. If untreated, this could lead to the detection of multiple clusters in highly populated data ranges (e.g., beta-values 0.7–0.9), whereas we are in fact interested in detecting large and small clusters across the whole span of beta-values. Therefore, for each quantile q of the N samples, we used clustering weights inversely proportional to the local density of data points, i.e., the inverse quantile density.

Additional file 1: Figure S3 illustrates the effect of inverse quantile weighting on the YRI beta-values at cg21226234 probe.

The gap between two clusters can be defined as the difference in methylation levels between the bordering samples of each cluster. For instance, for clusters A and B such that every value a in A is greater than every value b in B, the gap is d(A, B) = min(A) − max(B).

After gaps are identified, a subsequent step assesses the size of the data-free gaps at each methylation site using two adjustable cutoffs: the gap_sum_ratio value and the gap_ratio value. The gap_sum_ratio criterion evaluates the total gap size by summing the sizes of the gaps and testing whether they represent a majority of the β-value range. By contrast, the gap_ratio criterion compares the sizes of the two largest gap regions and tests whether their relative sizes are roughly equivalent: to pass this threshold, the size of the smaller gap must be at least a certain fraction of the larger gap. For example, if gap_ratio is set to 0.75 and the larger gap spans 0.3 β-value, the smaller gap must span at least 0.225 β-value. For the algorithm to identify possible SNP locations, the thresholds for both gap_sum_ratio and gap_ratio must be met. This method allows for variability in the methylation values, while still requiring the gaps to cover a majority of the whitespace, even when the β-value range is compressed away from the upper and lower boundaries of 1.0 and 0, respectively. Additionally, we benefit by avoiding a fixed cutoff to separate methylation values into levels, such as thirds or quadrants; as shown in Fig. 3b, it is typically impossible to define fixed cutoffs that would work for all probes.

Considering the two gaps between the three clusters "high", "mid" and "low", \( d_{\text{high-mid}} \) and \( d_{\text{mid-low}} \), the threshold parameters gap.ratio and gap.sum.ratio for the algorithm are defined as:

\[ \text{gap.ratio} = \frac{\min(d_{\text{high-mid}},\ d_{\text{mid-low}})}{\max(d_{\text{high-mid}},\ d_{\text{mid-low}})}, \qquad \text{gap.sum.ratio} = \frac{d_{\text{high-mid}} + d_{\text{mid-low}}}{\max(\text{high}) - \min(\text{low})} \]

where the denominator is the total range of beta-values across all three clusters.
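The two statistics can be computed directly from the cluster memberships; the sketch below follows the definitions in the text, with `high`, `mid`, and `low` as lists of beta-values for the three clusters (function and variable names are illustrative, not the package's API):

```python
def gap_thresholds(high, mid, low):
    """Compute the gap.sum.ratio and gap.ratio statistics for three
    methylation clusters ordered high > mid > low (a sketch)."""
    d_high_mid = min(high) - max(mid)   # gap between "high" and "mid"
    d_mid_low = min(mid) - max(low)     # gap between "mid" and "low"
    total_range = max(high) - min(low)  # denominator: full beta-value range
    gap_sum_ratio = (d_high_mid + d_mid_low) / total_range
    gap_ratio = min(d_high_mid, d_mid_low) / max(d_high_mid, d_mid_low)
    return gap_sum_ratio, gap_ratio

# A probe is flagged only when BOTH thresholds are met, e.g. with the
# defaults: gap_sum_ratio >= 0.50 and gap_ratio >= 0.75
```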

Calibrating default MethylToSNP parameters

First, two simulated data sets were created to test the ability of MethylToSNP to identify SNP-associated methylation patterns when different proportions of samples (i.e., data points) were present at each tier level. The datasets included 95 samples each, to mimic the size of the southern African data set, and circa 10,000 probe loci. In both data sets, half of the probes corresponded to non-SNPs that were drawn from the actual southern African data. The second half of the probes represented SNPs and were generated differently depending on the set: in the "set-frequency" dataset an unequal distribution of methylation values across the tiers was generated, corresponding to a low minor allele frequency (MAF) scenario, whereas in the "uniform-frequency" dataset the methylation values were distributed equally across the tiers, simulating the high MAF scenario characteristic of common SNPs. The procedure is described in more detail in Additional file 1, along with the set frequencies and the code to reproduce the data. We used these simulated datasets to calibrate the default values of the MethylToSNP parameters gap.sum.ratio and gap.ratio. To choose the defaults (gap.sum.ratio = 0.50, gap.ratio = 0.75), the parameters were altered in 0.05 increments (see Additional file 1: Figure S1). With these parameter thresholds, the benchmark returned a 97% true positive rate on the "set-frequency" dataset. The uniformly simulated data set returned a 100% true positive rate. In all cases there were no false positives.

However, the simulated SNP probes had a clear separation between the tiers of methylation values, thus making it difficult to assess the performance in case of presence of noise or other confounding factors.

Therefore, we created a second benchmark to assess false negative rates using the 59 control SNP probes placed by the array designers on the Illumina EPIC arrays. To also demonstrate the approach on the Illumina EPIC, we tested 152 pediatric samples from the GEO GSE137682 dataset, where MethylToSNP with default parameters identified 41 of the 59 positions, a 27% false negative rate (Additional file 1: Figure S2). However, we note that 18 control SNPs were A > G transitions or were located more than 2 bp from the CG position on the array, which we would not expect to find with our first-pass approach. The remaining C > T and T > C (14 and 15, respectively) and G > A (12 total) probes were correctly identified.

The benchmark figures (Additional file 1: Figure S2A, B) showed that the gap.ratio value can be lowered from 0.75 to 0.50 to retrieve more hits. However, the major hindrance to the detection of gap patterns is the presence of noise or otherwise confounded measurements with methylation values between the tiers. To make the method insensitive to such measurements, we implemented an outlier detection option that specifies the allowed within-cluster variance, in standard deviations \( k \). For instance, a sample with beta-value \( \beta \) is an outlier in the cluster \( C \) with cluster center \( \mu_{C} \) and variance \( \sigma_{C}^{2} \) if the following threshold is not satisfied:

\[ |\beta - \mu_{C}| \le k\,\sigma_{C} \]

In case when the outlier filtering option is enabled, any beta-value that belongs to a cluster but does not match the threshold would be excluded from the calculation of gaps between clusters. An additional benchmark run with outlier filtering enabled (Additional file 1: Figure S2D, E) showed that this option completely rescued retrieval, with zero false negatives, even in complicated cases.
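The filtering step can be sketched as below; the function name and the default number of standard deviations are assumptions for illustration, not the package's actual option name or default:

```python
import statistics

def filter_outliers(cluster, k=1.0):
    """Drop beta-values farther than k standard deviations from the
    cluster mean, so they are excluded from gap calculations (a sketch)."""
    mu = statistics.fmean(cluster)      # cluster center
    sigma = statistics.pstdev(cluster)  # within-cluster standard deviation
    if sigma == 0:
        return list(cluster)            # degenerate cluster: nothing to filter
    return [b for b in cluster if abs(b - mu) <= k * sigma]
```

A stray mid-range measurement attached to a high cluster is removed before the gap between clusters is measured, which is what rescues retrieval in the noisy benchmark cases.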

We encourage users to use our benchmarks as a guidance for changing the default parameter values. Alternatively, users can recalibrate the thresholds using their own predefined control probes, for instance known SNPs, or simulated datasets.

Size of the dataset required for the analysis

The algorithm relies on the identification of three clusters, therefore the absolute minimum number of samples required for the analysis is three. However, the SNP patterns may only be detectable with larger datasets, particularly for rare alleles. While low MAF SNPs set the upper detection boundary, we wanted to calibrate the lower boundary, i.e., the minimal recommended number of samples for the analysis based on common SNPs with MAF close to 0.50. We used the false negative detection rate of SNP control probes for the 152 pediatric samples from the GEO GSE137682 dataset as a benchmark (Additional file 1: Figure S2C). The plot shows how many true SNP probes are retrieved when subsampling without replacement from 5 to 150 data points out of 152, with a step of 5 and 30 replicates. Saturation is reached at about 50 samples (i.e., data points). Removal of outliers improves the overall retrieval; however, it does not affect the lower boundary on the number of samples required to find the three-tier methylation pattern (Additional file 1: Figure S2F). Based on this benchmark, we therefore recommend that the size of the datasets analyzed with MethylToSNP should not be smaller than 50 samples. The program will run with 3 or more samples but will print a warning message if the supplied data are insufficient for reliable detection of SNPs.
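The subsampling procedure described above can be sketched as follows; `detect` stands in for a run of the SNP caller on a subset and is a hypothetical callable, as are the function and parameter names:

```python
import random

def subsample_benchmark(samples, detect, sizes=range(5, 151, 5),
                        replicates=30, seed=0):
    """Saturation benchmark sketch: for each subsample size, draw random
    subsets without replacement and record the fraction of replicates in
    which `detect` (user-supplied) retrieves the SNP pattern."""
    rng = random.Random(seed)
    rates = {}
    for n in sizes:
        hits = sum(bool(detect(rng.sample(samples, n))) for _ in range(replicates))
        rates[n] = hits / replicates
    return rates
```

Plotting `rates` against `sizes` reveals the saturation point; in the paper's benchmark this occurred at about 50 samples.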

SNP-reliability score and thresholds

MethylToSNP quantitatively assesses how closely the observed methylation pattern resembles the expected meC > T SNP by providing a reliability score. In general, the majority of sites that MethylToSNP identifies are meC > T SNPs, or neighboring sites affecting the probe. In these cases, C is the major allele and is consistently methylated. When replaced by a T allele, a false signal of differential methylation appears. By contrast, an unmethylated C major allele will give the same methylation value as a T allele. The reliability score \( R \) represents a weighted measure based on the appearance of the data points for a given probe in the three β-value tiers, defined as "high" (> 0.75), "low" (< 0.25) and "middle" (between 0.25 and 0.75), with the number of samples in each tier represented as \( N_{\text{high}}, N_{\text{low}}, N_{\text{mid}} \), respectively:

\[ R = \frac{N_{\text{high}} + 0.5\,N_{\text{mid}}}{N_{\text{high}} + N_{\text{mid}} + N_{\text{low}}} \]

If methylation values fall into fewer than three tiers, a reliability score of 0 is assigned.
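The scoring can be sketched as follows. The weighting formula is reconstructed from the properties stated in the text (a score of 0.50 when samples are even across all three tiers, 0.75 when they concentrate in the top two), so it should be read as an interpretation rather than the package's verbatim code:

```python
def reliability_score(betas, high_cut=0.75, low_cut=0.25):
    """Reliability score R: "high"-tier samples count fully, "middle"-tier
    samples count half, "low"-tier samples count zero (a sketch)."""
    n_high = sum(b > high_cut for b in betas)
    n_low = sum(b < low_cut for b in betas)
    n_mid = len(betas) - n_high - n_low
    if 0 in (n_high, n_mid, n_low):
        return 0.0  # fewer than three occupied tiers -> score of 0
    return (n_high + 0.5 * n_mid) / len(betas)
```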

We apply this stringent scoring approach to refine our datasets to those probes spanning the largest beta-value range, i.e., with the variant at the target CpG or the second position, as these locations have the greatest potential to impact the p values calculated for differential methylation between comparison groups.

To assess the reliability threshold necessary for calling SNP positions that affect methylation interpretation, we calculated the scores for the simulated benchmark with the two generated datasets (see Additional file 1). For the dataset with predetermined ratios of data points at each tier (which includes SNPs with low MAF) the mean reliability score was 0.568, whereas for SNPs with a uniform distribution of methylation across tiers (corresponding to high MAF) the mean reliability was 0.501 (Table 7). We assigned a threshold of 0.50 to the reliability scores, with approximately 75% of all examples in the more realistic set-frequency dataset passing the threshold. When the data points are distributed mainly between the top two levels, this approach yields a theoretical reliability score of 0.75, whereas 0.50 is the expected value when all samples are evenly distributed across all three levels. Therefore, a higher reliability score represents a greater likelihood of the target site harboring an uncharacterized C to T SNP, consistent with a low-frequency T polymorphism being present and a higher concentration of samples falling within the top two tiers.

YRI HapMap dataset

We next tested MethylToSNP on data from YRI HapMap samples, some of which have both methylation and genotype data available. Methylation data were downloaded from Gene Expression Omnibus (GEO) project GSE26133 [16] for 77 samples, and corresponding genotype data for the available samples were found in the 1000 Genomes Browser [26]. One caveat with the browser data is that there were no genotype data at some methylation sites of interest for the samples that appeared polymorphic. For targeted sequencing, DNA samples were ordered from the Coriell depository and Sanger sequenced. The same samples were also subjected to targeted bisulfite sequencing to verify the methylation levels observed from the Illumina 450K methylation chip analysis.

CEU HapMap dataset

The CEU HapMap dataset comprises another group of well-studied samples, from individuals likely to have very different epigenetic profiles and genetic and life histories from the individuals who contributed to the YRI (i.e., Yoruba in Ibadan, Nigeria) datasets; it includes data from 90 Utah residents with Northern and Western European ancestry. Illumina 27K methylation data from the CEU sample set (GEO project GSE27146 [17]) were subjected to MethylToSNP analysis.

Southern African data analysis

To test MethylToSNP on primary samples, we used an in-house methylation dataset acquired from whole blood collected from people ethno-linguistically self-identifying as either KhoeSan or Bantu of Namibia, as in [27]. Few genomic data exist for these populations; fewer than ten genomes have been fully sequenced to date [21]. These populations harbor the greatest amount of genomic diversity, specifically the earliest diverged human lineage represented by people of KhoeSan ancestry [21], and population-specific SNPs are recorded in dbSNP. Nevertheless, many unidentified SNPs in this group may affect the interpretation of methylation studies, and MethylToSNP may detect them. Also, previously identified polymorphisms may not be present in the samples used in this study. The sample set contained 95 samples: 40 were KhoeSan, 51 were non-KhoeSan or Bantu-speaking southern Africans, and six were geographically matched Namibians of European descent, with two of the European controls run in duplicate for comparison. All samples were run on the Illumina 450K methylation chip (manuscript in preparation). The KhoeSan and control data were used to find sites that were differentially methylated between these two groups. This dataset was broken down into three subsets for analysis: (i) all quality-controlled methylation data from the chip (473,767 sites); (ii) all sites that are differentially methylated between the KhoeSan group and the control group based on Mann–Whitney U tests (p ≤ 0.05) with Bonferroni test correction (q ≤ 0.05; 12,631 sites); (iii) the top 5% of differential methylation sites, ranked by largest magnitude of absolute difference, which are also statistically significant by Mann–Whitney U tests (p ≤ 0.05) with Bonferroni test correction (q ≤ 0.05), with known SNP positions removed (400 sites).

Regions of particular interest: CTCF sites and enhancer elements

We took an in-depth look at enhancer and CTCF sites implicated in differential methylation, where potential novel SNP content could confound methylation analysis. For example, a finding of differential methylation in a CTCF site could inhibit CTCF binding [28], as demonstrated at imprint control regions, such as IGF2 and H19, where allele-specific methylation [29] inhibits binding. A SNP could also inhibit CTCF binding and present as differential methylation, impeding correct biological interpretation. Using the southern African dataset, we investigated how many differential methylation sites address these alternatives. The CTCF site locations were downloaded from the University of California, Santa Cruz Genome Browser [22, 30]. Likewise, sites of differential methylation that overlap known enhancer regions were intersected with our data to determine whether enhancer function could be impacted by the presence of SNPs or differential methylation. Enhancer site locations were downloaded with the Illumina 450K array annotation file and were originally compiled by Illumina from ENCODE projects. In order to maintain consistency of annotations in CTCF site analysis, we also downloaded a 450K array dataset (GEO GSE39672) for YRI and CEU HapMap samples.