How to get Population Genotype Frequency from 1000 genomes Perl API


I posted a similar question on Biostars but got no response (not sure if I'm allowed to link to it). Basically, I want to pull genotype frequency data for a population group (such as CEU), rather than allele frequency data, via the Perl API for 1000 Genomes. I have tabix and the Perl API installed. This is for 100,000+ SNPs, so the solution should ideally not involve downloading the genotypes and calculating the frequencies by hand in a for-loop. My understanding is also that no solution exists in BioMart (according to the Biostars answer).

Following the instructions here, I can see how to get specific genotypes for all individuals as a list.

Eg: Input -> Some snp, CEU… Output -> G/G 0.87, G/A 0.13 (frequency of some snp for CEU population).

I need to do this for 100,000+ SNPs, so I imagine manually pulling all genotypes for each SNP and calculating the frequency manually in a loop would not be practical.
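For reference, the calculation being asked for can be sketched in Python as a single streaming pass over VCF text, which stays practical for 100,000+ SNPs. This is only a sketch under assumptions: the function name, the sample names S1 to S4 and the variant ID rsExample are made up, and the input is assumed to be standard VCF lines (for example the output of tabix -h) with GT as the first FORMAT key.

```python
from collections import Counter

def genotype_frequencies(vcf_lines, wanted_samples):
    """Yield (variant_id, {genotype: frequency}) for the chosen samples."""
    columns = []
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            # Sample names start at the 10th column of the header line.
            columns = [i for i, name in enumerate(fields)
                       if i >= 9 and name in wanted_samples]
            continue
        alleles = [fields[3]] + fields[4].split(",")   # REF first, then ALTs
        counts = Counter()
        for i in columns:
            gt = fields[i].split(":")[0]               # GT is the first FORMAT key
            idx = gt.replace("|", "/").split("/")
            if "." in idx:                             # skip missing calls
                continue
            # Sort allele indexes so that 0|1 and 1|0 count as one genotype.
            counts["/".join(alleles[int(a)] for a in sorted(idx, key=int))] += 1
        total = sum(counts.values())
        freqs = {g: n / total for g, n in counts.items()} if total else {}
        yield fields[2], freqs

# Hypothetical four-sample example (not real 1000 Genomes data).
example = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\tS3\tS4",
    "1\t100\trsExample\tG\tA\t.\tPASS\t.\tGT\t0|0\t0|1\t1|0\t0|0",
]
for variant, freqs in genotype_frequencies(example, {"S1", "S2", "S3", "S4"}):
    print(variant, freqs)   # rsExample {'G/G': 0.5, 'G/A': 0.5}
```

To restrict the counts to one population, pass only that population's sample names (for CEU, the names listed for CEU in the project's panel file).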

If you want population-specific allele frequencies you have three options:

  • For a single variant, you can look at the population genetics page for that variant in our browser. This gives you pie charts and a table for a single site.
  • For a genomic region, you can use our allele frequency calculator tool, which gives a set of allele frequencies for selected populations.
  • If you would like sub-population allele frequencies for a whole file, you are best off using the vcftools command line tool.

This is done using a combination of two vcftools commands, vcf-subset and fill-an-ac.

An example command set using files from our phase 1 release would look like this:

grep CEU integrated_call_samples.20101123.ALL.panel | cut -f1 > CEU.samples.list

vcf-subset -c CEU.samples.list ALL.chr13.integrated_phase1_v3.20101123.snps_indels_svs.genotypes.vcf.gz | fill-an-ac | bgzip -c > CEU.chr13.phase1.vcf.gz

Once you have this file you can calculate your frequency by dividing AC (allele count) by AN (allele number).

Please note that some early VCF files from the main project used LD information and other variables to help estimate the allele frequency. This means in these files the AF does not always equal AC/AN. In the phase 1 and phase 3 releases, AC/AN should always match the allele frequency quoted.
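The AC/AN division can be sketched in Python as below. This is a minimal illustration that assumes the INFO column carries AC and AN as written by fill-an-ac; the function name and the example INFO strings are made up.

```python
def allele_frequencies(info_field):
    """Return one AC/AN frequency per ALT allele from a VCF INFO string."""
    info = {}
    for item in info_field.split(";"):
        key, _, value = item.partition("=")
        info[key] = value
    an = int(info["AN"])
    if an == 0:
        return []            # no called alleles, so no frequency can be given
    # AC holds one comma-separated count per ALT allele.
    return [int(ac) / an for ac in info["AC"].split(",")]

# Hypothetical INFO value for a biallelic site.
print(allele_frequencies("AC=22;AN=170"))
```

For a multiallelic site the function returns one frequency per ALT allele, in the order in which the alleles appear in the ALT column.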

Another possibility is to use glactools. I will show this by using data from the 1000 genomes.

First, we download the chromosome names and length for the reference:


We then require information on the panels from the VCF file:

wget grep -v ^sample integrated_call_samples_v3.20130502.ALL.panel.txt | cut -f 1,3 > panel.txt

You can run the following to transform VCF into allele counts format (ACF):

tabix -h 2:136486829-136653337 |glactools vcfm2acf --onlyGT --fai human_g1k_v37.fasta.fai - |glactools meld -f panel.txt - | glactools view -h -

This will print the allele counts:

#chr coord REF,ALT root anc AFR AMR EAS EUR SAS
2 136486850 G,T 0,0:0 0,0:0 1322,0:0 694,0:0 1008,0:0 1006,0:0 974,4:0
2 136486967 C,T 0,0:0 0,0:0 1321,1:0 694,0:0 1007,1:0 1006,0:0 978,0:0
2 136487007 C,T 0,0:0 0,0:0 1322,0:0 694,0:0 1007,1:0 1006,0:0 978,0:0
2 136487181 C,T 0,0:0 0,0:0 1322,0:0 693,1:0 1008,0:0 1005,1:0 978,0:0
2 136487214 G,A 0,0:0 0,0:0 1321,1:0 694,0:0 1008,0:0 1006,0:0 978,0:0
2 136487246 G,A 0,0:0 0,0:0 1282,40:0 693,1:0 1008,0:0 1006,0:0 978,0:0
2 136487336 G,T 0,0:0 0,0:0 1322,0:0 694,0:0 1008,0:0 1006,0:0 977,1:0
2 136487417 G,A 0,0:0 0,0:0 1321,1:0 693,1:0 1008,0:0 1006,0:0 978,0:0
2 136487504 A,C 0,0:0 0,0:0 1316,6:0 694,0:0 1008,0:0 1006,0:0 978,0:0
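To turn such counts into per-population frequencies, something like the following Python sketch could be used. It assumes that each population field reads "referenceCount,alternativeCount:flag", which is how the output above appears to be laid out; the function name is made up, and the header and row are copied from the output shown.

```python
def population_alt_frequencies(header_line, data_line):
    """Map each population column to its ALT allele frequency (None if no data)."""
    names = header_line.lstrip("#").split()
    values = data_line.split()
    freqs = {}
    # The first three columns are chromosome, coordinate and REF,ALT.
    for name, value in zip(names[3:], values[3:]):
        counts = value.split(":")[0]          # drop the trailing ":0" flag
        ref_count, alt_count = (int(c) for c in counts.split(","))
        total = ref_count + alt_count
        freqs[name] = alt_count / total if total else None
    return freqs

header = "#chr coord REF,ALT root anc AFR AMR EAS EUR SAS"
row = "2 136487246 G,A 0,0:0 0,0:0 1282,40:0 693,1:0 1008,0:0 1006,0:0 978,0:0"
for population, freq in population_alt_frequencies(header, row).items():
    print(population, freq)
```

Columns with no counts at all (root and anc in this output) come back as None rather than as a frequency of zero.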

Refer to the Ensembl Core tutorial for a good description of the coding conventions normally used in Ensembl. We try as much as possible to stick to these rules in Variation.

Connecting to an Ensembl Variation database is made simple by using the Bio::EnsEMBL::Registry module:

The use of the registry ensures you will load the correct versions of the Ensembl databases for the software release it can find on a database instance. Using the registry object, you can then create any number of database adaptors. Each of these adaptors is responsible for generating an object of one type. The Ensembl Variation API uses a number of object types that relate to the data stored in the database. For example, in order to generate variation objects, you should first create a variation adaptor:

The get_adaptor method will automatically create a connection to the relevant database; in the example above, a connection will be made to the variation database for human. The three parameters passed specify the species, database and object type you require. Below is a non-exhaustive list of the Ensembl Variation adaptors that are most often used:

  • IndividualAdaptor to fetch Bio::EnsEMBL::Variation::Individual objects
  • LDFeatureContainerAdaptor to fetch Bio::EnsEMBL::Variation::LDFeatureContainer objects
  • PopulationAdaptor to fetch Bio::EnsEMBL::Variation::Population objects
  • ReadCoverageAdaptor to fetch Bio::EnsEMBL::Variation::ReadCoverage objects
  • TranscriptVariationAdaptor to fetch Bio::EnsEMBL::Variation::TranscriptVariation objects
  • VariationAdaptor to fetch Bio::EnsEMBL::Variation::Variation objects
  • VariationFeatureAdaptor to fetch Bio::EnsEMBL::Variation::VariationFeature objects

Only some of these adaptors will be used for illustration in this tutorial, through commented Perl scripts.

Are the IGSR variants available in genome browsers?

1000 Genomes Project data is available at both Ensembl and the UCSC Genome Browser.

Ensembl provides consequence information for the variants. The variants that are loaded into the Ensembl database and have consequence types assigned are displayed on the Variation view. Ensembl can also offer consequence predictions using their Variant Effect Predictor (VEP).

You can see individual genotype information in the Ensembl browser by looking at the Individual Genotypes section of the page from the menu on the left hand side.

The files are all gzip compressed and the format looks like this, with a four-line repeating pattern:
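The example record is not reproduced on this page. As a sketch, the four-line pattern of FASTQ (a header line starting with @, the sequence, a separator line starting with +, and the quality string) can be read in Python as follows. The function names and the reads in the usage example are made up.

```python
import gzip
from itertools import islice

def read_fastq(lines):
    """Yield (read_id, sequence, quality) from an iterable of FASTQ lines."""
    lines = iter(lines)
    while True:
        record = [line.rstrip("\n") for line in islice(lines, 4)]
        if len(record) < 4:
            return                      # end of input (or a truncated record)
        header, sequence, separator, quality = record
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError("not a four-line FASTQ record: " + header)
        yield header[1:], sequence, quality

def read_fastq_gz(path):
    """Same as read_fastq, but reading a gzip compressed file from disk."""
    with gzip.open(path, "rt") as handle:
        yield from read_fastq(handle)

# Hypothetical in-memory example with two short reads.
demo = ["@read1\n", "ACGT\n", "+\n", "IIII\n", "@read2\n", "GG\n", "+\n", "##\n"]
for read_id, sequence, quality in read_fastq(demo):
    print(read_id, sequence, quality)
```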

Many of our individuals have multiple fastq files. This is because many of our individuals were sequenced using more than one run of a sequencing machine.

Each set of files named like ERR001268_1.filt.fastq.gz, ERR001268_2.filt.fastq.gz and ERR001268.filt.fastq.gz represents all the sequence from a sequencing run.

The labels _1 and _2 indicate paired-end files: mate 1 is found in the file labelled _1 and mate 2 in the file labelled _2. The files which do not have a number in their name contain single-ended reads. This can be for two reasons: some sequencing early in the project was single-ended, and, because we filter our fastq files as described in our README, if one of a pair of reads gets rejected the other read gets placed in the single file.
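The naming convention can be checked programmatically with a small Python sketch like the one below. The regular expression assumes run accessions made of three capital letters followed by digits (as in ERR001268); that pattern and the function name are assumptions for illustration only.

```python
import re

# Assumed pattern: run accession, optional _1/_2 mate label, fixed suffix.
FASTQ_NAME = re.compile(
    r"^(?P<run>[A-Z]{3}\d+)(?:_(?P<mate>[12]))?\.filt\.fastq\.gz$"
)

def describe_fastq(filename):
    """Return (run_accession, kind), where kind is mate1, mate2 or single."""
    match = FASTQ_NAME.match(filename)
    if match is None:
        raise ValueError("unexpected fastq file name: " + filename)
    kind = {"1": "mate1", "2": "mate2", None: "single"}[match.group("mate")]
    return match.group("run"), kind

for name in ("ERR001268_1.filt.fastq.gz",
             "ERR001268_2.filt.fastq.gz",
             "ERR001268.filt.fastq.gz"):
    print(name, describe_fastq(name))
```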

When an individual has many files with different run accessions (e.g. ERR001268), this means it was sequenced multiple times. This can either be for the same experiment (some centres used multiplexing to have better control over their coverage levels for the low coverage sequencing) or because it was sequenced using different protocols or on different platforms.

For a full description of the sequencing conducted for the project, please look at our sequence.index file.

The tools

  • fill-aa
  • fill-an-ac
  • fill-fs
  • fill-ref-md5
  • fill-rsIDs
  • vcf-annotate
  • vcf-compare
  • vcf-concat
  • vcf-consensus
  • vcf-contrast
  • vcf-convert
  • vcf-filter
  • vcf-fix-newlines
  • vcf-fix-ploidy
  • vcf-indel-stats
  • vcf-isec
  • vcf-merge
  • vcf-phased-join
  • vcf-query
  • vcf-shuffle-cols
  • vcf-sort
  • vcf-stats
  • vcf-subset
  • vcf-to-tab
  • vcf-tstv
  • vcf-validator


fill-an-ac

Fill or recalculate AN and AC INFO fields.

zcat file.vcf.gz | fill-an-ac | bgzip -c > out.vcf.gz
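As a rough illustration of what recalculating these two fields involves, the Python sketch below counts AN and AC from a list of GT strings. It is not the fill-an-ac source code; the function name and the example genotypes are made up.

```python
def count_an_ac(genotypes, n_alt):
    """Return (AN, [AC per ALT allele]) from GT strings such as '0|1'."""
    an = 0
    ac = [0] * n_alt
    for gt in genotypes:
        for allele in gt.replace("|", "/").split("/"):
            if allele == ".":
                continue                # missing alleles do not count towards AN
            an += 1
            if int(allele) > 0:
                ac[int(allele) - 1] += 1    # allele index 1 is the first ALT
    return an, ac

# Hypothetical genotypes for one biallelic site.
print(count_an_ac(["0|0", "0|1", "1|1", "./."], n_alt=1))   # (6, [3])
```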


fill-fs

Annotates the VCF file with flanking sequence (INFO/FS tag), masking known variants with N's. Useful for designing primers.

fill-fs -r /path/to/refseq.fa | vcf-query '%CHROM %POS %INFO/FS ' >


fill-ref-md5

Fill missing reference info and sequence MD5s into VCF header.

fill-ref-md5 -i "SP:Homo Sapiens" -r ref.fasta in.vcf.gz -d ref.dict out.vcf.gz


fill-rsIDs

Fill missing rsIDs. This script has been discontinued; please use vcf-annotate instead.


vcf-annotate

The script adds or removes filters and custom annotations to VCF files. To add custom annotations to VCF files, create a TAB-delimited file with annotations such as

#CHR   FROM    TO      ANNOTATION
1      12345   22345   gene1
1      67890   77890   gene2

Compress the file (using bgzip annotations), index it (using tabix -s 1 -b 2 -e 3 annotations.gz) and run

cat in.vcf | vcf-annotate -a annotations.gz \
   -d key=INFO,ID=ANN,Number=1,Type=Integer,Description='My custom annotation' \
   -c CHROM,FROM,TO,INFO/ANN > out.vcf

The script is also routinely used to apply filters. There are a number of predefined filters and custom filters can be easily added, see vcf-annotate -h for examples. Some of the predefined filters take advantage of tags added by bcftools, the descriptions of the most frequently asked ones follow:

Note: A fast htslib C version of this tool is now available (see bcftools annotate).


vcf-compare

Compares positions in two or more VCF files and outputs the numbers of positions contained in one but not the other files, two but not the other files, etc., which comes in handy when generating Venn diagrams. The script also computes numbers such as nonreference discordance rates (including multiallelic sites), compares actual sequence (useful when comparing indels), etc.

vcf-compare -H A.vcf.gz B.vcf.gz C.vcf.gz

Note: A fast htslib C version of this tool is now available (see bcftools stats).


vcf-concat

Concatenates VCF files (for example split by chromosome). Note that the input and output VCFs will have the same number of columns; the script does not merge VCFs by position (see also vcf-merge).

In the basic mode it does not do anything fancy except for a sanity check that all files have the same columns. When run with the -s option, it will perform a partial merge sort, looking at a limited number of open files simultaneously.

vcf-concat A.vcf.gz B.vcf.gz C.vcf.gz | gzip -c > out.vcf.gz


vcf-consensus

Apply VCF variants to a fasta file to create a consensus sequence.

cat ref.fa | vcf-consensus file.vcf.gz > out.fa


vcf-convert

Convert between VCF versions, currently from VCFv3.3 to VCFv4.0.

zcat file.vcf.gz | vcf-convert -r reference.fa > out.vcf


vcf-contrast

A tool for finding differences between groups of samples, useful in trio analyses, cancer genomes etc.

In the example below, variants with average mapping quality of 30 (-f MinMQ=30) and minimum depth of 10 (-d 10) are considered. Only novel alleles are reported (-n). Then vcf-query is used to extract the INFO/NOVEL* annotations into a table. Finally the sites are sorted by confidence of the site being different in the child (-k3,3nr).

vcf-annotate -f MinMQ=30 file.vcf | vcf-contrast -n +Child -Mother,Father -d 10 -f | vcf-query -f '%CHROM %POS %INFO/NOVELTY %INFO/NOVELAL %INFO/NOVELGT[ %SAMPLE %GTR %PL] ' | sort -k3,3nr | head


vcf-filter

Please take a look at vcf-annotate and bcftools view, which do what you are looking for. Apologies for the non-intuitive naming.
Note: A fast HTSlib C version of a filtering tool is now available (see bcftools filter and bcftools view).


vcf-fix-ploidy

Fixes diploid vs haploid genotypes on sex chromosomes, including the pseudoautosomal regions.


vcf-indel-stats

Note: A fast htslib C version of this tool is now available (see bcftools stats).


vcf-isec

Creates intersections and complements of two or more VCF files. Given multiple VCF files, it can output the list of positions which are shared by at least N files, at most N files, exactly N files, etc. The first example below outputs positions shared by at least two files and the second outputs positions present in file A but absent from files B and C.

vcf-isec -n +2 A.vcf.gz B.vcf.gz | bgzip -c > out.vcf.gz
vcf-isec -c A.vcf.gz B.vcf.gz C.vcf.gz | bgzip -c > out.vcf.gz

Note: A fast htslib C version of this tool is now available (see bcftools isec).
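The two operations in the examples above (positions shared by at least N files, and positions present in the first file only) can be pictured with plain Python sets. This is a conceptual sketch of the set logic, not of how the script itself is implemented; the function names and the example positions are made up.

```python
from collections import Counter

def shared_positions(position_sets, min_files):
    """Positions present in at least min_files of the given collections."""
    seen = Counter()
    for positions in position_sets:
        seen.update(set(positions))     # count each file at most once per position
    return sorted(p for p, n in seen.items() if n >= min_files)

def complement(first, *others):
    """Positions present in the first collection but absent from all others."""
    return sorted(set(first).difference(*others))

# Hypothetical (chromosome, position) pairs from three files.
A = {("1", 100), ("1", 200)}
B = {("1", 200), ("1", 300)}
C = {("1", 300)}
print(shared_positions([A, B], 2))   # [('1', 200)]
print(complement(A, B, C))           # [('1', 100)]
```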


vcf-merge

Merges two or more VCF files into one so that, for example, if two source files had one column each, the output will be a file with two columns. See also vcf-concat for concatenating VCFs split by chromosome.

vcf-merge A.vcf.gz B.vcf.gz C.vcf.gz | bgzip -c > out.vcf.gz

Note that this script is not intended for concatenating VCF files. For this, use vcf-concat instead.
Note: A fast htslib C version of this tool is now available (see bcftools merge).


vcf-phased-join

Concatenates multiple overlapping VCFs preserving phasing.


vcf-query

Powerful tool for converting VCF files into a format defined by the user. Supports retrieval of subsets of positions, columns and fields.

vcf-query file.vcf.gz 1:10327-10330
vcf-query file.vcf -f '%CHROM:%POS %REF %ALT [ %DP] '

Note: A fast htslib C version of this tool is now available (see bcftools query).


vcf-shuffle-cols

vcf-shuffle-cols -t template.vcf.gz file.vcf.gz > out.vcf



vcf-stats

Outputs some basic statistics: the number of SNPs, indels, etc.

Note: A fast htslib C version of this tool is now available (see bcftools stats).


vcf-subset

Remove some columns from the VCF file.

vcf-subset -c NA0001,NA0002 file.vcf.gz | bgzip -c > out.vcf.gz

Note: A fast HTSlib C version of this tool is now available (see bcftools view).


vcf-tstv

A lightweight script for quick calculation of Ts/Tv ratio.

Note: A fast htslib C version of this tool is now available (see bcftools stats).
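For readers unfamiliar with the statistic: a transition (Ts) swaps a purine for a purine (A<->G) or a pyrimidine for a pyrimidine (C<->T), while any other single-base change is a transversion (Tv). The Python sketch below illustrates the ratio itself, not the script's implementation; the function names are made up, and the example pairs are the REF,ALT alleles from the glactools output shown earlier.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def is_transition(ref, alt):
    """True for A<->G or C<->T changes."""
    pair = {ref, alt}
    return ref != alt and (pair <= PURINES or pair <= PYRIMIDINES)

def ts_tv_ratio(snps):
    """Ts/Tv ratio for an iterable of (ref, alt) single-base pairs."""
    ts = tv = 0
    for ref, alt in snps:
        ref, alt = ref.upper(), alt.upper()
        # Ignore indels, identical alleles and non-ACGT symbols.
        if ref == alt or not {ref, alt} <= (PURINES | PYRIMIDINES):
            continue
        if is_transition(ref, alt):
            ts += 1
        else:
            tv += 1
    return ts / tv if tv else float("inf")

pairs = [("G", "T"), ("C", "T"), ("C", "T"), ("C", "T"), ("G", "A"),
         ("G", "A"), ("G", "T"), ("G", "A"), ("A", "C")]
print(ts_tv_ratio(pairs))   # 2.0 (6 transitions, 3 transversions)
```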


vcf-to-tab

A simple script which converts the VCF file into a tab-delimited text file listing the actual variants instead of ALT indexes.

zcat file.vcf.gz | vcf-to-tab >


For examples of how to use the Perl API, it is best to look at some of the simpler scripts, for example vcf-to-tab. The detailed documentation can be obtained by running


Population-scale Genomics Data Augmentation Based on Conditional Generative Adversarial Networks

Although next generation sequencing technologies have made it possible to quickly generate a large collection of genomes, it is sometimes infeasible (e.g. in rare disease studies where samples are limited) to quickly produce a large number of genomes. Additionally, due to privacy and security concerns, human genomics data are not readily or widely accessible. Models built on small and imbalanced datasets can thus be biased or inaccurate; as a result, conclusions can be error-prone or unfair. In order to address this problem, we develop a Population-scale Genomic Data Augmentation based on Conditional Generative Adversarial Networks (PG-cGAN) to enhance the amount and diversity of genomic data by transforming samples already in the data rather than collecting new samples. PG-cGAN is stacked with convolutional layers, aiming to capture underlying structures such as linkage disequilibrium (LD) patterns in genomic data. We demonstrated the application of the proposed PG-cGAN model to augment human genotype data for human leukocyte antigen (HLA) regions, using genotypes from the 1000 Genomes Project with 2,504 samples from five super-populations worldwide. Our results for augmenting genotypes in HLA regions showed that PG-cGAN can generate new genotypes with similar population structure, variant frequency distributions and LD patterns. PG-cGAN can also generate and augment human genomic data for any specific population, with the corresponding population label as input condition information. This advantage of flexible augmentation gives PG-cGAN great potential to improve the reliability and fairness of downstream analysis. Since the only input for PG-cGAN is the original genomic data, without assumptions about model parameters or data distributions, it can be extended to enrich many other types of biomedical data and beyond.

(a) Architecture of conditional GAN on genomics data (b) Architectures of the generator and discriminator

Fig. 1. Architecture of the proposed PG-cGAN model for genomics data augmentation. The generator accepts a population label as a condition, and then embeds it to the same dimension with a noise vector, in order to join with the noise input by multiplication. The discriminator also accepts a population label as a condition, and then embeds it to the same dimension with the genotype vector, in order to join with the genotype input by multiplication.

These instructions will get you a copy of the project up and running on your local machine for development and testing purposes. The code runs in a Jupyter Notebook, because Jupyter Notebook supports interactive development and testing, which is very convenient in practice.

Junjie Chen, Mohammad Erfan Mowlaei, and Xinghua Shi*. 2020. Population-scale Genomics Data Augmentation Based on Conditional Generative Adversarial Networks. In Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics (BCB ’20), September 21–24, 2020, Virtual Event, USA. ACM, New York, NY, USA, 6 pages.


interPopula provides a Python API to access the HapMap dataset. Interfaces to all HapMap phases are supported including phase 2 data with fewer populations but more SNPs genotyped per individual and phase 3 covering more populations. interPopula provides access to frequency, genotype, linkage disequilibrium and phasing datasets. The recent CNV dataset is also supported along with family relationships for the 5 populations where sampling was performed for family trios (mother, father and one offspring).

Support for annotation information that is commonly needed to process HapMap data is also provided through an API to both the UCSC Known Genes dataset [8] from the UCSC genome browser database [9] and the Ensembl gene annotation database [10].

The API was constructed according to the following design guidelines:

1. The API is straightforward and self-contained. The core API requires only a Python interpreter, has no extra dependencies and minimal administrative overhead.

2. Downloaded data is stored on an SQL database for faster access. All data is stored using sqlite [11] which is natively supported in Python thus lowering the maintenance costs of the system. interPopula can also be connected to enterprise-grade databases which support multiple users, concurrent usage and large datasets for which the standard sqlite backend might not be enough (a PostgreSQL example is provided).

3. Data management (i.e. downloading from the HapMap site and local database construction) is fully automated: the required data subset is downloaded on demand only once and stored locally, reducing the load on both the client and server.

4. While SQL interfaces are made available from both the UCSC and Ensembl projects for their annotation databases, interPopula uses the same implementation strategy for the HapMap dataset: files are intelligently downloaded and locally stored. This provides a consistent interface to these two datasets which provide important annotation information frequently used to process HapMap data.

5. The framework is extensible and designed to be easily integrated with other Python tools and external databases. The web site provides several examples of integration with standard tools used in Python for bioinformatics such as Biopython [12], NumPy [13] and matplotlib [14].

6. Integration with Biopython allows for access to the Entrez SNP database and the population genetics tools supported by Biopython such as Genepop [15] allowing automated analysis of datasets.

7. Facilities to export HapMap data to Genepop format are provided, enabling (non-automated) analysis of the HapMap dataset with the plethora of population genetics software which supports this format. Data export can also be used to initialize population genetics simulators like the Python-based simuPOP [16], allowing computational simulations to be initialised with real datasets.

8. A large set of scripts is included, serving both as utilities to analyse the data, as well as examples of database and external tool integration. Currently we provide examples of integration with Entrez databases (nucleotide and SNP), the Genepop population genetics suite and charting libraries.

9. A set of guidelines and scripts was developed in order to facilitate a consistent view across heterogeneous databases. HapMap, Ensembl, UCSC Known Gene and the Entrez databases might not be fully consistent among themselves and, if care is not taken, database integration efforts might lead to erroneous results. The main pitfall is the usage of different NCBI reference builds across different databases, most notably HapMap is still based on build 36 whereas other databases either support multiple builds or only the most recent build 37.

10. A robust open-source software development process is put in place: a full public web based platform (hosted on Launchpad) is used to maintain the code infrastructure and unit tests approach 100% coverage.




PopLDdecay: A new simple and efficient software for Linkage Disequilibrium Decay analysis based on Variant Call Format

Method 1: for Linux/Unix and macOS

Note: If linking fails, try re-installing the zlib library.

Method 2: for Linux/Unix and macOS

Note: If linking fails, try re-installing the zlib library.

See the Documentation for more detailed usage.

Linkage disequilibrium (LD) decay[1] is the most important and most common analysis in population resequencing[2]. Especially in self-pollinated crops, LD decay may not only reveal much about domestication and breeding history[3], but can also reveal gene flow and regions under selection[1]. However, measuring LD decay with currently existing software and tools takes too much time and too many resources. LD decay studies also generate extraordinarily large amounts of temporary data when using the mainstream software "Haploview"[4], the classical LD processing tool. Effectively using and analysing the data to obtain the LD decay result remains a difficult task for individual researchers. Here, we introduce PopLDdecay, a simple and efficient software for LD decay analysis, which processes Variant Call Format (VCF)[5] files to produce LD decay statistics and plot LD decay graphs. PopLDdecay is designed to use compressed data files as input or output to save storage space, and it is faster and more computationally efficient than currently existing software. This software makes the LD decay pipeline significantly

The following software was tested using data from this web site, restricted to biallelic sites on chr22 (a minimal SNP database) of the 1000 Genomes Project; all pairwise SNP R^2 values were the same.
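For context, the pairwise R^2 being compared is the standard LD statistic r^2 = D^2 / (pA(1 - pA) pB(1 - pB)), where D = pAB - pA pB. A minimal Python sketch from phased haplotypes follows. It illustrates the textbook definition only and makes no claim about how PopLDdecay or the other tested tools compute it internally; the function name and the example haplotypes are made up.

```python
def r_squared(hap_a, hap_b):
    """r^2 between two biallelic sites, given phased 0/1 alleles per haplotype."""
    n = len(hap_a)
    if n == 0 or n != len(hap_b):
        raise ValueError("need two non-empty haplotype lists of equal length")
    p_a = sum(hap_a) / n                      # frequency of allele 1 at site A
    p_b = sum(hap_b) / n                      # frequency of allele 1 at site B
    p_ab = sum(1 for a, b in zip(hap_a, hap_b) if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b                      # disequilibrium coefficient D
    denominator = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denominator == 0:
        return None                           # undefined if either site is monomorphic
    return d * d / denominator

# Hypothetical haplotypes: perfectly linked sites, then independent sites.
print(r_squared([0, 0, 1, 1], [0, 0, 1, 1]))   # 1.0
print(r_squared([0, 0, 1, 1], [0, 1, 0, 1]))   # 0.0
```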


We thank two anonymous reviewers, whose comments improved our manuscript. This work was supported by National Science Foundation grant DEB-1257806 and National Institutes of Health grant NIH-NIGMS R01-GM101672. It was also supported by the National Center for Genome Analysis Support, funded by National Science Foundation grant DBI-1458641 to Indiana University, and Indiana University Research Technology’s computational resources.

Note added in proof: See Ye et al. 2017 (pp. 1405) in this issue and Ackerman et al. 2017 (pp. 105) and Lynch et al. 2017 (pp. 315) in the GENETICS May issue for related work.
