Calculating sequence divergence score for a protein from identity or similarity score?

Calculating sequence divergence score for a protein from identity or similarity score?

We are searching data for your request:

Forums and discussions:
Manuals and reference books:
Data from registers:
Wait the end of the search in all databases.
Upon completion, a link will appear to access the found materials.

I have% identityand% similarityscores for ~50K protein alignments, that I fetched from Ensembl Compara database. The issue is that I wanted to have divergence scores instead. So in order to calculate divergence scores, I first looked for the conventional method to calculate divergence score from protein alignments. But turns out, most of the methods/tools calculate% identityand% similarityscores instead of divergence scores.

(1) So I wonder what is the conventional method to calculate divergence scores from protein alignments.

(2) Could I simply calculate% sequence divergenceas100 - % identityor100- % similarityscore? If that's ok, should I prefer100 - % identityor100- % similarity?

Any suggestions from experts are welcome.

Update: As suggested in a comment below, I should mention that the average% identityscore is more than 70. I hope this information would be helpful.

This is a very tentative answer as I haven't done anything like this, but it's a learning experience for me and I hope it could be helpful to get a more knowledgeable response started.
I don't think you can do this. Per Wang, 2009 and Yona, 2002, divergence scores could be Kullback-Leibler divergence or Jensen-Shannon divergence, but both papers opt for the latter. Divergence scores are based on "empirical probability distributions between the 20 amino acids". BLOSUM or PAM matrices may be used. Simply counting the identical or similar residues wouldn't contain as much information. Wang, 2004 compared JensenShannon to numerous other scoring methods in the context of PSI-BLAST.

The similarity score is related to the divergence score: Score = 0.5(1-D)(1+S) where D is the divergence score and S is the significance score. It is equation #15 in Wang, 2009 paper. If you could safely assume the significance score for some alignments was near 1 (no chance similarities) then maybe you could say D = 1 - similarity score. Unfortunately, judging by the glossary because I haven't downloaded any of the datasets, for the Ensembl data similarity is merely "How well one sequence matches another determined by calculation by an alignment program of identical and conserved residues/nucleotides." So from that I take it they may just be counting up the "conserved" amino acids in the old fashioned way, but don't know that.


With the advent of the advanced sequencing techniques, researchers are generating a large number of protein sequences. This brings in a new challenge 1,2 for phylogenetic and comparative study of these protein sequences. Phylogenetic study and comparative analysis between taxa are an essential part of molecular biology and bioinformatics. These studies, traditionally depended on multiple or pairwise sequence alignments which are the well established classical approach and regarded as a standard method for sequence analysis. However, producing reliable multiple sequence alignments become extremely difficult when more dissimilar protein sequences are considered. The traditional alignment-based methods 3,4,5 are much empirical to select and create a sequence alignment score matrix, and variation of which may affect the alignment results. Various alignment-free tools 6,7,8,9,10,11,12,13 have been developed over the past two decades to overcome the alignment complexity for phylogenetic analysis. An alignment-free approach consist of two steps for comparing protein sequences. At the first step, the protein sequences are converted into a fixed-length feature vectors. Feature extraction is a series of process for extracting the required information from the query sequences, which is critical for the accuracy of an alignment-free method. At the second step, these extracted feature vectors are used as an input data in vectors similarity comparison algorithm to perform downstream analysis like phylogenetic analysis. Methods based on graphical representation, distance frequency matrix, numerical characterization, K-string dictionary etc., have been introduced to overcome the complication of the sequence alignment. Graphical representation 14,15 of protein sequences provides a simple way of viewing, sorting and comparing various sequences. It also provides mathematical descriptor which help in identifying differences among similar protein sequences quantitatively. Distance frequency of amino acid pairs suggest a new numerical characterization of protein sequence, which converts protein sequence into a distance frequency matrix 16 . Numerical characterization directly extracted from protein sequence would capture the essence of the amino acid composition and their distribution on the protein sequence in a quantitative aspect. In this approach, each sequence is mapped into a vector or matrix based on the numerical characterization extracted from the protein sequence. Subsequently, a similarity score is calculated by following distance measure tools, such as, Euclidean distance, Cosine distance, Manhattan distance, etc., among their corresponding vectors or matrices. K-string dictionary 17 approach permit users to use a much lower dimensional frequency or probability vector to represent a protein sequence. It also significantly reduces the space requirement for their implementation. Furthermore, after getting the lower dimensional frequency vectors, Singular Value Decomposition (SVD) is used to get a better protein vector representation which helps user to obtain a precise phylogenetic tree. However, these above mentioned methods are lagging behind in terms of accuracy. Thus, more discriminatory features are still needed to be developed. In addition to the accuracy, these method have another drawback and that is, computational complexity. Motivated by the aforementioned work, in this study, we proposed to use fuzzy integral algorithm 18,19 for analysis of protein sequence based on Markov chain 20 . Fuzzy integral similarity 21,22 method assigns similarity score within the closed interval [0, 1] between two protein sequences. A protein sequence consists of twenty amino acids. By taking these 20 amino acids as a state space M = <A, I, L, M, F, P, W, V, D, E, N, C, Q, G, S, T, Y, R, H, K>, we have used k th −step transition probability matrix, fuzzy measure 23 , fuzzy integral to describe protein sequence. We have used fuzzy integral similarity for getting distance matrix, which is used in neighbor program in PHYLIP package 24 for constructing a phylogenetic tree. The advantage of our method is, it do not require any prior knowledge of homologous relationship (common ancestry) among the sequences, which makes it fully automated and robust. For validation of our developed algorithm, we implemented our approach on NADH Dehydrogenase-5 protein sequences, NADH Dehydrogenase-6 protein sequences, xylanases protein sequences in the F10 and G11 datasets, transferrin protein sequences, coronavirus spike protein sequences and beta-globin protein sequences. We compared the tree generated by our method with the trees generated by both alignment-free method, and alignment-based ClustalW method using MEGA package 25 . In addition, we used few standard statistical tools such as correlation coefficient (CC), Robinson-Foulds distance (RF-distance) 26 and receiver operating characteristic (ROC) 27,28,29 curve to compare distance matrices generated by our method with the other alignment-free methods. The main purpose of this study is to compare the performance among alignment-based and alignment-free protein clustering methods and to identify their strengths and weakness from the practical perspectives of the users.

EFI - Enzyme Similarity Tool

The information needed for the generation of the dataset has been fetched and processed. The similarity between the sequences retrieved has been calculated. Now the Dataset Analysis page provides a summary about the input used, and the returned calculations for the all-by-all BLAST. You must interpret the provided information in order to choose an alignment score that will be used for the final step of the SSN generation.

Choosing an appropriate threshold

Networks are best interpreted with an alignment score upper limit that gathers the sequences into clusters that represent families with only a single function (termed "isofunctional"). If the alignment score is too large, the network may be overly fractured, and isofunctional families will be split into multiple subfamilies. If the alignment score is too small, multiple families will be merged into a single cluster.

If a sufficient number of annotations is available, the optimum alignment score is determined empirically by mapping known functions onto the network (using functional annotations/node attributes included with network files and/or those that can be added by the user) and observing how they partition into separate clusters as the alignment score is decreased. This can be done by increasing the alignment score using Cytoscape's filter function. With this in mind, the recommended procedure is to output the initial SSN with a "low" alignment score so that isofunctional families are not separated. Although this alignment score will depend on the family, a useful "rule-of-thumb" is that isofunctional families often share >40% sequence identity. Thus, we recommend that the alignment score used to output the initial SSN should correspond to a lower sequence identity, e.g., 35%.

Information provided

The input method is summarized, for the user's records.

When the all-by-all BLAST is complete, EFI-EST provides four plots on the DATA SET COMPLETED page that are used to guide the selection of the alignment score for outputting the SSN:

Number of Edges Histogram
1 ) number of edges vs. alignment score

Length Histogram
2 ) sequence length vs. occurrence

These can be viewed directly in your browser and/or downloaded to your computer.

The number of edges histogram allows an assessment of the number of edges as a function of alignment score in your dataset. The edges with large alignment scores (greater percent sequence identities) define isofunctional clusters the edges with small alignment scores (lesser percent sequence identities) define the relationships between the isofunctional clusters. For functional assignment purposes, segregation of the SSN into isofunctional clusters is essential to distinguish among functions. For understanding the sequence/structural bases for divergent evolution of function, the connections between the isofunctional clusters are important. Thus, this plot may assist you with the selection of the alignment score threshold for generating your SSNs. In most cases, the small alignment scores will dominate this histogram (with the computation of these edges between isofunctional clusters dominating/lengthening the computation time for the all-by-all BLAST).

The length histogram allows an assessment of length heterogeneity in your dataset. Many proteins/enzymes contain a single functional domain these will be the most straightforward for determining the alignment score to use for outputting the SSN (as described in the examples that follow). However, other proteins may have two or more domains as evidenced by the presence of longer sequences these have the potential of complicating the selection of the alignment score for outputting the SSN (as also described in the examples that follow). Finally, because of sequencing errors, truncated fragments are commonly observed. Although the number of fragments in any dataset likely will be small, these have the potential to confuse the appearance/interpretation of the quartile plots.

The quartile plots and their use in guiding the selection of the alignment score are described in the two examples that follow.


Example 1: a family of single domain proteins

For a "simple" case, the length histogram for the proline racemase superfamily (IPR008794), (Example 1B) shows that almost all of the proteins have roughly similar lengths within +/- 30 residues (a single domain).

The alignment length versus alignment score quartile plot (Example 1C) shows that as the alignment score increases, the length of the sequence that is included in the calculation of the alignment score increases to the full length of the proteins (

300 residues). For "small" alignment scores, when the alignment length is significantly less than the full length, both the alignment length and percent identity (Example 1D) plots show considerable "scatter" because short stretches of residues are responsible for the alignment score. The scatter is normal. [Some of the short alignment stretches may be caused by the presence of fragments that always are present as a result of sequencing errors.] In most cases, the small alignment score portions of both alignment length vs. alignment score and percent ID vs. alignment score quartile plots can and should be ignored.

Instead, attention should be given to those portions of both plots when the alignment score calculation results from alignment of the full length of the sequence (in this case at alignment scores >

20). For cases such as this, use the monotonic increase in percent identity as a function of increasing alignment scores to guide your initial selection of the alignment score to be used in generating the network file. Although there is no quantitative "rule" as to how function diverges as percent identity decreases, we recommend that your initial networks should be generated with an alignment score threshold that corresponds to

35% sequence identity. Thus, in this example, an alignment score of 50 would be a good starting point for the initial networks and would entered into the field in part 2.

After you have the network, you can use the filter function in Cytoscape to remove edges that correspond to alignment scores larger than the initial value, thereby generating SSNs in which the nodes in clusters share greater percent identities. If sufficient functional annotation information is available for your family, the alignment score/percent identity that defines isofunctional clusters in your SSN can be determined empirically by decreasing the alignment score threshold until the assigned functions segregate into separate clusters.

In this case, there is no reason to filter on length, because the fraction of fragments is small, and the vast majority of the sequences contain a single domain. Thus, no values would be entered in the fields in part 3.

Example 1 (A). Number of edges histogram for the proline racemase superfamily (IPR008794).

Example 1 (B). Length histogram for the proline racemase superfamily (IPR008794).

Example 1 (C). Alignment length vs. alignment score quartile plot for the proline racemase superfamily (IPR008794).

Example 1 (D). Percent identity vs. alignment score quartile plot for the proline racemase superfamily (IPR008794).

Example 2: multidomain proteins

In a more complicated situation, the polypeptides of homologous members of the vicinal oxygen chelate superfamily (VOC IPR004360) can be either a single domain or two tandemly fused homologous copies of the same domain. The active sites are located at the interfaces between two domains, either from two one-domain polypeptides or at the interfaces of the two-domain polypeptides. In this case, the length histogram is bimodal (Example 2B). The quartile plots also reflect the bimodality: in the alignment length vs. alignment score plot, as the alignment score increases, the alignment length plateaus as the length approaches that of the length of the single-domain polypeptides as the alignment score increases further, the alignment length increases and eventually plateaus at the length of the two-domain polypeptides (Example 2C). In the percent identity vs. alignment score quartile plot, the percent identity increases as the alignment score increases (Example 2D). However, when the alignment length increases to include the two-domain polypeptides, the percent identity decreases and then increases to again approach 100%. Notice that the interpretations of the quartile plots are inter-related: the "breaks" in the alignment length versus alignment score and percent identity versus alignment score plots occur at the same alignment score(s).

For multidomain proteins, our experience is that you should focus on the portions of the length and percent identity quartile plots at the smaller alignment scores that apply to alignment to the single domain and use that dependence of percent identity on alignment score to select the alignment score for generating your network. In this case, a value of 100 is an appropriate initial alignment score to enter in part 3.

Example 2 (A). Number of edges histogram for the VOC superfamily (IPR004360).

Example 2 (B). Length histogram for the VOC superfamily (IPR004360).

Example 2 (C). Alignment length vs. alignment score quartile plot for the VOC superfamily (IPR004360).

Example 2 (D). Percent identity vs. alignment score quartile plot for the VOC superfamily (IPR004360).

The field in part 4 is used to enter a title for your SSN. This title will be displayed in Cytoscape.

After the alignment score and length limits, if desired, are entered, EFI-EST generates the output network. As with dataset creation, this step may take awhile so you may close the running window in the meantime. When the network files are finished, you’ll receive an e-mail with a link to the file download page. This link will be active for 14 days so that you may return at your convenience.

*If you need a refresher on boxplots, there are several good online math resources (such as this page).

The SWISS-MODEL Template Library (SMTL)

The SWISS-MODEL template library is a large structural database of experimentally determined protein structures derived from the Protein Data Bank (Berman et al).

It serves as the main repository of structural information for the modelling pipeline and provides atomic coordinates of protein structures as well as maintains sequence and profile databases which can be searched by BLAST and HHblits. Alignment-independent properties of the templates are precalculated and stored in the database, e.g. a mapping between residues resolved in the experiment and corresponding residues in the full protein sequence, predicted solvent accessibility and secondary structure information.

Individual entries of the SMTL can be inspected using the web interface. The sequence features are linked to a 3D structure viewer and can be interactively explored. SMTL IDs consist of the PDB ID, an integer representing the biounit and a capital letter for the chain ID. The SMTL chain ID is not necessary, the same as the PDB chain ID. The mapping is shown in "SMTL:PDB".

Ligands can be marked as synthetic, natural or part of crystallisation buffer. This information is used by the modelling pipeline to determine whether a ligand is considered for inclusion into the final model.

Biological Assemblies (Biounit) of Templates

The biological assembly (biounit) describes the oligomeric state, or quaternary assembly, which is thought of as the biologically most relevant form of the molecule. For a detailed description see Biological Assemblies on PDB-101.

The biological assembly reported in the SMTL is retrieved from the PDB entry.

SMTL entries are organised (if more than one assembly is available) by likely quaternary structure assemblies which are created according to the author and software-annotated oligomeric states listed in the PDB deposition. If not all chains of the asymmetric unit are included by any biounit of a PDB entry, the asymmetric unit is included as a template.

EFI - Enzyme Similarity Tool

Prof. Patricia Babbitt’s group at UCSF first developed sequence similarity networks (SSNs) as a way to deal with the ever-increasing deluge of sequences deposited in public databases. Their seminal papers describing the SSN technique (1) and their program Pythoscape (2) were the inspiration for the EFI’s EFI-EST webserver.

EFI-EST is a web-tool that allows to easily generate SSNs that can be visualized in Cytoscape (3).

What is a Sequence Similarity Network?

A sequence similarity network (SSN) allows to visualize relationships among protein sequences. In SSNs, the most related proteins are grouped together in clusters.

Generating a SSN

The generation of a SNN involves two steps. First, a set of sequences to analyze is chosen, and an all-by-all BLAST is performed to determine, for each pair of sequences in the dataset, their similarity as a consideration of their relatedness. The second step involves filtering the sequences into clusters, based on a similarity threshold that is user defined.

Filtering sequences into clusters

When visualizing an SSN, protein sequences are represented as “nodes”. The line connecting two nodes is an “edge”. It is an indication of the relatedness between the nodes. An edge is drawn between nodes only if the BLAST pairwise similarity scores between the connected nodes is above a user defined threshold (Figure 1).

It is the user that defines the threshold at which sequences should be connected in a network. For families that contains non-isofunctional enzymes, a threshold score separating the different functions in different independent clusters is a good starting point. There is no predefined threshold: each protein set has its own optimal threshold that needs to be empirically determined.

Groups of highly similar proteins display a high degree of interconnectivity as the threshold alignment score is increased. These “clusters” are often very useful for the interrogation of enzyme function. Experienced users generate and compare several SSNs with various thresholds to visualize the interconnectivity evolution.

Figure 1. Example of a simple sequence similarity network as a function of e-value.

In the case of the Figure 1, if the alignment score threshold is specified as 28 (center), then edges are only drawn between nodes (protein sequences) that share that level of similarity (or greater). If two proteins are not connected, that means their sequences are less similar than described by the 28 threshold value. If the network is recalculated at a more stringent (greater) alignment score (right, threshold 56), the network is segregated into clusters of highly similar proteins. If the network is recalculated at a more permissive (lesser) alignment score (left, threshold 14), relationships between previously segregated proteins become apparent.

SSN VS phylogenetic trees

Although not as rigorous as traditional phylogenetic trees, SSNs typically display the same topology (Figure 2). However, the advantage of SSNs over trees is that large sequence sets (e.g. many thousands of proteins) can be analyzed much more quickly, and visualized easily using the network visualization Cytoscape (3).

Figure 2. Rooted phylogenetic tree (UPGMA) created with ClustalW (A) using the same sequence set as shown in the network in Figure 1 (B). Proteins in the tree are identified by their six character UniProt accession numbers.

Node attributes in a SSN

Besides speed, another major advantage of SSNs is the ability to include pertinent information for each individual protein (such as species, annotation, length, PDB deposition, etc.). This information is included as “node attributes” which are searchable and sortable within a sequence similarity network displayed in Cytoscape (3).

Figure 3. Representative node attributes for the example dataset as seen in a Cytoscape session.

Publication proven use of SSN

SSNs have been proven useful for examining the sequence relationships between proteins and have helped for functional assignment.

  1. To provide an overview of sequence-function relationships in a protein family
    For example, examining the number of clusters found in a SSN as the alignment score increases allows an approximation of how many distinct families are found within the sequence set, especially when known functions are mapped onto the network. From this analysis, it may be apparent that distinct families have different functions or they have the same function but evolved from different ancestors to form discrete groups. In Figure 1, it appears that there are 6 possible families, 2 grey families (“singletons”) and an individual family for each of the remaining colors. This information is useful for understanding the diversity within a group of proteins and for identifying regions of sequence space for which little to no functional information is available, indicating an excellent area for the discovery of new functions.
  2. To view a particular sequence’s relation within a larger set of sequences
    For example, if a cluster contains a protein of known function, it is possible that members of the entire cluster, including any unknown proteins of interest, share that function. This association is stronger if it remains intact at less stringent alignment scores. Likewise, if an unknown is found to be associated with a cluster of known function, but fractures at more stringent alignment scores (e.g. red and green nodes in Figure 1), it is possible that the unknown and known functions have a common partial reaction but differ in specificity. This scenario is common among functionally diverse superfamilies. In the simple network above, if the proteins in the red cluster were known to be isoprenoid synthases that catalyze elongation of shorter chain length products, then a plausible hypothesis could be that the proteins in the green cluster may catalyze elongation of medium or longer chain length products.

In either case, SSNs allow the user to quickly and easily view sequence relationships and gather information about proteins of known and unknown function. As sequence databases and needs to concatenate disparate information into a single visual aid grow, SSNs are increasingly more valuable for developing hypotheses.

Results and Discussion

Maximum Identity for Human-Chimp Alignments is 86–89%

See Tables 1 and 2 for a data summary of all 30 BLASTN experiments summarizing 1.2 million attempted alignments of 40,000 chimp sequences against four different versions of the human genome.

Overall human-chimp sequence similarity for alignable regions of the two genomes varied slightly between experiment groups regarding the usage of low-complexity sequence masking. For the unmasked set of experiments, DNA similarity varied from a low of 86.4% identity to a high of 88.9%, depending on the word size and e-value parameter combination (Table 1). The usage of unmasked sequence is an important consideration given recent research suggesting that key differences between human and chimp lie within low-complexity regions of the genomes (Polavarapu et al. 2011). For the set of experiments that employed low-complexity sequence masking for both query and subject, DNA similarity varied from a low of 86.2% identity to a high of 88.8%, depending on the word size and e-value parameter combination (Table 2). The usage of masking appeared to have a slight effect on the overall sequence similarity statistics. The most noticeable difference, however, was in computational processing time which was rapidly decreased with masking enabled (data not shown).

Table 1. BLASTN results based on the complete usage of query and subject sequence (masking disabled). Data is from 40,000 WGSS chimp trace archive reads queried against four human genome assemblies (GRCh37, GRCh36, alternate assembly, and the Celera assembly). Data for the top database hit, if it existed, was returned.

E-value ThresholdWord SizeNumber of Top Hits% Identity in Aligned BasesAverage Number Base Matches per Query SequenceAverage Number Aligned Bases per Query SequenceAverage Total Length of Query Sequence (Bases)

Overall DNA sequence similarity numbers in this study fall within the range of several earlier evolutionary publications where identities of 85–87% can be calculated for omitted data (Tomkins and Bergman 2012). Based on personal communication with NCBI staff, the 40,000 chimp sequences were considered to most likely be pre-screened and already known to be homologous to humans at some level, although this could not be verified by additional inquiries. Given the fact that under several algorithm parameter combinations (Tables 1 and 2), all 40,000 sequences had positive hits on the human genome(s), it is highly likely that the chimp query sequences were pre-screened for homology to human DNA. In addition, a large amount of data outside the aligned areas of each WGSS chimp sequence was omitted by the algorithm. Therefore, a maximum identity of about 86–89% is an extremely conservative and fair estimate. These data spectacularly confirm that on a whole-genome basis, the often-touted estimates of 98–99% similarity between humans and chimps are completely inaccurate.

Table 2. BLASTN results based on the usage of low-complexity sequence masking for both query and subject. Data is from 40,000 WGSS chimp trace archive reads queried against four human genome assemblies (GRCh37, GRCh36, alternate assembly, and the Celera assembly). Data for the top database hit, if it existed, was returned.

E-value ThresholdWord SizeNumber of Top Hits% Identity in Aligned BasesAverage Number Base Matches per Query SequenceAverage Number Aligned Bases per Query SequenceAverage Total Length of Query Sequence (Bases)

Effects of Word Size and E-value

Fig. 1. BLASTN results depicting the relationship between e-value and number of hits obtained. Maximum number of hits that could obtained were 40,000.

Fig. 2. BLASTN results depicting the relationship between e-value and average percent sequence identity.

Fig. 3. BLASTN results depicting the relationship between e-value and average alignment length.

The effect of word size was rather marked and the algorithm trends were in strong agreement with those described previously by Altschul et al. (1990). See Figs. 1, 2, and 3 for a graphical depiction showing the effects of word size, e-value, and sequence masking across all experiments.

Across all word sizes, there was clearly a general trend of computational trade-offs. As e-value became more stringent, less hits were achieved (Table 1), the percent average identity of the alignments was less (fig. 2), and the average alignment length increased (fig. 3). These effects, however, did not occur until the e-value dropped below 10. For all practical purposes, the 10 and 1000 e-values produced the same results across word sizes (Tables 1 and 2 figs. 1, 2, and 3). As mentioned previously, virtually all 40,000 sequences produced hits using liberal matching parameters indicating that the query sequences were previously screened for homology to human, an observation that was largely confirmed by information provided through email correspondence with NCBI staff. Of course, a significant trade-off is that the alignments produced with liberal matching parameters, were also much shorter in length. The usage of higher levels of stringency lengthened the alignments considerably, but also lowered the percent identity and the number of positive hits on the database.

Perhaps one of the most interesting aspects of the BLASTN query experiments was the fact that even under the conditions which produced the longest alignments, only 24% (181 bases—no masking) and 26% (191 bases—masking) on average were obtained (out of 740 bases). The most liberal parameters which produced the highest sequence identities and greatest number of hits had only 16% of the 740 bases aligning.

Default BLASTN Parameter Results

The standard recommended default parameters for BLASTN listed in the help material on the NCBI web site target several search conditions. For standard nucleotide BLAST, a default word size of 11 with an e-value of 10 is used in combination with sequence masking. The default parameters in this study (Table 1) produced a full 40,000 hits and was essentially the same as using an e-value of 1000. At these settings, average sequence identity was 87.6% for the aligned regions of each hit. Average alignment length, however, was at the short end of the spectrum at 125 out of 740 bases.

NCBI help material also recommends a word size of 7 with an e-value of 1000 (and no masking) for short or near-exact matches, typically needed for specific applications where precise target sequence is required to develop primers for PCR-based lab studies. In general, these parameters facilitated the alignment of all 40,000 sequences, produced identities of 87.2% and short alignments of 125 bases.

It should be noted that both of the above default parameter recommendations by NCBI are designed to facilitate the usage and speed of on-line searches at the NCBI web tool BLAST server ( Links to various help pages can also be accessed via the online BLAST server site.

In regards to the broad range of BLASTN experiments conducted in this study and the type of query application that they were applied to, there is limited published information available. Clearly, for future studies of this type, a comprehensive range of results can be achieved by utilizing a constant word size of about 15 in combination with e-values of 10 to 0.00001, thus reducing the number of experiments and computational resources involved.

BLAST Software Options for Genome-Wide Queries

The present study used 40,000 WGSS chimp sequences of about 740 bases on average that were queried against a database consisting of four different variants of the human genome assembly. Clearly this was a computationally intense effort that could not be performed on the NCBI BLAST web server given it’s restrictions on job size and algorithm parameter manipulation. In addition, the use of the BLAT alignment tool (BLAST-like alignment tool Kent 2002), as employed by the UCSC Genome Browser (, would have also been unsuitable for several reasons. First, the BLAT algorithm uses an indexed database that has low-complexity sequence omitted. Because BLAT does not directly compare sequence against sequence by using an indexing system and only returns highly identical hits, the current author did not seek to install it locally as a web-server and employ it for the present study. In fact, the UCSC web site makes the following statement regarding BLAT limitations on it’s “About BLAT” section of the BLAT server page (

BLAT on DNA is designed to quickly find sequences of 95% and greater similarity of length 25 bases or more. It may miss more divergent or shorter sequence alignments. It will find perfect sequence matches of 25 bases, and sometimes find them down to 20 bases.

Finally, the ability to fully exploit the many parameters available with the BLASTN algorithm and to query actual DNA sequence against DNA sequence is best accomplished with local command-line usage of the NCBI BLAST suite.

4. Methods

4.1 Protein sequence databases

We selected the genomes of Arabidopsis thaliana, Saccharomyces cerevisiae, Caenorrhabditis elegans and Drosophila melanogaster for the study. All four genomes are well-studied model organisms in eukaryotes. The complete set of Arabidopsis thaliana protein sequences for 27,288 ORFs was acquired from The Arabidopsis Information Resource (TAIR) [39]. We also obtained proteins sequences for 21,588 Caenorrhabditis elegans ORFs, 6350 Saccharomyces cerevisiae ORFs and 13,665 Drosophila melanogaster ORFs from NCBI [40]. Table 1 lists the number of ORFs for all the four genomes whose functions are annotated based on experimental evidences or sequence similarity measures for all the three functional categories.

4.2 Protein functional classification

The Gene Ontology (GO) functional classification [27] has three functional categories, i.e., biological process, molecular function and cellular component. It is not a hierarchical tree but the directed acyclic nature of the graph can be well captured in a series of numerical numbers. We have generated a numerical GO INDEX for all three classifications individually, which represents the structure of every ontology. The deepest level of index is 13. A GO Index, as denoted by numbers, e.g. 1-4-2-29, characterizes the function of every protein. The first number corresponds to the type of functional category, e.g. 1 represents biological process, 2 represents molecular function and 3 represents cellular component. The subsequent numbers correspond to subcategories describing the type of function or localization in increasing detail. The higher the GO Index level, the more specific is the functional category the protein belongs to. Table 2 shows an example of GO indices.

We assume that the functional relationship between two proteins is reflected by the number of index levels that they share. We have demonstrated the usefulness of such an assumption in our early studies for gene function prediction [28, 29]. We acquired the GO annotations for all the genes in the four genomes and for the three functional categories from GO Website [38]. A gene can (and usually does) belong to multiple indices at various levels in the graph, as proteins may be involved in multiple functions in a cell. Different indices could correspond to the same GO term as well.

Gene Ontology annotation is based on various evidences to annotate functional categories. Towards quality control, all the plots (except for Figure 6B) presented in this paper are based on the annotations with actual experimental evidences such as IDA (inferred from direct assay), IEP (inferred from expression pattern), IGI (inferred from genetic interaction), IMP (inferred from mutant phenotype), IPI (inferred from physical interaction), RCA (inferred from reviewed computational analysis) and TAS (traceable author statement). We performed some comparisons using annotations assigned purely based on computational methods such as ISS (inferred from sequence similarity) and IEA (inferred from electronic annotation), but the plots are not presented here. We have removed the functional annotations that were purely based on evidences such as ND (no biological data available) and NAS (non-traceable author statement.

4.3 Protein functional similarity

Within each family of proteins with similar sequences, functional similarity between proteins is expressed as the number of common roots shared by their functional classification other than the first level, which represents a classification of biological process, molecular function and cellular component. In the case of proteins with multiple functional assignments, the maximum indices of overlap are considered. For example, consider a gene pair ORF1 and ORF2, both annotated proteins. Assume ORF1 has a function represented by GO INDEX 1-1-3-3-4 and ORF2 has a function 1-1-3-2. When compared with each other for the level of matching GO INDEX, they match through INDEX level 1 (1-1) and level 2 (1-1-3) and will have functional similarity equal to 2. The functional similarity defined this way can assume values from 1 to 12.

We also calculate functional similarity in terms of semantic similarity between the GO functional annotation terms [30, 31]. An example of calculating the probabilities is shown in Figure 13. To calculate semantic similarity between the protein pairs, the probability of each term assigned to the gene product is first derived. For each gene in the organism, the probability is calculated by counting the number of the descendants of an assigned GO term plus 1 (the GO term itself), divided by the total number of GO term annotations in the organism. The probability of each node increases as we go towards the root of the GO ontology, which is defined as "Biological Process" (GO:0008150), "Molecular Function" (GO:0003674) or "Cellular Component" (GO:0005575) in the three Ontologies and has a probability of 1. The semantic similarity between ontology terms is defined as:

GO Biological Process sub-graph with probabilities and minimum subsumer. The numbers in parentheses denote the occurrence of the GO term and any of its descendants in the GO.

where, p ms(t1, t2) is the probability of the minimum subsumer for terms t1 and t2. The minimum subsumer for terms t1 and t2 is defined as the common parent of the deepest GO Index level shared by t1 and t2.

4.4 Protein subcellular localization

The subcellular distribution of proteins within a proteome is useful and important to a global understanding of the molecular mechanisms of a cell. Protein localization can be seen as an indicator of its function. Localization data can be used as a means of evaluating protein information inferred from other resources. Furthermore, the subcellular localization of a protein often reveals its activity mechanism. The subcellular localization information was predicted using SubLoc [32, 33, 41]. The five main subcellular localization categories as predicted by SubLoc are Cytoplasmic, Nuclear, Mitochondrial, Transmembrane, and Extracellular. The total numbers of proteins with predicted subcellular localization are 6323 in Saccharomyces cerevisiae, 27,288 in Arabidopsis thaliana, 21,588 in Caenorrhabditis elegans, and 18,498 in Drosophila melanogaster. It is worth mentioning that the subcellular localization predictions were not based on sequence similarity.

4.5 Protein sequence similarity search

The sequence similarity search was done using tools such as BLAST [23], FASTA [34, 35] and PSI-BLAST [24]. BLAST is the most widely used sequence comparison tool, particularly for genome annotation. FASTA is more sensitive in accuracy but slower than BLAST. Both FASTA and BLAST were developed for pairwise local alignment, with heuristics used. PSI-BLAST is used to identify remote homology based on iterative BLAST searches.

We compared the sequences for within as well as between genome sequence similarities. Each protein sequence was compared against the complete set of proteins for the same genome for within genome comparisons. For between genome comparisons, a pair of similar protein pair was identified using the reciprocal search method [36], i.e., the two proteins in the pair are the best hits in each other's genome from sequence search. Intra-genome sequence comparison would reflect the sequence similarity between the paralogs while the inter-genome comparison would partially highlight the orthologous sequence similarities.

To assess the significance of a sequence comparison, an expectation value or E-value can be calculated. This value represents the number of different alignments with the observed alignment score or better that are expected to occur in the database search simply by chance. The E-value is a widely accepted measure for assessing potential biological relationship, as it is an indicator of the probability for finding the match by chance. Smaller E-values represent more likelihood of having an underlying biological relationship. In this study, we will use both E-value and sequence identity as parameters to quantify sequence similarity. On the other hand, E-values depend on a number of computational factors, such as the length of the query protein and the size of search database. The issues prevent the E-value from being a reliable indicator for homology, as addressed in Fig. 1 and related discussions.

4.6 Availability

The data and results are publicly available at our website [42].

SDT_Linux & SDT_Mac (Sequence Demarcation Tool for 32 and 64 bit Linux and Mac OSX operating systems)

This a command-line version of SDTv1.0 written in Python, which runs on Linux and Mac OSX, 32 and 64 bit operating systems ).

Click here to download SDT_Linux32 → instructions (uploaded on 1/March/2014) Click here to download SDT_Linux64 → instructions (uploaded on 14/April/2014) Click here to download SDT_Mac32 → instructions (uploaded on 1/March/2014) Click here to download SDT_Mac64 → instructions (uploaded on 14/April/2014)

Calculating sequence divergence score for a protein from identity or similarity score? - Biology

2 Load the remaining aquaporins, 1rc2 , 1lda , and 1j4n . Make sure that each PDB is loaded into a new molecule. Close the Molecule File Browser window when you finished loading all four molecules. Your VMD Main menu should look like Fig. 32 when all four aquaporins are loaded.

The Multiseq window (with window name untitled.multiseq showing on top) should now be open. You may be asked to update some databases in a pop-up window if this is the first time you use MultiSeq. If this is the case, simply click Yes and wait for MultiSeq to finish downloading. When MultiSeq starts, your Multiseq window should look like Fig. 33, with a list of the four aquaporin protein structures and a list of two non-protein structures. The non-protein structures are detergent molecules used in crystallizing the aquaporin proteins, and will not be needed for structure or sequence alignment. You can tell MultiSeq to throw away molecules you are not interested in.

4 In the Multiseq window, select the 1lda_X detergent molecule by clicking on it. This will highlight the entire row of 1lda_X . Remove it from MultiSeq by pressing the delete or Backspace key on your keyboard. Do the same to remove the 1j4n_X detergent molecule.

MultiSeq uses the program STAMP (Russell et al., Proteins: Struct., Func., Gen. , 14 :309, 1992) to align protein molecules. STAMP (Structural Alignment of Multiple Proteins) is a tool for aligning protein sequences based on a three-dimensional structure. Its algorithm minimizes the distance between aligned residues of each molecule by applying globally optimal rigid-body rotations and translations. Note that you can only perform alignments on molecules that are structurally similar if you try to align proteins that have no common structures, STAMP will have no means of aligning them.

5 In the Multiseq window, select Tool Stamp Structural Alignment . This will open the Stamp Alignment Options window.

6 In the Stamp Alignment Options window, choose Align the following: All Structures and go to the bottom of the menu and press OK .

The molecules have been aligned. You can see the alignment both in the OpenGL window and in the MultiSeq window (Fig. 34). Your alignment in OpenGL window will not immediately resemble Fig. 34. When MultiSeq completes an alignment, it creates a new representation for all the aligned protein in the NewCartoon representation with the same default coloring method and hides all other representations created previously. Let's give different colors to different aquaporins to distinguish them.

7 Open your Graphical Representations window, and you should see two representations for each molecule, one on top created when VMD loaded the molecule (which is now hidden), and one on the bottom created automatically by MultiSeq. Select 0:1fqy.pdb in the Selected Molecule pull-down menu on top and highlight the bottom representation by clicking on it. Change the color for this representation by selecting ColorID 1 red for Coloring Method .

8 In the Graphical Representations window, select 1:1rc2.pdb in the Selected Molecule pull-down menu on top and highlight the bottom representation by clicking on it. Select ColorID 4 yellow for Coloring Method .

9 In the Graphical Representations window, select 2:1lda.pdb in the Selected Molecule pull-down menu on top and highlight the bottom representation by clicking on it. Select ColorID 11 purple for Coloring Method .

10 In the Graphical Representations window, select 3:1j4n.pdb in the Selected Molecule pull-down menu on top and highlight the bottom representation by clicking on it. Select ColorID 12 lime for Coloring Method . Close the Graphical Representations window.

Now your OpenGL window should look similar to Fig. 34, and you can see that the alignment was pretty good as the four aquaporin structures are very similar.

You can also get more information about the alignment in the MultiSeq window by highlighting the molecules you wish to compare.

11 In the MultiSeq window, highlight 1fqy by clicking on it. To highlight another molecule without unhighlighting 1fqy , you need to Ctrl-click (or command-click on Mac) on that molecule. Highlight 1rc2 by clicking on it while holding down the Ctrl key on the keyboard (or the command key on Mac). When both 1fqy and 1rc2 are highlighted, you should see on the lower left corner in the MultiSeq window a line of text: QH:0.6442, RMSD:2.3043, Percent Identity:30.28 as shown in Fig. 35. Note, the values you obtain might be a little different depending on if your MultiSeq database is updated, but they should be close to the ones in Fig. 35.

Figure 35: The Q , RMSD , and Percent Identity values can be used to determine how good an alignment is between two molecules, and how similar they are in structure and sequence.

The Q value is a metric for structural homology. It's an adaptation of the Q value that measures structural conservation 5 . Q=1 implies that structures are identical. When Q has a low score (0.1-0.3), structures are not aligned well, i.e., only a small fraction of the atoms superimpose. Along with RMSD and Percent Identity, these numbers tell you that the 1fqy and 1rc2 structures are pretty well aligned. You can repeat the previous step to compare the alignment of other molecules. To unselect a highlighted molecule, Ctrl-click on it again (or command-click on Mac).

12 In the MultiSeq window, choose View Coloring Qres .

13 Look at the OpenGL window to see the impact this selection has made on the coloring of the aligned molecules (Fig. 36). The blue areas indicate that the molecules are structurally conserved at those points. If there is no correspondence in structural proximities at these points, the points appear red. As you can see the -helices that form the pore are well conserved structurally among the four aquaporins, while there are more structural difference in the less functionally relevant loops.

1 In the MultiSeq window, select Tools Sequence Alignment window. In the Sequence Alignment Options window, choose ClustalW under Alignment Program and make sure the Align All Sequences option is checked, and go to the bottom of the window and select OK . Now the four aquaporins have been aligned according to their sequence using the ClustalW tool (Thompson et al., Nucl. Acids Res. , 22 :4673, 1994).

2 Let's color the aligned molecules by their sequence similarity. In the Multiseq window, choose View Coloring Sequence identity . Now each amino acid is colored according to the degree of conservation within the alignment: blue means highly conserved, whereas red means very low or no conservation. Your MultiSeq window and OpenGL window should now resemble Fig. 37.

You have now aligned the four aquaporins by their sequence and identified the conserved residues, which tend to locate inside the pore (Fig. 38). Since aquaporin facilitates water transport across the membrane, these conserved residues are most likely the ones that carry out this important function.

Figure 38: Top view of the aligned aquaporins colored by sequence conservation. The conserved residues locate mostly inside the aquaporin pore.

3 In the vmd-tutorial-files directory, find the provided FASTA sequence file spinach_aqp.fasta and open it with a text editor (Fig. 39a). A FASTA file contains a header that starts with `` ", followed by the name of the protein. In the next line is the protein sequence in one-letter amino acid code. You can create FASTA files similarly in this format. When you create a FASTA file, remember to save it in plain text, and use .fasta as the file extension. Close the text editor when you finish examining spinach_aqp.fasta .

4 In the MultiSeq window, select File Import Data. . Select From File in the Import Data window, and press the top Browse button to select the file spinach_aqp.fasta . Press OK on the bottom of the Import Data window.

You have now loaded the sequence information of a spinach aquaporin into MultiSeq (Fig. 39b). You can now preform sequence alignment on the spinach aquaporin protein with other loaded aquaporin molecules. Let's try a sequence alignment between spinach and human aquaporins.

5 Click on the checkbox on the left of spinach_aqp , and click on the checkbox on the left of 1fqy.pdb . Open the Sequence Alignment Options window by selecting Tools Sequence Alignment . Choose ClustalW as the Alignment Program and under the Multiple Alignment options on the top, check Align Marked Sequences . Go to the bottom of the window and select OK .

The sequence of spinach aquaporin is now aligned with the sequence of human aquaporin, and you can check how good the alignment is by obtaining its Q and Sequence Identity values. If you feel that the two molecules are listed too far apart in the MultiSeq window, you can move the molecules by dragging them with your mouse. Also, as you might have noticed, in MultiSeq molecules can be ``Marked'' by checking their checkboxes. They can also be ``Selected'' by highlighting them. You can align only the molecules of your choice by selecting Align Marked Sequences or Align Selected Sequences , depending if you have marked or highlighted your molecules. This option is available for both structural alignment and sequence alignment.

The structures of spinach aquaporin are actually available (Törnroth-Horsefield et al., Nature , 439 :688, 2006), but now that you have learned how to import FASTA sequence data, you can compare the sequences of proteins even if their structures are not yet published.

6 When you finish comparing the sequence of spinach aquaporin with other aquaporins, delete it by clicking on spinach_aqp and press delete or Backspace on your keyboard.

1 Align the structures again, by going to the Multiseq window and selecting Tools Stamp Structural Alignment .

2 In the Stamp Structural Alignment window, select All Structures , and keep the default values for the rest of the parameters. Press the OK button to align the structures.

3 In the Multiseq program window choose Tools Phylogenetic Tree . The Phylogenetic tree window will open.

4 Select Structural tree using Q , and press the OK button. A phylogenetic tree based on the Q values will be calculated and drawn as shown in Fig. 40. Here you can see the relationship between the four aquaporin, e.g., how the E.Coli AqpZ (1r2c) is related to human AQP1 (1fqy).

Figure 40: a) A structure-based phylogenetic tree generated by Q values. b) A sequence-based phylogenetic tree generated by ClustalW.

5 You can also construct the phylogenetic tree of the four aquaporins based on their sequence information. Close the Tree Viewer window

6 You need to perform the sequence alignment again for the four aquaporin proteins. In your MultiSeq window, choose Tools Sequence Alignment , check ClustalW under Alignment Program and make sure the Align All Sequences option is checked, and press OK .

7 In the Multiseq program window choose Tools Phylogenetic Tree to open Phylogenetic tree window again.

8 Select Sequence tree using ClustalW , and press the OK button. A phylogenetic tree based on ClustalW will be calculated and drawn as shown in Fig. 40.