How to characterise a protein family in a putative genome island?


We have sequenced the genomes of 200 bacterial strains belonging to the same species, a swine bacterial pathogen. Previous work showed that a protein family of adhesins is present in some of these strains. Each of these strains carries multiple copies at separate locations in its genome. Genetic differences have been observed among the DNA sequences of the different strains. Since some mobile genetic elements are present near these genes, it is possible that they play a role in the molecular evolution of this protein family.

How could I add further evidence that this is a putative genomic island?

Some references would also be appreciated.


UPDATE: For each strain, I thought about extracting the genomic region where these genes are located, maybe 8 kb up- and downstream of the gene, and then running a myriad of tools to try to predict some of these island-associated elements. However, assuming that these "sequence-based" tools need the entire genome sequence, is it possible to run them on the whole genome sequence and then focus on certain selected genomic regions?
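The second option in the update (predict on the whole genome, then look only at the neighbourhood of each adhesin gene) can be sketched roughly as follows. This is only an illustration: the gene coordinates, genome length and predicted island intervals are hypothetical, and the island intervals stand in for the output of whatever prediction tool is used. A linear contig with 1-based inclusive coordinates is assumed, so a gene near the origin of a circular chromosome would need wrap-around handling.

```python
def flanking_window(gene_start, gene_end, genome_length, flank=8000):
    """Return the 1-based inclusive (start, end) of a gene plus `flank` bp
    on each side, clamped to the ends of a linear sequence."""
    start = max(1, gene_start - flank)
    end = min(genome_length, gene_end + flank)
    return start, end


def overlaps(a, b):
    """True if two 1-based inclusive intervals share at least one base."""
    return a[0] <= b[1] and b[0] <= a[1]


# Hypothetical inputs: one adhesin gene and two island predictions
# obtained from a whole-genome run of some prediction tool.
adhesin = (120_500, 123_900)
predicted_islands = [(95_000, 118_000), (400_000, 430_000)]

window = flanking_window(*adhesin, genome_length=2_100_000)
# Keep only the predictions that touch the 8 kb neighbourhood of the gene.
hits = [island for island in predicted_islands if overlaps(window, island)]
print(window, hits)
```

Repeating this for every adhesin copy in every strain would give a per-strain table of which copies sit in or next to a predicted island.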

Characterising Non-Structural Protein NS4 of African Horse Sickness Virus

African horse sickness is a serious equid disease caused by the orbivirus African horse sickness virus (AHSV). The virus has ten double-stranded RNA genome segments encoding seven structural and three non-structural proteins. Recently, an additional protein was predicted to be encoded by genome segment 9 (Seg-9), which also encodes VP6, of most orbiviruses. This has since been confirmed in bluetongue virus and Great Island virus, and the non-structural protein was named NS4. In this study, in silico analysis of AHSV Seg-9 sequences revealed the existence of two main types of AHSV NS4, designated NS4-I and NS4-II, with different lengths and amino acid sequences. The AHSV NS4 coding sequences were in the +1 reading frame relative to that of VP6. Both types of AHSV NS4 were expressed in cultured mammalian cells, with sizes close to the predicted 17–20 kDa. Fluorescence microscopy of these cells revealed a dual cytoplasmic and nuclear, but not nucleolar, distribution that was very similar for NS4-I and NS4-II. Immunohistochemistry on heart, spleen, and lung tissues from AHSV-infected horses showed that NS4 occurs in microvascular endothelial cells and mononuclear phagocytes in all of these tissues, localising to both the cytoplasm and the nucleus. Interestingly, NS4 was also detected in stellate-shaped dendritic macrophage-like cells with long cytoplasmic processes in the red pulp of the spleen. Finally, nucleic acid protection assays using bacterially expressed recombinant AHSV NS4 showed that both types of AHSV NS4 bind dsDNA, but not dsRNA. Further studies will be required to determine the exact function of AHSV NS4 during viral replication.

Citation: Zwart L, Potgieter CA, Clift SJ, van Staden V (2015) Characterising Non-Structural Protein NS4 of African Horse Sickness Virus. PLoS ONE 10(4): e0124281.

Academic Editor: Jamil Saad, University of Alabama at Birmingham, UNITED STATES

Received: January 15, 2015 Accepted: March 12, 2015 Published: April 27, 2015

Copyright: © 2015 Zwart et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited

Data Availability: All new sequences have been deposited to GenBank, under the following accession numbers: KF859992, KP009629, KP009717, KF860003, KP009637, KM886360, KP009647, KP009767, KM609473, KP009659, KP009777, KM886352, KP009667, KP009789, KF860014, KP009679, KF860024, KP009689, KP009757, KF860034, KP009699, KF860044, KP009707. All previously published sequences can be accessed with the following accession numbers: AM883170, FJ196590, U19881, NC006019.

Funding: Deltamune (Pty) Ltd provided support in the form of salaries for authors (CAP), but did not have any additional role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript. The specific roles of these authors are articulated in the Author Contributions section.

Competing interests: The authors have declared that no competing interests exist.


Parasitic flatworms are responsible for a significant part of the global worm burden and are ubiquitous parasites of effectively all vertebrate species and many invertebrate groups. Over the past decade, reference and draft genomes of key fluke and tapeworm species have been produced including the causative agents of schistosomiasis, neurocysticercosis, and hydatid and alveolar echinococcosis [1,2,3,4,5,6]. Subsequently, improved assemblies and annotations have been published [7] and/or released to the public, as have RNA sequences from an increasing number of transcriptomic studies, profiling genome-wide gene expression for different life cycle stages, cell compartments, and experimental conditions [8,9,10,11]. Most recently, the diversity of draft genomes of both flatworm and roundworm helminths has been expanded, enabling broader circumscription of helminth-specific gene families and more informative comparative analyses [12]. Despite the growing number of such resources for helminths, little is yet known about their genomic architecture.

Rodent/beetle-hosted Hymenolepis species are among the principal tapeworm laboratory models as they enable access to all stages of their complex life cycle. A draft genome of the laboratory strain of the mouse bile-duct tapeworm [13], Hymenolepis microstoma, was published in 2013 [6] and updated with additional data and re-released as version 2 on WormBase ParaSite (WBP) [11] in 2015 (details of the v2 assembly are described in [8]). Here, we present the third major release of the genome: a reference quality update to the assembly that was made available to the public with the 12th release of WBP (December 2018). The genome has been assembled into full chromosomes, based on the addition of long-read sequence data to previous short-read data followed by extensive alignment, manual review, and re-assembly guided by optical mapping data. With this release, H. microstoma represents the most completely assembled genome of the lophotrochozoan superphylum.

Title: The LacI family protein GlyR3 co-regulates the celC operon and manB in Clostridium thermocellum

In this paper, we demonstrate that the GlyR3 protein mediates the regulation of manB. We first identify putative GlyR3 binding sites within or just upstream of the coding regions of manB and celT. Using an electrophoretic mobility shift assay (EMSA), we determined that a higher concentration of GlyR3 is required to effectively bind to the putative manB site in comparison to the celC site. Neither the putative celT site nor random DNA significantly binds GlyR3. While laminaribiose interfered with GlyR3 binding to the celC binding site, binding to the manB site was unaffected. In the presence of laminaribiose, in vivo transcription of the celC–glyR3–licA gene cluster increases, while manB expression is repressed, relative to its absence, consistent with the results from the EMSA. An in vitro transcription assay demonstrated that GlyR3 and laminaribiose interactions were responsible for the observed patterns of in vivo transcription.

Results and Discussion

Genetic Complement

Genome structure and general features

The genome of EAEC 042 consists of a circular chromosome of 5,241,977 bp and one plasmid, pAA, of 113,346 bp. The general features of the EAEC 042 genome are presented in Table 1. A total of 4,886 genes were identified in the chromosome, of which 100 (2%) do not have any match in the database, 556 (11%) are conserved hypothetical proteins with no known function, and only 481 (10%) seem to be mobile elements such as integrases, transposases, or phage-related genes. We have identified 78 genomic islands in EAEC 042 that are differentially distributed/represented and/or sequence diverged among the sequenced E. coli genomes; these islands are designated regions of difference (ROD) (Fig. 1, Table S1). The overall size of these RODs is 1.26 Mb (24% of the chromosome). The RODs encode virulence determinants, metabolic proteins, proteins with no obvious functions and mobile elements such as prophage and a conjugative transposon. The conjugative transposon Tn2411 (within ROD66) is highly similar to Tn21 and carries a variety of genes encoding antibiotic resistance (Fig. 2). The functional significance of these genes is discussed below. Nine prophage regions, designated 042p1–p9, were identified in the EAEC 042 genome (Table 2). Four of the prophages were lambdoid in nature (042p2, 042p3, 042p4 and 042p6) and were highly similar to each other; however, only three (042p1, 042p3 and 042p6) appeared to carry cargo genes (see Fig. S1 and File S1). The content of the remaining RODs is discussed in detail later.

From the outside in, circle 1 marks the position of regions of difference (mentioned in the text), including prophage (light pink) and fimbrial operons (dark green), as well as regions differentially present in other E. coli strains: blue (present in O157:H7 and absent/divergent in UPEC CFT073) and light green (present in O157:H7, absent/divergent in UPEC CFT073). Circle 2 shows the size in bp. Circles 3 and 4 show the position of CDSs transcribed in a clockwise and anticlockwise direction, respectively (for colour codes see below). Circles 4 to 13 show the position of E. coli 042 genes which have orthologues (by reciprocal FASTA analysis) in other E. coli strains (see methods): Sakai (O157:H7, red), UTI89 (UPEC, dark blue), CFT073 (UPEC, light blue), 536 (UPEC, orange), APEC O1 (APEC, dark pink), E2348/69 (EPEC, black), H10407 (ETEC, salmon pink), E24377A (ETEC, pale pink), HS (grey), and K-12 MG1655 (green). Circle 14 shows the position of genes unique to E. coli 042 (red). Circle 15 shows a plot of G+C content (in a 10 kb window). Circle 16 shows a plot of GC skew ([G−C]/[G+C] in a 10 kb window). Genes in circles 3 and 4 are colour coded according to the function of their gene products: dark green = membrane or surface structures, yellow = central or intermediary metabolism, cyan = degradation of macromolecules, red = information transfer/cell division, cerise = degradation of small molecules, pale blue = regulators, salmon pink = pathogenicity or adaptation, black = energy metabolism, orange = conserved hypothetical, pale green = unknown, brown = pseudogenes.
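The two innermost tracks of this figure, G+C content and GC skew, (G−C)/(G+C), per 10 kb window, are simple to compute. The following is a minimal sketch rather than the authors' pipeline; it uses non-overlapping windows (the source does not say whether its windows overlap) and a short toy sequence in place of a chromosome.

```python
def gc_windows(seq, window=10_000):
    """Return a list of (window_start, gc_content, gc_skew) tuples for
    consecutive non-overlapping windows over `seq`.

    gc_content = (G + C) / window length
    gc_skew    = (G - C) / (G + C), reported as 0.0 when the window has no G or C
    """
    seq = seq.upper()
    profile = []
    for start in range(0, len(seq), window):
        chunk = seq[start:start + window]
        g, c = chunk.count("G"), chunk.count("C")
        gc_content = (g + c) / len(chunk)
        gc_skew = (g - c) / (g + c) if (g + c) else 0.0
        profile.append((start, gc_content, gc_skew))
    return profile


# Toy 40 bp sequence: a G-rich half followed by an AT-only half.
toy = "GGGC" * 5 + "ATAT" * 5
for start, content, skew in gc_windows(toy, window=20):
    print(start, content, skew)
```

Windows whose G+C content departs markedly from the genome-wide average are one of the classical sequence-composition signals of horizontally acquired regions.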

The Tn21 element is inserted between genes lpfA and glmS and constitutes ROD 66. The presence of this locus is consistent with the phenotypic information garnered from the BioLog assays.

Table 1

Feature | Chromosome | Plasmid pAA
Size (bp) | 5,241,977 | 113,346
Predicted CDSs | 4,810 | 152
G+C content (%) | 50.56 | 49.55
Coding regions (%) | 86.6 | 80.4
Average CDS length (bp) | 943 | 629

Table 2

ROD | Position | Size (bp) | Site of insertion | Cargo genes | Position of cargo genes
ROD15 | 877116..917926 | 40811 | 15 bp direct repeat | gtrAB and Ec042-0853 encoding bactoprenol glucosyl transferase involved in O-antigen modification | 914835..917583
ROD18 | 1396781..1450149 | 53369 | Imperfect repeat | |
ROD21 | 1533876..1584276 | 50401 | 15 bp direct repeat | sitABCD encoding iron transport proteins | 1580324..1583773
ROD26 | 1763886..1811200 | 47315 | Palindromic repeat | Ec042-1724 encoding putative exonuclease | 1807070..1809541
ROD27 | 1832521..1843545 | 11025 | Intragenic in Ec042-1748 | |
ROD30 | 2214943..2256641 | 41699 | tRNA-Arg 31-mer direct repeat | |
ROD30 | 2256892..2272385 | 15494 | tRNA-Met | Ec042-2203 encoding putative exonuclease VIII | 2268605..2271046
ROD34 | 2551911..2560079 | 8169 | tRNA-Pro | Ec042-2429, similar to proQ, structural element that influences osmotic activation of proP at posttranslational level; Ec042-2430, PerC-like protein, similar to transcriptional activator of LEE in EPEC/EHEC | 2557981..2558798
ROD61 | 4236703..4247732 | 11030 | tRNA SelC(p) | |
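The position and size columns of this table can be checked against each other: reading the first row as position 877116..917926 and size 40811, each size equals end − start + 1 (inclusive coordinates). A small sketch of that check for three of the rows, with the values transcribed from the table:

```python
def span_size(position):
    """Length in bp of a 1-based inclusive 'start..end' coordinate string."""
    start, end = (int(part) for part in position.split(".."))
    return end - start + 1


# (position, reported size) pairs transcribed from the table above.
rods = {
    "ROD15": ("877116..917926", 40811),
    "ROD18": ("1396781..1450149", 53369),
    "ROD61": ("4236703..4247732", 11030),
}

for name, (position, reported) in rods.items():
    computed = span_size(position)
    status = "ok" if computed == reported else "MISMATCH"
    print(name, computed, reported, status)
```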

On the basis of nucleotide sequence homology, the plasmid pAA belongs to the IncFIIA family. The plasmid includes 152 CDS, of which 32 are pseudogenes. Of the remainder, there are 7 that encode hypothetical proteins with no match in the database, 23 encode conserved hypothetical proteins with no predicted function, 55 have transfer, replication or plasmid maintenance functions, there are 18 mobile element-derived genes that encode transposases, and the remaining 17 CDS have demonstrated or predicted roles in virulence ( Table 1 and Fig. S2). Insertions in the plasmid include genes encoding many of the well-characterised EAEC 042 virulence factors and include the cytopathic toxin Pet, the AAF/II aggregative fimbriae, the AggR transcriptional regulator, dispersin and its cognate secretion machinery Aat, and operons encoding a putative iron transport system and a polysaccharide biosynthesis pathway all of which are discussed later.

E. coli core genome and pangenome

The EAEC 042 genome is largely colinear with that of the previously sequenced E. coli genomes except for a few inversions and insertions/deletions (Fig. S3). A box-plot showing the estimated core genome size (i.e. the genes conserved in all E. coli strains), as a function of the number of genomes sequenced for 100 randomly selected strain combinations, is shown in Fig. S4. An exponential decay curve was fitted using the R function nlrq [21], and gave a predicted core genome size of 2356 genes (Table S2). This is larger than previous estimates [22], [23], possibly due to our inclusion of genes that are present but unannotated in some strains. The predicted core genome size is close to the number of genes conserved across all the genomes included in this study, suggesting that the number of possible gene deletions is close to saturation, and that further E. coli genome sequencing projects are unlikely to identify many novel gene deletions. The analysis indicates an open E. coli pangenome, as has been found in previous studies [22], [24], with an estimated 360 new genes being identified with each additional genome sequenced (Fig. S5). The E. coli core genome was further compared with the non-coli Escherichia albertii and Escherichia fergusonii, with 2173 genes found to be conserved (Table S2). Comparisons with the other available intact enterobacterial genomes showed that 967 genes were conserved across the family (Table S2).
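The sampling behind such a core genome curve can be illustrated with a toy example. The sketch below is not the authors' analysis: the strain names and gene sets are hypothetical, and it only computes the median core and pangenome sizes over random strain combinations, without the exponential decay fit that the text performs with nlrq in R.

```python
import random
from statistics import median


def core_and_pan(gene_sets):
    """Return (core size, pangenome size) for a list of gene-family sets."""
    core = set.intersection(*gene_sets)
    pan = set.union(*gene_sets)
    return len(core), len(pan)


def sample_curve(genomes, n_samples=100, seed=0):
    """For each k, draw `n_samples` random combinations of k strains and
    return {k: (median core size, median pangenome size)}."""
    rng = random.Random(seed)
    names = sorted(genomes)
    curve = {}
    for k in range(1, len(names) + 1):
        sizes = [
            core_and_pan([genomes[name] for name in rng.sample(names, k)])
            for _ in range(n_samples)
        ]
        curve[k] = (median(c for c, _ in sizes), median(p for _, p in sizes))
    return curve


# Hypothetical gene-family content of three strains.
genomes = {
    "strainA": {"g1", "g2", "g3", "g4"},
    "strainB": {"g1", "g2", "g3", "g5"},
    "strainC": {"g1", "g2", "g6"},
}

curve = sample_curve(genomes)
for k, (core, pan) in curve.items():
    print(k, core, pan)
```

With real data the core size falls and the pangenome size rises as k grows; a pangenome that keeps gaining genes with each added genome is what the text calls open.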

E. coli phylogeny

A phylogeny was constructed based on the concatenated sequences of 2173 genes that are conserved in all E. coli strains and in E. albertii and E. fergusonii, which were included as outgroup sequences. The results are shown in Fig. S6. The established E. coli sub-groups (A, B1, B2, D and E) are all monophyletic with the exception of group D, which is divided by the root. E. coli strains SECEC SMS-3-5 and IAI39 cluster with group B2, which includes many extraintestinal pathogenic E. coli strains, whereas strains EAEC 042 and UMN026 cluster with groups A, B1, E and the Shigella strains. This corresponds with the conclusions drawn in a recent MLST study, where it was proposed to classify strains such as SMS-3-5 and IAI39 in a new group F [25]. However, we prefer to designate the two groups as D1 and D2, to retain compatibility with the previous nomenclature and to follow the precedent of group B.

Metabolic Profiling

E. coli K-12 strains, such as E. coli MG1655, have been used to characterise many of the metabolic pathways we understand today. However, recent publications have described what most E. coli biologists have known for some time: owing to prolonged laboratory passage and a variety of treatments to remove λ-phage and the F plasmid, E. coli K-12 strains are not archetypal strains representing the biology of the genus [26]. The genotype of the E. coli K-12 strain MG1655 (F−, λ−, ilvG, rfb-50, rph-1) given by the E. coli Stock Center reflects only some of the differences between E. coli MG1655 and other E. coli strains. These differences extend beyond the additional virulence factors carried by pathogenic strains and include central metabolic functions carried by other E. coli strains but lost by E. coli K-12 strains [26]. To reveal a more representative metabolic profile for E. coli strains, BioLog Phenotype Microarrays (PMs) were performed on EAEC 042 and compared with similar analyses of E. coli MG1655 (Table S3 and S4). The genetic basis of the significant differences between the strains is described below. The major differences between the strains can be summarized into two main categories: resistance to antimicrobials and differences in nutrient utilization.

Antibiotic resistance/drug resistance

Soon after the discovery of the EAEC pathovar it was noted that many clinical isolates of EAEC displayed multiple antibiotic resistance [27]. Antibiotic resistance among EAEC strains is typically higher than among other diarrheagenic pathovars, perhaps accounting for the increasing isolation of EAEC from epidemiologic studies [28]. A variety of studies from geographically distinct areas have reported high levels of resistance to tetracycline, spectinomycin, streptomycin, trimethoprim-sulfamethoxazole and ampicillin [29]–[31]. The antibiotic resistance profile of E. coli 042 derived from PMs revealed resistance to sulphonamides, chloramphenicol, aminoglycosides and tetracyclines that was not exhibited by E. coli MG1655 (Table S3). This is consistent with the presence on the EAEC 042 chromosome of Tn2411, a Tn21-like transposon (Fig. 2). Tn2411 possesses genes encoding resistance to chloramphenicol (cat) and tetracycline (tetA) and also includes a class 1 integron In2 that carries antibiotic resistance cassettes aadA1 (streptomycin and spectinomycin), sulI (sulfonamide) and emrE (ethidium bromide) (Fig. 2). Interestingly, the PM data revealed there was no difference between the ability of E. coli MG1655 and EAEC 042 to grow in the presence of ethidium bromide even though E. coli MG1655 does not possess the Tn2411 element possessing emrE. This can be explained by the fact that E. coli MG1655 possesses both the emrE (prophage associated) and the predicted multidrug efflux system emrYK on the chromosome, genes that are absent in the equivalent sections of the EAEC 042 genome. In addition, the Tn2411 element also possesses genes for mercury resistance (merRTPCAD). However, the PMs do not include an assay for growth in mercuric chloride. Nevertheless, the high identity between the mer genes in the Tn2411 element on the EAEC 042 chromosome and Tn21 strongly suggests that the EAEC 042 mer resistance is functional [32].

The PMs revealed EAEC 042 is more resistant to arsenite and antimony chloride than E. coli MG1655. This phenotype is most likely the result of the E. coli MG1655 ars operon lacking arsA (coding for the catalytic subunit of the ATP-driven arsenite/antimonite pump) or arsD (the trans-acting transcriptional repressor protein) genes (Fig. S7). Previous work has shown that in the absence of the ArsA ATPase subunit, ArsB confers only partial arsenite resistance by translocating these ions into the periplasm using energy derived either from the proton pumping respiratory chain or from F0F1 ATPase [33].

The PMs revealed that E. coli MG1655 is more resistant to acriflavine than EAEC 042. However, both EAEC 042 and E. coli MG1655 possess acrAB (Ec042-0500/0501) and tolC (Ec042-3326), the gene products of which act in concert to form an efflux system that confers acriflavine resistance [34], [35]. To determine whether E. coli MG1655 was more efficient than EAEC 042 at effluxing compounds such as acriflavine, the accumulation of Hoechst 33342 was determined. The accumulation of Hoechst 33342 reached a significantly higher steady state within EAEC 042 than E. coli MG1655 (Fig. 3). The addition of the efflux pump inhibitor PAβN increased accumulation of the dye by both E. coli MG1655 and EAEC 042, although the effect was much greater for the latter (Fig. 3). This indicates that the greater amount of this compound accumulated by EAEC 042 relative to E. coli MG1655 is likely to be a result of increased permeability of EAEC 042 rather than lack of efflux activity. Similar results were obtained with the use of carbonyl cyanide-m-chlorophenyl hydrazone (CCCP), a proton motive force inhibitor which inhibits the action of other efflux systems (Fig. 3). Nevertheless, the increase in dye accumulation in the presence of PAβN and CCCP demonstrates that active efflux systems are present in EAEC 042. The higher resistance of E. coli MG1655 to acriflavine is thus most likely due to decreased uptake and is perhaps unsurprising as the related dye acridine orange was used to select K-12 derivatives lacking the F plasmid [26].

Hoechst 33342 is a substrate of the major AcrAB-TolC efflux systems. Accumulation by E. coli MG1655 and EAEC 042 was measured fluorometrically in the presence or absence of the efflux pump inhibitor PAβN and the proton-motive force inhibitor CCCP. Efflux of Hoechst 33342 is inhibited in the presence of both PAβN and CCCP in EAEC 042 and E. coli MG1655. However, Hoechst 33342 accumulates to higher levels in EAEC 042 than E. coli MG1655, suggesting EAEC 042 possesses a more permeable membrane.

To validate the genome sequence and PM data the minimum inhibitory concentration (MIC) for acriflavine and a panel of other antimicrobials was determined ( Table 3 ). The PM data was predictive of MIC although this correlation was not absolute. EAEC 042 was significantly more resistant to chloramphenicol, tetracycline, streptomycin and spectinomycin than E. coli MG1655, most likely as a result of the carriage of the specific resistance genes. There were no other significant (i.e. two dilutions or greater) differences in susceptibility to any of the other antimicrobials tested between EAEC 042 and E. coli MG1655.

Table 3

Agent | E. coli MG1655 | EAEC 042
Chloramphenicol | 4 | [missing]
Tetracycline | 2 | [missing]
Streptomycin | 2 | >128
Spectinomycin | 16 | >128
Nalidixic acid | 8 | 2
Ethidium bromide | 512 | 256

Interestingly, susceptibility testing revealed EAEC 042 was four-fold more susceptible than E. coli MG1655 to nalidixic acid. This correlated with the PM data, which indicated a reduced ability of EAEC 042 to grow in the presence of nalidixic acid compared to E. coli MG1655. PM data also demonstrated that E. coli MG1655 is more resistant than EAEC 042 to a variety of β-lactam antibiotics, rifampicin, macrolides, and some other antimicrobials (Table S4). This could be due to increased uptake of the antibiotics, as described above, or may reflect the increased ability of EAEC 042 to recruit iron, as described below.
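The "two dilutions or greater" criterion used for the MIC comparison can be made explicit. Because MICs are measured on a two-fold dilution series, the number of dilution steps between two values is the base-2 logarithm of their ratio. The sketch below applies that to the nalidixic acid and ethidium bromide rows of Table 3; the function names and the default threshold argument are illustrative choices, not something taken from the source.

```python
import math


def dilution_steps(mic_a, mic_b):
    """Two-fold dilution steps separating two MICs (positive if mic_a > mic_b)."""
    return math.log2(mic_a / mic_b)


def is_significant(mic_a, mic_b, threshold=2):
    """Apply the 'two dilutions or greater' rule quoted in the text."""
    return abs(dilution_steps(mic_a, mic_b)) >= threshold


# MICs from Table 3, given as (E. coli MG1655, EAEC 042).
nalidixic_acid = (8, 2)
ethidium_bromide = (512, 256)

print(dilution_steps(*nalidixic_acid), is_significant(*nalidixic_acid))
print(dilution_steps(*ethidium_bromide), is_significant(*ethidium_bromide))
```

The four-fold difference for nalidixic acid corresponds to two dilution steps and so meets the criterion, whereas the two-fold difference for ethidium bromide is a single step and does not.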

Iron acquisition

Iron is an essential nutrient for bacterial growth and is a major limitation to successful colonisation within the mammalian host. Mammals have high-affinity iron-binding proteins such as transferrin and lactoferrin that ensure free iron availability is maintained at extremely low levels. Counteracting these protective measures, pathogens have evolved a variety of high-affinity iron scavenging systems such as siderophore production, haem and haemoglobin uptake transporters [35]. We have reviewed differences in iron transport systems between the E. coli K-12 strain MG1655 and EAEC 042 and these are summarised in Table S5. Consistent with other pathogenic strains of E. coli, EAEC 042 possesses several additional iron-uptake systems compared to E. coli MG1655. These systems include the Shu transporter, required for the uptake of haem; the Fit siderophore system for the uptake of ferrichrome; the Yersiniabactin uptake system found on the Yersinia high pathogenicity island; the SitABCD system, which can recruit iron and manganese; a predicted bacterioferritin; and a plasmid-encoded ferric (III) citrate transporter. Previous work has demonstrated that the Yersiniabactin system is intact and functional in EAEC 042 [36]. Moreover, whilst both strains contain the efeUOB oxidase-dependent ferrous iron transporter, the E. coli MG1655 transporter is non-functional due to a frameshift mutation in efeU [37]. To determine whether EAEC 042 was more efficient at sequestering iron, we measured the total iron content of mid-exponential phase cultures of EAEC 042 and E. coli MG1655 strains grown in Neidhardt's Rich Defined Medium. We found that E. coli 042 contained 13.7 ± 0.6 pmoles iron/mg protein compared to 8.15 ± 0.976 pmoles iron/mg protein in E. coli MG1655. Thus, E. coli 042 contains 1.68-fold more iron than E. coli MG1655, a finding consistent with other pathogenic strains of E. coli (M. Goldberg, unpublished).

Recent studies by Goldberg and Lund (personal communication) have shown that the enhanced ability of pathogenic strains to scavenge iron can prove disadvantageous under oxidising conditions. Oxidative stress damages iron-binding proteins, which results in an increase in free iron levels in the cytoplasm, triggering Fenton reactions [38], [39]. The enhanced free-radical production resulting from this reaction can overwhelm the cells' free-radical scavenging systems, causing further damage to iron-binding proteins including the Fur repressor, leading to increased iron uptake and even greater oxidative damage. Cells containing larger amounts of intracellular iron are therefore more likely to succumb to oxidative stress than cells containing less. We found that E. coli 042 was more susceptible to the redox-cycling compound menadione (MIC = 1 mg/ml) than E. coli MG1655 (MIC = 2 mg/ml), indicating EAEC 042 is more susceptible to oxidative stress than E. coli MG1655. The PM data indicated that E. coli 042 is more susceptible to gyrase inhibitors (e.g. nalidixic acid) and β-lactam antibiotics (e.g. oxacillin, phenethicillin) than E. coli MG1655 (Table S4 and Table 3). The ability of EAEC 042 to sequester higher levels of iron, in conjunction with the increased susceptibility to the oxidising reagent menadione, strongly suggests the increased susceptibility of EAEC 042 to these antibiotics is due to the triggering of Fenton reactions in a manner previously described by Kohanski and colleagues [39].

Carbon source utilization

Bacteria require a sufficient supply of carbon to feed their metabolic pathways. In their native environments heterotrophic organisms encounter limited amounts of complex mixtures of carbon sources that are often present at low concentrations. As a result microbial cells have developed multiple different systems to utilise a wide array of different substrates as carbon sources. Such differences are utilised in diagnostic tests to differentiate between particular species and strains of bacteria. PMs for sole carbon source utilization showed that EAEC 042 can utilise 2-Deoxy-D-Ribose more effectively than E. coli MG1655 (Table S5). Previous reports have shown that E. coli K-12 cannot use this carbon source, but Salmonella enterica serovar Typhimurium can, using the deoXKPQ gene products [40]. EAEC 042 possesses homologues of the S. Typhimurium deoXKPQ (Ec042-4753 onwards).

N-acetyl-D-galactosamine and N-acetyl-D-glucosamine are components of intestinal mucin, as well as peptidoglycan. The PM screening demonstrated that EAEC 042 utilises N-acetyl-D-galactosamine and N-acetyl-D-glucosamine better than E. coli MG1655 (Table S3), confirming previous work showing that K-12 strains are unable to use these substrates as sole carbon sources [41]. In E. coli O157:H7 the genes for N-acetyl-D-galactosamine and N-acetyl-D-glucosamine utilisation were identified as the agaZVWEFASYBCDI gene cluster. In E. coli MG1655 a portion of this locus is missing due to site-specific recombination between agaW and agaA [42], however this locus is present and intact in EAEC 042 (Fig. S8). This observation is consistent with the recent demonstration that EAEC 042 can utilise intestinal mucin as a carbon source [43].

L-sorbose utilization by pathogenic E. coli and Shigella differs between strains [44], [45]. The PM data show that EAEC 042 is significantly better at utilising L-sorbose than E. coli MG1655 (Table S3). Genome comparison between the two strains (Fig. S9) shows that EAEC 042, like other E. coli and Shigella pathotypes, carries the sorEMABFDC operon (Ec042-4384 onwards), located between ybiC and rluF. E. coli MG1655 does not carry this operon. BLAST analysis of the CDS in this region in EAEC 042 confirmed that the genes have high identity to previously described sor genes and thus are presumed to be functional.

Conversely, PM data showed that E. coli MG1655 grows and metabolizes D-Serine, Mucic acid (D-galactarate), β-D-Allose and D-Xylose more effectively than EAEC 042 (Table S4). A previous report has highlighted extensive genomic variability in the argW-dsdCXA genomic island in E. coli strains [46]. E. coli MG1655 has the dsdCXA gene cluster that codes for the ability to utilise D-serine, whereas EAEC 042 lacks these genes which accounts for the metabolic difference in serine utilization between the strains. Alignment of the gar operon from E. coli MG1655 which encodes the galactarate metabolic operon [47], [48] with the equivalent region of the EAEC 042 genome shows that the single ORF in E. coli MG1655 encoding the D-galactarate dehydrogenase enzyme (garD) is two CDS in EAEC 042, suggesting that enzyme function and hence metabolism of D-galactarate will have been disrupted in EAEC 042 by a mutation in this gene (Fig. S10). EAEC 042 is also defective in D-Allose metabolism, compared to E. coli MG1655 in the PMs. Alignment of the genomes centred on the als operon (Fig. S11) [49], [50] shows complete absence of the als operon (rpiB, rpiR, alsBACE and K) in EAEC 042. EAEC 042 also shows a lower level of metabolism compared to E. coli MG1655 when xylose is used as a sole carbon source. This can be explained by the absence from EAEC 042 of the xylE gene encoding the Major Facilitator Superfamily low-affinity xylose proton symporter. There is a second xylose uptake system, an ABC transporter (xylFGH) present in both strains which has been reported to be the dominant xylose transport system under both aerobic and anaerobic conditions [51] and functionality of this system leading to a reduced, but nonetheless effective uptake of xylose is consistent with the metabolic differences between E. coli MG1655 and EAEC 042 when xylose is the sole carbon source.

Many of the phenotypic differences between EAEC 042 and E. coli MG1655, which were observed in the PMs, do not have an easily identifiable genetic basis. However, the increased ability of EAEC 042 to take up certain compounds, as measured in the PAβN experiments (Fig. 3), may explain why EAEC 042 is capable of metabolizing several compounds that E. coli MG1655 is not. The normal pathway for uptake of small molecules is via porins [52], [53]. Porins act as molecular sieves to allow passive diffusion of low molecular weight solutes (<600 Da) into the cell. Although structurally similar, the different porins have differing pore sizes, ionic selectivity and expression profiles, allowing the bacterium to adapt to variable environments [54]–[56]. In contrast to E. coli MG1655, which possesses four porin genes (ompF, ompC, phoE and ompN) and one pseudogene (nmpC/ompD), EAEC 042 possesses six intact genes encoding porins. Like E. coli MG1655, EAEC 042 possesses ompF (Ec042-1020), ompC (Ec042-2456), phoE (Ec042-0302) and ompN (Ec042-1523), but it also possesses an apparently functional ompD (Ec042-1601) and an additional phylogenetically distinct porin (Ec042-2121) that is differentially represented amongst pathogenic E. coli but whose precise function is unknown (Fig. 4 and Fig. S12). In addition to the important physiological roles played by porins, these molecules are under constant selective pressure due to their recognition by phages, colicins and the immune system. Indeed, OmpD from S. enterica Typhimurium was recently shown to be a key target of a protective T-independent antibody response, and its universal presence amongst non-typhoidal Salmonella suggests it plays an important role in the ability of enteric organisms, such as EAEC 042, to persist in the intestine and interact with the host [57].

Non-coding sequence evolution

NcRNA sequence evolution

The availability of complete sequence from 12 Drosophila genomes, combined with the tractability of RNA structure predictions, offers the exciting opportunity to connect patterns of sequence evolution directly with structural and functional constraints at the molecular level. We tested models of RNA evolution focusing on specific ncRNA gene classes in addition to inferring patterns of sequence evolution using more general datasets that are based on predicted intronic RNA structures.

The exquisite simplicity of miRNAs and their shared stem-loop structure makes these ncRNAs particularly amenable to evolutionary analysis. Most miRNAs are highly conserved within the Drosophila genus: for the 71 previously described miRNA genes inferred to be present in the common ancestor of these 12 species, mature miRNA sequences are nearly invariant. However, we do find a small number of substitutions and a single deletion in mature miRNA sequences (Supplementary Table 14), which may have functional consequences for miRNA–target interactions and may ultimately help identify targets through sequence covariation. Pre-miRNA sequences are also highly conserved, evolving at about 10% of the rate of synonymous sites.

To link patterns of evolution with structural constraints, we inferred ancestral pre-miRNA sequences and deduced secondary structures at each ancestral node on the phylogeny (Supplementary Information section 12.1). Although conserved miRNA genes show little structural change (little change in free energy), the five melanogaster group-specific miRNA genes (miR-303 and the mir-310/311/312/313 cluster) have undergone numerous changes across the entire pre-miRNA sequence, including the ordinarily invariant mature miRNA. Patterns of polymorphism and divergence in these lineage-specific miRNA genes, including a high frequency of derived mutations, are suggestive of positive selection (ref. 140). Although lineage-specific miRNAs may evolve under less constraint because they have fewer target transcripts in the genome, it is also possible that recent integration into regulatory networks causes accelerated rates of miRNA evolution.

We further investigated patterns of sequence evolution for the subset of 38 conserved pre-miRNAs with mature miRNA sequences at their 3′ end by calculating evolutionary rates in distinct site classes (Fig. 6, and Supplementary Information section 12.2). Outside the mature miRNA and its complementary sequence, loops had the highest rate of evolution, followed by unpaired sites, with paired sites having the lowest rate of evolution. Inside the mature miRNA, unpaired sites evolve more slowly than paired sites, whereas the opposite is true for the sequence complementary to the mature miRNA. Surprisingly, a large fraction of unpaired bulges or internal loops in the mature miRNA seem to be conserved—a pattern which may have implications for models of miRNA biogenesis and the degree of mismatch allowed in miRNA–target prediction methods. Overall these results support the qualitative model proposed in ref. 141 for the canonical progression of miRNA evolution, and show that functional constraints on the miRNA itself supersede structural constraints imposed by maintenance of the hairpin-loop.

Bootstrap distributions of miRNA substitution rates. Structural alignments of miRNA precursor hairpins were partitioned into six site-classes (inset): (1) hairpin loops; unpaired sites (2) outside, (3) in the complementary region of, and (4) inside the miRNA; and base pairs (5) adjacent to and (6) involving the miRNA. Whiskers show approximate 95% confidence intervals for median differences; boxes show interquartile range.

To assess constraint on stem regions of RNA structures more generally, we compared substitution rates in stems (S) to those in nominally unconstrained loop regions (L) in a wide variety of ncRNAs (Supplementary Information section 12.3). We estimated substitution rates using a maximum likelihood framework, and compared the observed L/S ratio with the average L/S ratio estimated from published secondary structures in RFAM, which we normalized to 1.0. L/S ratios for Drosophila ncRNA families range from a highly constrained 2.57 for the nuclear RNase P family to 0.56 for the 5S ribosomal RNA (Supplementary Table 15).
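The normalisation described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the authors' maximum likelihood pipeline: the function name, the parameter names and the numeric values are all hypothetical, and the substitution rates are assumed to have been estimated elsewhere.

```python
def normalised_ls_ratio(loop_rate, stem_rate, reference_ls):
    """Return the loop/stem (L/S) rate ratio scaled by a reference L/S ratio.

    A value of 1.0 means the family shows the same relative stem constraint
    as the reference set. All inputs are assumed to be precomputed
    substitution rates (hypothetical values in the example below).
    """
    if stem_rate <= 0 or reference_ls <= 0:
        raise ValueError("stem_rate and reference_ls must be positive")
    return (loop_rate / stem_rate) / reference_ls


# Hypothetical example: loops evolve 4x faster than stems in this family,
# while the reference set has an average L/S ratio of 2.0.
example_ratio = normalised_ls_ratio(loop_rate=0.5, stem_rate=0.125, reference_ls=2.0)
print(example_ratio)
```

Under this scaling, a value above 1.0 indicates stems that are more constrained relative to loops than in the reference set, and a value below 1.0 indicates the opposite.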

Finally, we predicted a set of conserved intronic RNA structures and analysed patterns of compensatory nucleotide substitution in D. melanogaster, D. yakuba, D. ananassae, D. pseudoobscura, D. virilis and D. mojavensis (Supplementary Information section 13). Signatures of compensatory evolution in RNA helices are detected as covarying nucleotide sites or ‘covariations’ (that is, two Watson–Crick bases that interact in species A replaced by a different Watson–Crick pair in species B). The number of covariations (per base pair of a helix) depends on the physical distance between the interacting nucleotides (Supplementary Fig. 9), as has been observed for the RNA helices in the Drosophila bicoid 3′ UTR region (ref. 142). Short-range pairings exhibit a higher average number of covariations with a larger variance among helices than longer-range pairings. The decrease in rate of covariation with increasing distance may be explained by physical properties of a helix, which may impose selective constraints on the evolution of covarying nucleotides within a helix. Alternatively, if individual mutations at each locus are deleterious but compensated by mutations at a second locus, given sufficiently strong selection against the first deleterious mutation these epistatic fitness interactions could generate the observed distance effect (ref. 143).
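The definition of a covariation given above can be made concrete with a minimal Python sketch. It assumes that the paired positions of a helix have already been extracted from a structural alignment as (5′ base, 3′ base) tuples for each species; the function names and the example helices are hypothetical, and G–U wobble pairs are deliberately excluded because the definition above refers to Watson–Crick pairs only.

```python
# Strict Watson-Crick pairs in RNA; G-U wobble pairs are not counted here.
WATSON_CRICK = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}


def is_covariation(pair_a, pair_b):
    """True if a Watson-Crick pair in species A is replaced by a
    different Watson-Crick pair in species B."""
    return pair_a in WATSON_CRICK and pair_b in WATSON_CRICK and pair_a != pair_b


def covariations_per_base_pair(helix_a, helix_b):
    """Number of covariations divided by the number of base pairs in the helix.

    helix_a and helix_b are equal-length lists of (5' base, 3' base) tuples
    taken from the same aligned helix in two species.
    """
    if len(helix_a) != len(helix_b) or not helix_a:
        raise ValueError("helices must be non-empty and of equal length")
    count = sum(is_covariation(a, b) for a, b in zip(helix_a, helix_b))
    return count / len(helix_a)


# Hypothetical four-base-pair helix in two species.
species_a = [("G", "C"), ("A", "U"), ("C", "G"), ("A", "U")]
species_b = [("G", "C"), ("G", "C"), ("C", "G"), ("A", "C")]
print(covariations_per_base_pair(species_a, species_b))
```

In this toy helix only the second position counts: the fourth position changes on one side only and no longer forms a Watson–Crick pair, so it is a mismatch rather than a covariation.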

Evolution of cis-regulatory DNAs

Comparative analyses of cis-regulatory sequences may provide insights into the evolutionary forces acting on regulatory components of genes, shed light on the constraints of the cis-regulatory code and aid in annotation of new regulatory sequences. Here we rely on two recently compiled databases, and present results comparing cis-regulatory modules (ref. 144) and transcription factor binding sites (derived from DNase I footprints; ref. 145) between D. melanogaster and D. simulans (Supplementary Information section 8). We estimated mean selective constraint (C, the fraction of mutations removed by natural selection) relative to the ‘fastest evolving intron’ sites at the 5′ end of short introns, which represent putatively unconstrained neutral standards (Supplementary Information section 8.2; ref. 146). Note that this approach ignores the contribution of positively selected sites, potentially underestimating the fraction of functionally relevant sites (ref. 147).
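As a rough illustration of how a constraint value of this kind can be computed, the sketch below uses the common formulation C = 1 − O/E, where O is the number of substitutions observed in the sequence class of interest and E is the number expected if those sites evolved at the neutral rate. This formulation, the function name and the numbers are assumptions made for illustration; the exact estimator used in the study is described in its Supplementary Information and may differ in detail.

```python
def selective_constraint(observed_subs, n_sites, neutral_rate):
    """Estimate constraint as C = 1 - O/E.

    observed_subs: substitutions observed in the focal sites (O)
    n_sites:       number of focal sites compared between the two species
    neutral_rate:  substitutions per site at the neutral standard
                   (for example, fast-evolving short-intron sites)
    """
    if n_sites <= 0 or neutral_rate <= 0:
        raise ValueError("n_sites and neutral_rate must be positive")
    expected_subs = neutral_rate * n_sites  # E under neutrality
    return 1.0 - observed_subs / expected_subs


# Hypothetical numbers: 25 substitutions observed across 800 sites where
# 100 would be expected at a neutral rate of 0.125 per site.
print(selective_constraint(observed_subs=25, n_sites=800, neutral_rate=0.125))
```

A value of 0 corresponds to sites evolving at the neutral rate and a value of 1 to sites at which every mutation is removed. Negative values are possible when focal sites evolve faster than the neutral standard, which is one way to see why positively selected sites are not captured by this measure.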

Consistent with previous findings, Drosophila cis-regulatory sequences are highly constrained (refs 148, 149). Mean constraint within cis-regulatory modules is 0.643 (95% bootstrap confidence interval = 0.621–0.662) and within footprints is 0.692 (0.655–0.723), both of which are significantly higher than mean constraint in non-coding DNA overall (0.555 (0.546–0.563)) and significantly lower than constraint at non-degenerate coding sites (0.862 (0.856–0.868)) and ncRNA genes (0.864 (0.846–0.880)) (Supplementary Fig. 10). The high level of constraint in cis-regulatory sequences also extends into flanking sequences, only declining to constraint levels typical of non-coding DNA 40 bp away. This is consistent with previous findings that transcription factor binding sites tend to be found in larger blocks of constraint that cluster to form cis-regulatory modules (ref. 150). To understand selective constraints on nucleotides within cis-regulatory sequences that have direct contact with transcription factors, we estimated the selective constraint for the best match to position weight matrices within each footprint (ref. 151); core motifs in transcription-factor-binding sites have a mean constraint of 0.773 (0.729–0.814), significantly greater than the mean for the footprints as a whole, and approaching the level of constraint found at non-degenerate coding sites and in ncRNA genes (Supplementary Fig. 10).

We next examined the variation in selective constraint across cis-regulatory sequences. Surprisingly, we find no evidence that selective constraint is correlated with predicted transcription-factor-binding strength (estimated as the position weight matrix score P-value) (Spearman’s r = 0.0681, P = 0.0609). We observe significant variation in constraint both among target genes (Kruskal–Wallis tests; footprints, P < 0.0001; and position weight matrix matches within footprints, P = 0.0023) and among chromosomes (cis-regulatory modules, P = 0.0186; footprints, P = 0.0388; and position weight matrix matches within footprints, P = 0.0108; Supplementary Table 16).


Integrative Conjugative Elements (ICEs) carry functional modules for conjugative transfer, chromosomal integration and control of expression of ICE genes [1]. ICEs are maintained in their host via site-specific integration and establishment at a unique site or sites in their host [2-7]. ICEs have been discovered in the genomes of various low G+C Gram-positive bacteria, various α-, β- and γ-Proteobacteria, and Bacteroides species [8]. The first ICE found was Tn916 from Enterococcus faecalis [8].

One of the best-studied models of ICEs is a family of elements called the R391/SXT family, found in γ-Proteobacteria. More than 25 of these elements have been found to date in organisms from across the world. They share a common core scaffold of genes related to integration, excision, transfer and regulation. Different elements can possess different fitness determinants such as antibiotic resistances, heavy metal resistances, and error-prone DNA repair systems [9].

Tn4371 is a 55-kb ICE that allows its host to degrade biphenyl and 4-chlorobiphenyl. It was isolated after mating between Cupriavidus oxalaticus (Ralstonia oxalatica) A5 carrying the broad-host-range conjugative plasmid RP4 and Cupriavidus metallidurans (Ralstonia metallidurans) CH34. Selection was applied for transconjugants that expressed the heavy metal resistances from CH34 and grew with biphenyl as a sole source of carbon and energy [10]. The transconjugants carried an RP4 plasmid with a 55-kb insert near its tetracycline resistance operon. The insert was shown to transpose to other locations and hence was called Tn4371 [10-12]. Tn4371 has been sequenced [13] and closely related elements have been found in the genome sequences of a number of bacteria including Ralstonia solanacearum GMI1000, a phytopathogen from French Guiana [14], Cupriavidus metallidurans CH34, a heavy metal resistant bacterium from Belgium [15], Erwinia chrysanthemi 3937, a phytopathogen [16] and Azotobacter vinelandii AvOP, a nitrogen-fixing bacterium isolated from soil in the USA [13,17]. None of these other elements possessed the biphenyl and 4-chlorobiphenyl degradation genes.

The Tn4371-like ICEs characterised to date are mosaic in structure, consisting of Ti-RP4-like transfer systems, an integrase region, plasmid maintenance genes and accessory genes [13]. All the characterised elements integrate into sites on the bacterial genomes with a conserved 5'-TTTTTCAT-3' sequence, termed the attB site [11]. Tn4371 transposition most likely involves a site-specific integration/excision process, since the ends of the element can be detected covalently linked as a transfer intermediate [11,13]. Integration is catalysed by a tyrosine-based site-specific recombinase related to bacteriophage and ICE family integrases [18].
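Candidate attB sites of this kind can be located with a simple string search. The Python sketch below scans both strands of a DNA sequence for the conserved 5'-TTTTTCAT-3' motif; the function names and the toy sequence are hypothetical, and the search is exact-match only. Because an 8-bp motif is expected to occur many times by chance in a bacterial genome, hits are only candidates and need supporting evidence such as a neighbouring integrase gene.

```python
ATTB_MOTIF = "TTTTTCAT"  # conserved attB sequence reported for Tn4371-like ICEs
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_motif(sequence, motif=ATTB_MOTIF):
    """Return sorted (start, strand) tuples for exact motif matches.

    Positions are 0-based and refer to the forward strand; '-' hits are
    places where the reverse complement of the motif occurs.
    """
    sequence = sequence.upper()
    hits = []
    for strand, pattern in (("+", motif), ("-", reverse_complement(motif))):
        start = sequence.find(pattern)
        while start != -1:
            hits.append((start, strand))
            start = sequence.find(pattern, start + 1)  # allow overlapping hits
    return sorted(hits)


# Hypothetical toy sequence with one hit on each strand.
toy_sequence = "GGTTTTTCATGGATGAAAAACC"
print(find_motif(toy_sequence))
```

For a real genome, the sequence string would first be read from a FASTA file, and each contig or replicon would be searched separately.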

A small number of putative ICEs have been discovered following sequence analyses of genomes of various low G+C Gram-positive bacteria [19], various α, β- and γ-Proteobacteria [20-22], and Bacteroides species [23].

We now report the discovery and comparative analysis of a number of novel uncharacterised Tn4371-like ICEs from several different bacterial species. These elements are also mosaics of plasmid and other genes and possess a common scaffold with apparent hotspots containing insertions of different, presumably adaptive, genes. Using sequences from the common scaffold, a PCR method was developed to discover and characterise new Tn4371-like ICEs in different bacteria. Here we report on the use of this method to discover and characterise two new Tn4371-like ICEs in Ralstonia pickettii strains isolated from a purified water system. Furthermore, we propose a uniform nomenclature for newly discovered ICEs of the Tn4371 family.


Genomic sampling of mosquito diversity has greatly improved in recent years and looks set to take advantage of emerging technologies to explore even further.

Evolutionary genomics analyses have unveiled dynamic patterns of gene and genome evolution likely linked to mosquito adaptability that will guide future research and control efforts.

Functional genomics assays have helped to characterise biological roles of thousands of genes, albeit with condition-patchy and species-biased coverage that is now starting to be remedied.

Comparative genomics approaches are increasingly being applied to contextualise and enhance the interpretation of results from multispecies studies with an evolutionary perspective.

These trends mean that effective data sharing will be critical to facilitate future integrative meta-analyses and fully harness the benefits of combined evolutionary and functional analyses.

Mosquitoes are widely despised for their exasperating buzzing and irritating bites, and more poignantly because, during blood-feeding, females may transmit pathogens that cause devastating diseases. However, the ability to transmit such viruses, filarial worms, or malaria parasites varies greatly amongst the ∼3500 recognised mosquito species. Applying omics technologies to sample this diversity and explore the biology underlying these variations is bringing increasingly greater resolution that enhances our understanding of mosquito evolution. Here we review the current status of mosquito omics, or ‘mozomics’, resources and recent advances in their applications to characterise mosquito biology and evolution, with a focus on the intersection of evolutionary and functional genomics to understand the putative links between gene and genome dynamism and mosquito diversity.

Genome-wide analysis of Hsp70 and Hsp100 gene families in Ziziphus jujuba

The Ziziphus species are naturally tolerant to a range of abiotic stresses. Therefore, it is expected that they are an enriched source of genes conferring stress tolerance. Heat shock proteins (Hsps) play a significant role in plants in imparting tolerance against abiotic stress conditions. To get an insight into potential Hsp function in Ziziphus, we performed a genome-wide analysis and expression study of Hsp70 and Hsp100 gene families in Ziziphus jujuba. We identified 21 and 6 genes of the ZjHsp70 and ZjHsp100 families, respectively. Physicochemical properties, chromosomal location, gene structure, motifs, and protein domain organization were analysed for structural and functional characterization. We identified the contribution of tandem and segmental gene duplications to the expansion of ZjHsp70s and ZjHsp100s in Z. jujuba. Promoter analysis suggested that ZjHsp70s and ZjHsp100s perform diverse functions related to abiotic stress. Furthermore, expression analyses revealed that most of the Z. jujuba Hsp genes are differentially expressed in response to heat, drought, and salinity stress. Our analyses suggested that ZjHsp70-3, ZjHsp70-5, ZjHsp70-6, ZjHsp70-16, ZjHsp70-17, ZjHsp70-20, ZjHsp100-1, ZjHsp100-2, and ZjHsp100-3 are potential candidates for further functional analysis and for breeding new, more resilient strains. The present analysis lays the foundation for understanding the molecular mechanism by which the Hsp70 and Hsp100 gene families regulate abiotic stress tolerance in Z. jujuba.


Amiloride effects and the Wieczorek model

The Drosophila Malpighian tubule is exquisitely sensitive to agents hypothesised to affect the components of the Wieczorek model for insect epithelia, namely the apical V-ATPase and associated exchanger. However, while bafilomycin is agreed to be a selective inhibitor of V-ATPase, amiloride could target a range of molecules on apical or basal surfaces. Here, we show that the Malpighian tubules are blocked by a range of amiloride derivatives in a pattern characteristic of exchangers, rather than channels. The order of inhibition of fluid secretion in Malpighian tubules is EIPA≫2,4-dichloro-benzamil>DMA>amiloride≈benzamil (Fig. 1; Table 1). These results are consistent with those recently obtained in Aedes aegypti (Petzel, 2000), increasing our confidence that amiloride targets NHEs in insect Malpighian tubules. This is also consistent with our RT-PCR data showing that Drosophila NHE genes, but not ENaC genes (Fig. 9, see also Fig. 2), are expressed in Malpighian tubules. Our failure to identify ENaCs in Malpighian tubules is consistent with electrophysiological analysis, which showed that amiloride inhibited transepithelial Na+ secretion in Aedes aegypti Malpighian tubules without any effect on transepithelial and fractional membrane resistance (Hegarty et al., 1992).

The Drosophila NHE family

This paper describes three genes that appear to encode the Drosophila members of the NHE gene family. Their protein sequences are quite different from the protein sequences of the other members of the family, but there is sufficient similarity to the other NHEs to assign them unambiguously to this group of proteins (Fig. 6). More specifically, DmNHE1 appears to be very similar to a novel human NHE, KIAA0939, which has been found in kidney (IMAGE 3134373) and brain (GenBank accession number AB023156) (Nagase et al., 1999). DmNHE2 is most similar to two invertebrate NHEs, the NHE found in Carcinus maenas and the newly described NHE3 in Aedes aegypti (GenBank accession number AF80554; S. S. Gill, H. Wediak and L. S. Ross, unpublished). DmNHE3 sits near human mitochondrial NHE6 (although DmNHE3 encodes no mitochondrial targeting sequences) and also close to Arabidopsis thaliana and yeast genes.

The three DmNHEs described above are predicted to be plasma membrane integral proteins with 10–12 transmembrane domains, just like the other members of the family (Fig. 6). All the Drosophila NHEs have a putative signal peptide and a possible cleavage site. This is similar to the position in mammalian NHEs, although it is not certain whether the signal peptide is ever cleaved (Zizak et al., 2000; see Shrode et al., 1998 and Wakabayashi et al., 2000). The presence of distinct messages for DmNHE2, encoding peptides with differing C-terminal domains, has interesting implications for control of the exchanger.

In principle, the elucidation of genes in Drosophila would allow the reverse genetic analysis of their function in mutants. However, there are no candidate P-element insertions documented at any of the three loci. The nearest mutation is an insertion, 2 kb beyond the 3′ end of DmNHE3, that generates a lethal recessive phenotype. However, this insertion is at the 5′ end of a novel gene (CG11329), and so the lethality is probably attributable to the latter locus.

Are any of these genes candidates for the Wieczorek exchanger? Their relative dissimilarity to cardinal vertebrate NHEs (Fig. 8) would allow them to be ascribed different functional properties. For example, DmNHE2 sits in a branch of the similarity tree with only invertebrate representatives and so would be a strong candidate. Our data show that, in Drosophila, all three exchangers are widely expressed (Fig. 9) and are certainly present in a relevant epithelium (the Malpighian tubule). However, the same general expression pattern would argue against a specialised role in transporting epithelia only, and our pharmacological analysis (Fig. 1) does not distinguish between an apical or basolateral localisation. Recent electrophysiological evidence suggests that amiloride may be acting at the basolateral membrane of Aedes aegypti Malpighian tubules (Petzel, 2000), and our results cannot be taken to contradict this view. In insects, it may have been naïve to assume that sensitivity to bafilomycin and amiloride is sufficient proof that an epithelium conforms to the Wieczorek model. However, whether DmNHE1, DmNHE2 or DmNHE3 transpires to be the elusive apical exchanger, or a vital part of the cell’s ion-regulatory machinery, the description of this gene family in a genetic model organism should be useful.

Sensitivity of secretion by the Drosophila melanogaster Malpighian tubule to inhibition by amiloride and its derivatives. Dose/response curves for amiloride, 5-N,N-dimethyl amiloride (DMA), benzamil, 5-N-ethyl-N-isopropyl amiloride (EIPA) and 2′,4′-dichlorobenzamil (DCB). The upper limits of each graph are determined by the solubility of the compounds. Values are means ± s.e.m. (N=10).


Malpighian tubules do not express epithelial Na+ channels (ENaCs). (A) Phylogenetic tree of all Drosophila ENaCs identified by BLASTP search using human amiloride-sensitive cation channel 2, neuronal hBNaC2 (GenBank accession number NP 064423) protein sequence as a probe. (B) RT-PCR for putative ENaCs (left-hand panels) and corresponding Southern blots with probes specific to each gene (right-hand panels). Labels refer to known genes or to Gadfly-predicted genes. Size markers denote expected sizes from genomic (black arrows) and cDNA (white arrows) templates. The ladder is a Gibco BRL 1 kb ladder. The templates are as follows: Genomic, genomic DNA; Whole fly, whole-fly cDNA; Head, head cDNA; Tubule, Malpighian tubule cDNA; No template, no template (negative control).
