Finding exons in DNA problem


My attempt: I looked for the TACs because I thought this would be AUG in mRNA and ultimately Methionine (the start codon). But apparently, that's not how you do this problem. I'm confused because the answer (shown in red) doesn't have any TACs and the boxes seem to start at random places. I am also confused as to why my professor drew a box around both strands. I thought only one strand at a time was turned into mRNA?

UPDATE: It has occurred to me that the entire strand is transcribed and then the introns are removed. So I know why not to look for the TACs. But now, how can I identify the exons?


After an RNA has been transcribed in eukaryotes, it is spliced before it leaves the nucleus. This means that parts of the RNA (called introns) are removed; the ends are also modified, with a cap at the 5' end and a poly-A tail at the 3' end. The parts left over in the mature mRNA after removing the introns are called exons.

The mRNA does not have to start with a start codon. There can be sequences before and after the bit which actually gets translated.

Yes, only one strand of DNA is transcribed into RNA. When you look at the mature mRNA below, read the first few bases and try to find their complements in the original DNA because this is where they were transcribed from. UCAUG is transcribed from the DNA AGTAC (with TCATG on the opposite side).

Now you look for the end of the mRNA. Bear in mind that the run of A's is the poly-A tail, which is added during RNA processing (it is not transcribed from the DNA) and influences how long the mRNA will persist in the cytoplasm. So you look for CUAGG in the original DNA, which must be transcribed from GATCC (with CTAGG on the opposite strand).

As you can see, that's exactly where the two red boxes start and end.

Now since the mRNA you see here is mature (indicated by the poly-A tail and the 5'-cap), that means introns have already been taken out. So any bases that you can see in the DNA between the start and end that we just found but not in the mRNA must have been an intron. Or the other way round: all the bases that are in the mRNA which you can also find in the DNA must be exons.

You will notice that exactly the bit that isn't in the red boxes doesn't appear in the mRNA anymore. Or: All of the mRNA is in the red boxes. The sequence is interrupted by a short bit which you can't find in the mRNA anymore - so this must be an intron.
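The matching procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `find_exons`, the `seed` parameter and the toy sequences are made up for this example (they are not the sequences from the professor's figure), the mRNA is given without its cap and poly-A tail, and the DNA is the coding strand (the one that reads like the mRNA with T in place of U). The greedy extension can overshoot when an intron happens to begin with the same base as the next exon, so treat the result as a first guess rather than a definitive answer.

```python
def find_exons(dna_coding, mrna, seed=4):
    """Greedily map a mature mRNA onto the coding DNA strand.

    Returns a list of half-open (start, end) DNA intervals, one per exon.
    Anything between two consecutive intervals is a candidate intron.
    """
    dna = dna_coding.upper()
    target = mrna.upper().replace("U", "T")  # write the mRNA in DNA letters
    exons = []
    i = 0    # current position in the mRNA
    pos = 0  # leftmost DNA position still allowed
    while i < len(target):
        # Locate the next few mRNA bases somewhere downstream in the DNA.
        start = dna.find(target[i:i + seed], pos)
        if start == -1:
            raise ValueError(f"mRNA position {i} not found in the DNA")
        # Extend the match base by base until DNA and mRNA disagree.
        end = start
        while end < len(dna) and i < len(target) and dna[end] == target[i]:
            end += 1
            i += 1
        exons.append((start, end))
        pos = end
    return exons


# Toy example (hypothetical sequences).
dna = "GGTCATGGCCGTAAGTTTCAGAACTAGGTT"
mrna = "UCAUGGCCAACUAGG"
for start, end in find_exons(dna, mrna):
    print(start, end, dna[start:end])
```

In the toy example the two intervals bracket the stretch GTAAGTTTCAG, which is present in the DNA but absent from the mRNA, so it plays the role of the intron.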


Problem: During "RNA processing"

A. all of the exons are removed and discarded
B. the RNA molecule is made from a DNA template
C. introns are cut from the RNA and the exons are spliced together
D. the RNA molecule is translated into a protein molecule

RNA processing involves a number of modifications of the pre-mRNA molecule that create the mature mRNA ready for translation. These processes take place after transcription and before translation, so they include neither making the RNA from a DNA template (B) nor translating it into protein (D). During splicing it is the introns, not the exons, that are removed, which rules out A and leaves C as the answer.


Genome-wide detection of tandem DNA repeats that are expanded in autism

Tandem DNA repeats vary in the size and sequence of each unit (motif). When expanded, these tandem DNA repeats have been associated with more than 40 monogenic disorders 1 . Their involvement in disorders with complex genetics is largely unknown, as is the extent of their heterogeneity. Here we investigated the genome-wide characteristics of tandem repeats that had motifs with a length of 2-20 base pairs in 17,231 genomes of families containing individuals with autism spectrum disorder (ASD) 2,3 and population control individuals 4 . We found extensive polymorphism in the size and sequence of motifs. Many of the tandem repeat loci that we detected correlated with cytogenetic fragile sites. At 2,588 loci, gene-associated expansions of tandem repeats that were rare among population control individuals were significantly more prevalent among individuals with ASD than their siblings without ASD, particularly in exons and near splice junctions, and in genes related to the development of the nervous system and cardiovascular system or muscle. Rare tandem repeat expansions had a prevalence of 23.3% in children with ASD compared with 20.7% in children without ASD, which suggests that tandem repeat expansions make a collective contribution to the risk of ASD of 2.6%. These rare tandem repeat expansions included previously undescribed ASD-linked expansions in DMPK and FXN, which are associated with neuromuscular conditions, and in previously unknown loci such as FGF14 and CACNB1. Rare tandem repeat expansions were associated with lower IQ and adaptive ability. Our results show that tandem DNA repeat expansions contribute strongly to the genetic aetiology and phenotypic complexity of ASD.


Standard Nomenclature for Genes and Mutations

Figures 1 and 2 exemplify how to number nucleotides and name mutations or variants, respectively, according to the standard nomenclature recommendations of the HGVS (http://www.HGVS.org/mutnomen/). These numbering examples are based on coding DNA reference sequences and protein-level amino acid sequences. “Coding DNA reference sequence” refers to a cDNA-derived sequence containing the full length of all coding regions and noncoding untranslated regions [5′ untranslated region (UTR) and 3′-UTR]; splice variants may lack one or more of the coding exons. Nucleotide numbering is in relation to the translation initiation codon, starting with number 1 at the A of the ATG. Standard mutation nomenclature based on coding DNA reference sequences and protein-level amino acid sequences requires prefixes “c.” and “p.,” respectively, as in Figure 2. Standard nomenclature based on genomic DNA reference sequences and RNA reference sequences is not shown. “Genomic DNA reference sequence” simply indicates any human DNA sequence in the database that is not based on a cDNA sequence. Standard mutation nomenclature based on a “genomic DNA reference sequence” requires a prefix “g.” and numbering starts with number 1 for the first nucleotide in the file.

Example of nucleotide numbering based on a coding DNA sequence. Exonic sequences are numbered sequentially from the initiation codon to the stop codon. Untranslated sequences in the 5′- and 3′-UTRs, as well as in intronic sequences, are numbered in relation to the coding exonic sequences as shown. Note that lengths of DNA sequence are arbitrary.
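As a rough illustration of the exonic part of this numbering scheme, the sketch below converts a 0-based index in a cDNA reference sequence into a "c." label. The function name `coding_position`, its parameters and the toy sequence are assumptions made for this example. It follows the convention that there is no position 0, that 5′-UTR bases count backwards from the A of the ATG (c.-1, c.-2, ...), and that 3′-UTR bases count forwards from the first base after the stop codon (c.*1, c.*2, ...). Intronic positions, which are numbered relative to the nearest exon, are deliberately left out.

```python
def coding_position(index, cds_start, cds_end):
    """Label a 0-based cDNA index in coding-DNA ("c.") style.

    cds_start: 0-based index of the A of the ATG initiation codon.
    cds_end:   0-based index one past the last base of the stop codon.
    """
    if index < cds_start:
        # 5' UTR: count backwards from the A of the ATG; there is no c.0.
        return f"c.-{cds_start - index}"
    if index >= cds_end:
        # 3' UTR: count forwards from the first base after the stop codon.
        return f"c.*{index - cds_end + 1}"
    # Coding region: the A of the ATG is c.1.
    return f"c.{index - cds_start + 1}"


# Toy cDNA (hypothetical): 3-base 5' UTR, ATG AAA TAG, 2-base 3' UTR.
cdna = "GGC" + "ATGAAATAG" + "CC"
for i, base in enumerate(cdna):
    print(base, coding_position(i, cds_start=3, cds_end=12))
```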

Example of standard mutation nomenclature based on a coding DNA sequence. Note that the amino acid change for “c.1A>T” is described as “p.0?” because amino acid changes secondary to codon 1 mutations are frequently unpredictable. In this example, c.1A>T cannot be described as “p.Met1Leu” because it either creates no protein or creates a different protein starting from a cryptic translation initiation site. One may describe the amino acid sequence change as “p.0” if there is experimental proof that no protein forms.

Figure 3 illustrates the process for finding a reference sequence that describes a novel mutation or for searching for the sequence surrounding a particular mutation. As shown in Figure 3, it is essential to find and use the gene symbol approved by the Human Genome Organisation (HUGO) Gene Nomenclature Committee (HGNC; http://www.gene.ucl.ac.uk/nomenclature/index.html). 7,8 A major problem has been the highly variable use of gene nomenclature in the literature, producing multiple symbols and names for one and the same gene 9,10 or one gene/protein symbol that stands for completely different genes or proteins. 11,12,13,14 Up to one third of human genes may have been affected by the homonym problem, 15 mainly because of the nonuse of HGNC-approved official gene symbols.

How to find a DNA reference sequence and HGNC-approved gene symbol. BLAST, Basic Local Alignment Search Tool; HUGO, Human Genome Organisation; NCBI, National Center for Biotechnology Information.

In addition to the use of the HGNC-approved gene symbol, one needs to find the most appropriate reference sequence for a novel mutation. The most appropriate reference sequence may be a coding DNA sequence based on full-length mRNA or a genomic DNA reference sequence. Even if one finds the mutation based on a reference sequence, it may not be the most updated or the most appropriate reference sequence. For example, the reference sequence that has been used to identify a novel exonic mutation might comprise the sequence of only one exon of the gene. In this case, it is appropriate to search for a coding DNA reference sequence based on full-length cDNA.


How to find an EXON - (May/01/2006 )

Hello every one,
I would be thankful if any one can help me in this.
In my project I have to find mutations in a gene, so I have to extract DNA and then go through PCR, etc.
My problem now is that after I found the primer from the NCBI and from several papers, my supervisor refused to consider it because it was an mRNA sequence and she wants the DNA sequence!
Frankly I am not sure if this is correct or not, because I am still learning, but my first problem is that I cannot find such a sequence from the DNA. Of course I can write the cDNA, but I am wondering if there is another way to get the DNA sequence for the primer.

My second problem, which is more difficult for me, is that she wants to know which exons in the gene this sequence refers to. Here I get totally lost because I do not know how to search for the exons. What I know is that the gene I am working on contains 20 exons.

So please, does any one of you have any idea how to help me?
Thanking you all,

I think maybe I can help you for the first part of your question for sure.

you can blast with your cDNA sequence to obtain the genomic sequence. right by the accession number, the next line is a description of the sequence source. you just find a sequence which contains the complete gene with the source being chromosomal DNA clones, not cDNA, and there you are. this only doesn't work if the area of the genome for which you search has not been sequenced yet. does this make sense? I'm not sure if I have explained it properly.

for the second part, I am not sure of the best way to help you. you can compare the cDNA with the gDNA and look for differences, but that will not necessarily take all factors into account and will only give you a rough idea
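Once the exon boundaries are known (from the cDNA/gDNA comparison or from the annotation of the genomic record), working out which exon a primer sits in is simple bookkeeping. The sketch below assumes the exon lengths are already in hand; the function names and the three lengths used are invented for illustration and do not come from any real gene.

```python
from bisect import bisect_left
from itertools import accumulate


def exon_for_cdna_position(pos, exon_lengths):
    """Return the 1-based exon number containing 1-based cDNA position pos."""
    ends = list(accumulate(exon_lengths))  # last cDNA position of each exon
    if not 1 <= pos <= ends[-1]:
        raise ValueError("position lies outside the cDNA")
    return bisect_left(ends, pos) + 1


def primer_exons(start, length, exon_lengths):
    """Return (first_exon, last_exon) covered by a primer on the cDNA."""
    first = exon_for_cdna_position(start, exon_lengths)
    last = exon_for_cdna_position(start + length - 1, exon_lengths)
    return first, last


# Hypothetical gene with three exons of 120, 80 and 150 bases.
lengths = [120, 80, 150]
print(primer_exons(50, 20, lengths))   # primer inside one exon
print(primer_exons(110, 20, lengths))  # primer spanning an exon junction
```

A primer that spans a junction matches the cDNA but not the genomic DNA, because an intron separates its two halves in the genome. That is one reason a supervisor might insist on checking primers against the genomic sequence.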

plug your sequence into the box and wait for the matches to pop up. this is on NCBI's site. are you not familiar with BLAST?

sorry for being late.
It is OK now.
I know this option already but I found that my supervisor was trying to examine me so she wanted a very complex way to get the same result.
Anyway, thank you for your help and sorry for being late.


Introns, Exons, and So-ons (Part I)

"An organism is built and maintained primarily by the actions of proteins coded by genes in the organism's genome. Superficial probabilistic assessments of whether a gene coding for a specific protein could simply occur by chance in the primordial pond have been profoundly discouraging. But these calculations fail to account for several significant characteristics of genes, described in Chapter 7*, that actually make their occurrence highly probable. In fact, these principles of genes cumulatively make it inevitable that a given gene sequence that can code for a specific protein would have been available in the universal sequence pool (USP). Since the expected mean length of the random sequence is the same for any given gene with typical characteristics, almost any gene coding for almost any protein sequence will occur within this expected mean length of the USP."

* amino acid degeneracy in proteins, codon degeneracy in genes, and the ease of finding short exons in random DNA.

"We should note that all genes occurring directly in the USP were split into exons and introns -- typical of eukaryote genes. Finally, the notion that the very first cells must have been complex**, with nuclei -- typical of today's eukaryotic cells -- shows that these cells could have been formed directly from the primordial pond." (page 290)

** "Computer analysis of DNA sequences reveals that the very first genes in the primordial pond were split into coding (exons) and intervening (intron) sequences." (page 230)

Also, see the bold quotations below.

Discussion:

From Keith Robison: (quoting Dr. Senapathy) "In this context it should be noted that there are only three competing theories as to how [split] genes have originated on earth.

I don't really wish to argue this point, but it is curious that Senapathy has left out the other major explanation for split (intron-bearing) genes: that introns were inserted "late", after the divergence from a common ancestor. I don't suppose this would have to do with his statement:

"Finally it explains the absence of any correspondence between the domains of the proteins and the exons of the genes, exactly as shown in the recent study reported in the journal Science by Ford Doolittle's group.

Of course, this is also a prediction of "introns-late"! (but before I am branded a heretic,** I'll shut my mouth :-)

** -- Wally Gilbert is my advisor

From Steve LaBonne: (quoting Dr. Senapathy) Ford Doolittle and colleagues however now say that introns may have been inserted into contiguously formed genes, and support what is called the introns-late premise. But this premise also is untenable, because there is no logical basis for it.

Steve: What does logic have to do with it? Either it happened or it didn't. Now, it's clear that at least some self-splicing introns are rather old (viz. the Group I intron in the cyanobacterial/chloroplast leucine tRNA [UAA anticodon]). But there is now a perfectly good proposed mechanism for the late appearance of spliceosomal introns: to wit, that they have evolved from Group II introns that moved from organelles to nuclei after the endosymbiotic origin of organelles. I don't say this is proven, though there is now a lot of evidence supporting it, but it is absurd to say there is "no logical basis" for introns-late.

And why Senapathy seems to think that this particular debate is the key to the origin of life is quite beyond me. The aforementioned ancient Group I intron may well go back to the origin of the cyanobacteria; that's indeed impressively old, but it's still a long way from the origin of life.

JM: It is key because it goes to the probability of finding genes in the primordial pond. Long genes (not the watch company) would be nearly impossible to find, but genes that are broken into pieces (exons/introns) would not only be easy to find, but are inevitable. Have you read his book? If not, that is the way to understand his theory.

Steve: In that case, the preponderance of evidence for the late (postsymbiotic) appearance of spliceosomal introns, assuming it holds up (personally I believe it will), is in itself sufficient to torpedo Senapathy's theory. I'm afraid you can't have your cake and eat it too!

Now it is apparent to me why Senapathy had to be so airily dismissive of the evidence for the lateness of (spliceosomal) introns. If firmly established, then on your account introns-late would suffice to refute Senapathy's theory. Which means, of course, that Senapathy needs to address the now widely accepted scenario for the evolution of spliceosomal introns from Group II self-splicing introns that escaped from mitochondrial genomes (Cavalier-Smith, T. [1991] Trends Genet. 7: 145-148). Note that much biochemical evidence, published after Cavalier-Smith's proposal, supports the key contention that the mechanism of splicing in Group II and spliceosomal introns is extremely similar, and that the snRNAs in the spliceosome are equivalent to trans-acting bits of a Group II intron active site (i.e. the spliceosome + intron system is essentially a highly fragmented version of a Group II intron). Furthermore, what could be interpreted as early stages of such a fragmentation process have actually been observed in organelle genomes (Bonen, L. [1993], FASEB J. 7: 40-46). Also, at the time of Cavalier-Smith's original proposal, Group II introns were known only in chloroplasts and mitochondria but not in their (respectively) cyanobacteria and purple-bacteria ancestors; that piece of the puzzle was also subsequently filled in (Ferat, J.-L., and Michel, F. [1993] Nature 364: 358-61).

Since introns-late pulls the rug out from under Senapathy's fundamental argument, I would be interested in seeing his response to this body of work.

From Periannan Senapathy: If you are interested in the topic concerning the origin of introns and split-genes, an article I have written on this topic has been published in this week's Science (2 June 95). I have made available a copy of this article and two other accompanying articles, a debate concerning the origin of introns and protein-coding genes, in the web page:

I think that this will answer many of the questions that people have asked recently in s.b.e. regarding the origin of genes. I will soon post some replies to the comments that have appeared here recently on my theory.

Periannan Senapathy
Genome International

From Keith Robison: In his letter to Science, Senapathy baldly states that eukaryotic exons have "an upper limit of 600 nucleotides (with rare exceptions)"

I have a dataset from GenBank 70 (about 4 years old) with which to check this claim. Caveats: Only the coding regions were included -- the 5' and 3' exons are truncated by the length of the non-coding region. Also, some exons may be misclassified at (5'+3') if the first and/or last exon wasn't recorded in the GenBank entry.

Now, "rare exceptions" lacks a quantitative definition, so I will leave it up to the reader to decide whether 2.5% or even 2.0% counts.

From Alix Martin: A large proportion of the human genome consists of introns. These parts of the sequence do not contribute to the making of the human organism, as no proteins are created from them. Yet human cells spend energy replicating them. There is no obvious short-term usefulness for introns. I will here suggest long-term factors that justify their presence in our genome. One particular aspect is that the existence of non-coding DNA sequences might be necessary to allow macro-mutations during the evolution process.

As I'm no professional biologist, these ideas might have been around before, or might even be totally flawed. However, as mixing different scientific backgrounds is often a useful process in science evolution, I'll throw them in.

The Darwinian evolution theory has done a lot to explain evolution mechanisms. However, some people do not consider it as fully satisfying. See for instance Mark Ludwig's work on virtual evolutionary environments, simulating Darwinian processes in a computer. [ Wired 3.02/Computer Viruses, Artificial Life, and Evolution].

Along the evolution process, changes that are not purely incremental appear. For instance, fish grow feet, or humans grow wings. :-) My intuition is that for such a large change to occur, radically new proteins are needed to drive the animal's morphogenesis. These proteins need to be coded by a DNA sequence. On a Darwinian basis, the appearance of such proteins is linked to mutations in the DNA sequence introducing a new allele in the species' gene pool, one that corresponds to the new protein. As mutations are rare, they are likely to occur one by one. My intuition is that important evolutionary steps need new proteins that differ from the old ones by more than one amino acid. (I call this a macro-mutation.) Because mutations are rare, there is a need for an evolutionary pathway between the proteins coded in the non-mutant species' gene pool and the new protein. Each step in this pathway is a one-codon mutation in the DNA sequence coding the protein. If the sequence is an exon (a coding one), the mutant corresponding to each step must be a viable one, and the allele needs to survive until the next mutation step occurs. I see this as an evolutionary tunnel until the useful sequence, coding for a useful protein, is attained. For me, it is unlikely that such a tunnel can be crossed without generating a freak at one of the intermediate steps. One argument that can lead to thinking that the tunnels are more than one mutation long is that if a single mutation were sufficient to lead to a useful change, it would happen quickly and take over the entire species. For me, this is what happens when humans grow taller, not when fish start walking. For me, there are two different time scales here.

A common practice in computer programming is to comment out pieces of code that were useful at one time but are not any more. Not throw them away, but keep them as comments, even if they do not contribute to the program's function any more, as they might be useful again in another context or at a further step. Perhaps introns are just nature's way of giving genetic code a comment status. In computer programming, special signs delimit the non-executed parts of code, like /* COMMENT */ in C. Similarly, there are specific sequences of DNA that mark the beginning and the end of an intron.

There is no selection pressure on the portions of genetic code that are introns. Whatever mutations affect the DNA sequences contained in introns, they are not expressed as proteins, and therefore do not affect the fitness of the allele. Only when a mutation corrupts the start code of the intron does the mutated sequence become meaningful. Then, it is very likely that the new protein generated will be useless, or will even make the allele non-viable, but from time to time, a macro-mutation that is useful will appear, and such a mutation could not have been reached if the intron mechanism had not existed. Introns allow species to go through evolutionary tunnels without being subject to selection pressure all the way. Of course, mutations in the start code of the introns are very unlikely, as these sequences are only a few codons long, but this is consistent with the long time scale under which important qualitative mutations affect species.

Thus, introns are not useless, but are a key factor in allowing life to perpetuate itself. It is a real long-term mechanism, just as "normal" mutations are a long-term mechanism, sexual reproduction a medium-term one, and scattered response threshold distributions in cells are a short-term adaptation mechanism (another story).

To verify all this, I would suggest testing whether the entropy of non-coding sequences is higher than that of exons. For this, all that is needed is a databank holding the same stretch of DNA from different individuals of the same population, with the stretch containing both introns and exons. If there is no selection pressure in introns, their entropy should be higher.
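One way this "entropy" could be made concrete is per-position Shannon entropy across aligned copies of the same region from different individuals: a column where everyone carries the same base scores 0 bits, and a column split evenly among all four bases scores 2 bits. The sketch below is only one possible reading of the proposal. The function names and the toy alignment are invented for illustration, and the sequences are assumed to be already aligned and of equal length.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (in bits) of the bases seen at one aligned position."""
    counts = Counter(column)
    n = len(column)
    return sum((c / n) * math.log2(n / c) for c in counts.values())


def mean_entropy(aligned_seqs, start, end):
    """Average per-position entropy over the half-open region [start, end)."""
    columns = zip(*(seq[start:end] for seq in aligned_seqs))
    entropies = [column_entropy(col) for col in columns]
    return sum(entropies) / len(entropies)


# Toy alignment of four individuals (hypothetical data):
# positions 0-3 stand in for an exon, positions 4-5 for an intron.
individuals = ["ACGTAA", "ACGTCA", "ACGTGA", "ACGTTA"]
print("exon  :", mean_entropy(individuals, 0, 4))
print("intron:", mean_entropy(individuals, 4, 6))
```

With real data the comparison would also need to account for alignment gaps, sample size and repetitive elements, none of which this sketch handles.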

From Keith Robison (in reply to Alix Martin): Just to clear up some definitions here. Introns are transcribed regions which are spliced out of mRNAs. While the proportion of the genome which is intronic is probably greater than the exonic regions, both are probably grossly dwarfed by the intergenic DNA (between genes).

One contender for an explanation [of the usefulness of introns] is that non-coding DNA generally has NO function -- it is just a "selfish" parasite which can be tolerated.

There are selection pressures [ on the portions of genetic code that are introns]. First, the signals for splicing introns are contained in the introns, and so there is a selection to maintain them. Second, those signals can't be compromised by conflicting signals -- so there is a pressure to avoid generating new splicing signals in inappropriate locations.

Actually you are mixing units ["Of course, mutations in the start code of the introns are very unlikely, as these sequences are only a few codons long,"] -- the term codon has no relevance in terms of intron splicing signals. Also, your hypothesis should include both "start-intron" and "end-intron" signals. Of course, one difficulty with extending an exon into the adjacent intron is that the extended exon must match in frame -- 2/3 of the time an exon-extension event will result in an untranslatable message.

You should define "entropy" precisely and describe how you will attempt to measure it. Also, there are factors which might confound your analysis. In particular, non-coding DNA (both inter- and intragenic) is largely composed of repetitive elements -- segments of DNA found frequently in the genome. Many of these elements are known to be capable of transposition (copying) within the genome.

Also, the databanks used to be heavily skewed towards short introns. With the advent of genomic sequencing, this bias is beginning to lessen but is being replaced by exons and introns which have been predicted by computer but not experimentally verified (not a good base for modeling).

Caveats aside, the field of intron-function is still pretty open. The datasets are only getting better, and so if you're interested in it you should plunge right in!

From Mark E. J. Newman: Another important point is that non-coding regions are important simply for the physical space they take up. The position in three-dimensional space that different coding regions occupy can have important consequences for transcription regulation, and the presence of non-coding regions can allow the coding ones to take up their proper positions. Thus the actual content of the non-coding regions may be unimportant, but their presence is crucial to proper action of regulatory mechanisms.

This type of mechanism ["comment pieces of code that were useful at one time but are not any more"] is seen in artificial evolution. Pieces of code become inactive and are reactivated to the organism's advantage later on. I should not be surprised to learn that it takes place in nature too, though I can't give you any specific examples.

Actually, some work of this nature has already been done [on "testing if the entropy of non-coding sequences is higher than the one for exons"]. Not on introns in particular, but on non-coding DNA. It was done by H. Eugene Stanley of Boston University and some co-workers whose names escape me, and it was in the last year, but other than that I can't remember where I saw it. The basic idea was to do an information theory analysis of the information content of coding and non-coding DNA as a function of length. The basic result, if I remember it correctly, was that the NON-coding DNA had a "message-like" information content, i.e., increasing linearly with the length of the sequence analyzed, but that coding DNA did not: the information content increased more slowly than linearly. I'll see if I can find the reference anywhere.

Incidentally, in the particular case of introns which you were talking about above, I also know of at least one case in which an intron has a function, even though it is not translated. In that case, the physical presence of the intron in an mRNA that codes for a growth-promoter slows down the translation of the RNA (which cannot take place until the intron has spliced itself out). If you remove the DNA that codes for the intron from the genome, you still produce the growth-promoter, but you produce it too fast, and tumor-like overproduction of cells can occur.

From Keith Robison (in reply to Mark Newman): There is a related story of certain developmental factors in Drosophila (e.g. string). These genes have enormous (>100 Kb) messages which take a very long time to transcribe. It turns out that at each cell division incompletely transcribed pre-mRNAs are destroyed. String's transcript is too long to transcribe completely during the first few Drosophila divisions, which occur in quick succession. As a result, a full-length string transcript cannot be made until longer-period cell divisions occur. So the sizes of the introns help make the developmental decision of when the string protein is made!

From Chip Young: I read somewhere, probably Science News, that the supposed non-coding DNA is quite stable. Close similarities from individual to individual.

Presumably, if it was really useless, it would mutate relatively fast since the environment wouldn't be weeding out errors.

Its relative stability suggests it does something, we know not what.

From Dave Oldridge: Senapathy simply does bad math. His error is identical to that of creationists who state that evolution could not happen because (for example) elephants are highly improbable.

JM: I understand your objection to Senapathy's math in that example. However, there are two important mathematical aspects to his theory: (1) the probability of point mutations creating new genes, and (2) the probability of eukaryote genes forming in the primordial soup. We've already run the discussion out on part 1. Now, tell me what's wrong with his numbers on part 2.

From Keith Robison: The problem is that Senapathy is playing parlor games, not proposing a workable model. Yes, if you stare at random text you can find a message in it -- but only because you know the message (or a message) to find in it. Biology doesn't work that way -- splicing of an mRNA is not guided to produce only useful mRNAs. There are signals within the original transcript which guide the splicing process.

So, what's important is not the likelihood that a message occurs somewhere in a random sequence after splicing it to fit, but whether the pieces of the message occur flanked by the correct splicing signals. To extend the example of "To be or not to be" from the book in a trivial manner, suppose the letter "Q" is both a start-splicing and end-splicing signal. The question is then what is the probability of finding

You could, of course, change the signal to anything you want, but remember that both your proposed exons must be flanked by signals and both proposed introns and exons must lack it. I think you will find that the statistics aren't much of an improvement over finding your whole target message in random DNA.

JM: Now, tell me what's wrong with his numbers on eukaryote genes.

From Keith Robison: There are signals within the original transcript which guide the splicing process. So, what's important is not the likelihood that a message occurs somewhere in a random sequence after splicing it to fit, but whether the pieces of the message occur flanked by the correct splicing signals.

JM: That is not a problem. Dr. Senapathy uses 600 nucleotides as the typical exon length, and so adding a few more (specifically, 9 + 4 = 13) nts won't make much difference. An exon could be "defined" to include the start and end splices and the probabilities computed from that. 600 is still a good average length to use, but use 613 if you want. In fact, for all but the longest exon of a gene, the additional 13 nts won't make any significant difference to the likelihood of finding the complete gene in a random DNA sequence because the chances of finding the complete gene are only dependent on the length of the longest exon in the gene. The math behind this reasoning is discussed at the start of Chapter 7, and text strings are used there as example "eukaryote genes" (pages 222-230).

Keith: Senapathy is completely wrong here, and I'm surprised you have swallowed it. Because of the way splicing works, what is important is the frequency of splicing signals in the random sequence -- you don't form genes just by taking what exons you want. As I noted before, in his "to be or not to be" example, Senapathy shows no method other than finding a predetermined target. He skips over plenty of legitimate English words (exons), some ("awry") longer than the ones he chose.

JM: Also in Chapter 7, Dr. S. discusses the chances of finding long reading frames in random DNA, and he does an extensive analysis of the effect of the frequency of stop codons. So, contrary to your characterization, I don't see that he's playing "parlor games."

Keith: Then keep looking, or spend more time in the parlor ( :-). In every case (Figures 7.1, 7.2, 7.20) Senapathy first selects the message he is looking for, and then scans the random sequence looking for it. Lots of fun, but completely irrelevant to the field of biology.

Again, the way biology really works (and remember, Senapathy is saying that things can't have changed :-), is that the spliceosome moves down the transcript, and when it hits a "start-splicing" signal, that's the end of an exon. It then scans for a "stop-splicing" signal which marks the beginning of the next exon. The spliceosome knows nothing about open reading frames or phases. As I have pointed out before, the odds of successfully getting an ORF from a moderate number of exons is very low, as 2/3 of your splicing events will be out-of-phase.
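The phase argument can be put in rough numbers. The sketch below assumes, as a simplification that is not taken from Senapathy's book, that each randomly placed splice junction preserves the reading frame with probability 1/3, independently of the others. The function name is purely illustrative.

```python
def prob_all_in_phase(n_junctions, p_in_phase=1.0 / 3.0):
    """Chance that every one of n_junctions splice events preserves the frame.

    Assumes the junctions are independent and that each one is in phase
    with probability p_in_phase (1/3 for junctions placed at random).
    """
    if n_junctions < 0:
        raise ValueError("n_junctions must be non-negative")
    return p_in_phase ** n_junctions


if __name__ == "__main__":
    for n in (1, 3, 5, 10):
        print(n, "junctions:", prob_all_in_phase(n))
```

Under these assumptions the chance of keeping the frame falls off geometrically with the number of junctions, which is the point being made about genes assembled from a moderate number of exons.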

... remember that both your proposed exons must be flanked by signals and both proposed introns and exons must lack it.

JM: He does not ignore that requirement. See pages 230-239 and 242-247 plus other places in Chapter 7.

Keith: Senapathy never really deals with this problem. The closest he comes is suggesting that somehow the splicing process can recognize regions densely populated with stop codons (p.245). Again, there is NO evidence for this, and contrary evidence.

How, in this model, can you explain very short exons?

In the current issue of Nature Genetics, there is a report of a coding-region mutation which causes a genetic disease, yet it does not change the predicted amino acid sequence. However, it turns out it generates a "start-splicing" signal, and hence that exon is prematurely terminated.

Senapathy's model shows no correlation with biological reality. Splicing does not know about translation.

And BTW, the distribution of known exon sizes does not fit an exponential distribution (Stoltzfus et al got the distribution right in their Science rebuttal), and there is no "cutoff" of exon sizes at a location convenient for Senapathy (there are exons which encode 1000's of amino acids). Senapathy can't even get his supporting facts straight.

From Keith Robison: In every case ... Senapathy first selects the message he is looking for, and then scans the random sequence looking for it.

JM: True. In his English-text example "genes," Dr. Senapathy looks for a few specified sequences (and finds them all), but he does this to illustrate that any sequence will be found. He is not restricting the search to just those sentences. On the contrary, he is encouraging you to search for any sentence you want (with the sole requirement of the limit on longest word). The random, 3-billion character sequence Senapathy used is too long to publish, or even e-mail. However, you are allowed to manufacture your own random sequence, and you can and should use many more than one, as this would represent the abundance of DNA that was available (see below for some numbers).

Once the length of the longest exon in a gene (including the splice sequences and the signals that start and end a gene) is specified, Senapathy shows how to compute the length of the random DNA needed to assure that that gene and any other gene (with one restriction) will be found there. He shows that the amount of DNA so computed will be a reasonable amount (i.e., that amount would be many times less than the total amount of DNA available in the pond). The "one restriction" is that the length of the longest exon in those other genes must not be longer than the longest exon in the specified gene.

Keith: Again, this is utter hogwash. All he is proving is that you can find the sequence in there if you know what you are looking for; that any biological system could extract it is another matter altogether. Senapathy's calculations are hopelessly naive; the real calculation is much more difficult. But, in general, once you blindly transcribe random sequence and splice it at the randomly occurring splice sites, you will basically find it looks like the DNA you started with in terms of the trinucleotide (codon) frequencies -- i.e., this exercise is not a magical solution to finding long genes in random sequence.

As Arlin Stoltzfus has already pointed out, there is no particular reason to expect that the initial genes were particularly long. Genes have probably undergone a lot of fusion & rearrangement, yielding the modern long reading frames and exons -- even Senapathy admits this, because he must explain away the intron-less prokaryotic genomes.

JM: For those of you who do not have his book, Dr. Senapathy uses these numbers after taking into account the degeneracy of codons and amino acids:

Keith: As your calculation shows, Senapathy's pond contains 10^5-10^10 kilograms of high molecular weight, double-stranded DNA. Biological systems are quite capable of generating this; a serious challenge for any abiogenesis scheme is generating the biomolecules (one was just published in Nature). Senapathy says "no problem" -- and then assumes it will be polymerized, double-stranded, and high-mw (or else his calculations croak from "edge effects" -- you can't run a long gene into DNA which doesn't exist). Furthermore, this DNA is being replicated, transcribed, and translated.

JM: Perhaps this is what you are getting at: Since two of the sentence examples used on page 229, "God heals, and the doctor takes the fee" and "Love is the wisdom of the fool and folly of the wise," plus many other sentences are all found in the same random text sequence, and since two or more such sentences could overlap in the random text, the actual sentence found might be something like "Love is the doctor the wisdom takes of the fool, the fee." However, unlike these word examples, the longest exon in each of two real genes will not likely be near each other relative to the locations of the shorter exons. That is, the two genes are unlikely to overlap because all of the shorter exons will be found very close to the longest exon. Besides, so what if a few of the specified genes do overlap? We aren't looking for any specific gene -- we take whatever we find and test it for viability. Win or lose, just keep going. Dr. Senapathy is only saying that the odds of finding eukaryote genes (and assembling them into viable genomes) are so high as to make it very possible, not nearly impossible.

Keith: And the point is, he has overestimated these odds grossly. He has led you down the garden path by equating splice signals with stop codons, when in reality what little resemblance there is is probably coincidental (BTW, the consensus for the end of an intron is Yag, where Y=T or C, and C predominates slightly; but why let the facts get in the way of a cool hypothesis?).

JM: (continuing) ... He is not making any statement here about the viability of any particular gene, just that there will be so many genes (viable or not) that a few will "survive" in the pond. He uses the characteristics of known viable genes as a basis for the computations.

Dr. Senapathy does not specifically include the gene start and end signals in his discussion, but I don't see why those signals could not simply be treated as a "null" length exon and included in the search. Since those sequences are short relative to the longest exon, they won't affect the amount of DNA needed to find them. You could argue that since they are so short, they'll be found so often as to goof up your search for a long gene. (Let me know if I'm helping you too much here. :-) Well, they might do that in many cases. But, there is also a reasonable chance that they won't occur. I don't have the numbers on this, but I suppose a couple of weekends of work would produce them.

Keith: You've missed the point -- entirely. English words don't have phasing; mRNA translation does. There is also no real genetic equivalent to spaces -- splice sites are made of the same 4 letters, and their interpretation depends on context (i.e., an "end-splice" signal is irrelevant unless it follows a "begin-splice" signal). So the problem is that when you hit the next random splicing signal, odds are your translation will come to a halt.

He skips over plenty of legitimate English words (exons), some ("awry") longer than the ones he chose.

JM: But that's what makes this work. In the example, he is looking for specific words, but in reality whatever genes occur (along with the other genes forming a genome) may eventually get tested for viability. You didn't want him looking for any specific sequence, so you cannot let yourself do that either, and that includes specific sequences having rogue start/stop signals. Take any gene you find and test it for viability. That random DNA will contain many, many genes. The few viable ones will survive, and they will become more numerous over time.

Now, am I still missing something? If you think so, then can you use some numbers to refute Senapathy's numbers or logic?

Keith: We can divide his theory into two versions,

  • Senapathy petite: Formation of genes by random splicing
  • Senapathy grande: Independent Origin of all Species

Senapathy grande is fatally flawed on many levels, for a sampler:

    Many organisms development rules out the "seed cell" hypothesis

That leaves Senapathy petite (i.e. Senapathian formation of genes, but conventional organismal evolution). Even if we scale it down to meet my criticisms with regard to the negligible gain in ORF size, it is now an "exons-early" theory, and in general the exons-early boat has its gunwales at about the waterline.

In summary, Senapathy's book is a grossly flawed exercise in self-delusion. There is a great abundance of evidence to refute his big claims, and scaling down his claims doesn't put him in good shape either. Also, as a scientific theory Senapathy grande is utterly, absolutely worthless -- though I haven't quite decided if it is because it makes every prediction or no predictions (either way, useless). This stands in great contrast to evolution via common descent, which is a key theory in understanding biology and an important guide to real experimentation. If he hadn't paid to publish it, I would have assumed it was an elaborate parody of cargo-cult science; instead, it just is.

From Keith Robison (reprise): In summary, Senapathy's book is a grossly flawed exercise in self-delusion. Also, as a scientific theory Senapathy grande is utterly, absolutely worthless -- though I haven't quite decided if it is because it makes every prediction or no predictions (either way, useless).

JM: and suddenly you are doing an awful lot of ranting and raving, but you offer no numbers or information to support your claims that Dr. Senapathy is crazy. Your whole post was that way.

Keith: Okay, I'll admit I was cranky in that post. But the facts still stand: There is a great abundance of evidence to refute his big claims, and scaling down his claims doesn't put him in good shape either.

JM: What evidence refutes his claims? The evidence that supports macroevolution does not count -- you need to show what evidence there is that does not match Senapathy's theory.

Keith: Senapathy makes a number of claims about the statistical properties of introns and exons, which he says are a natural result of his theory (and therefore evidence for). To wit, Senapathy claims explicitly that exon sizes follow an exponential distribution, and his logic implies that intron sizes should follow a similar distribution.

Exon sizes follow a much more complex distribution (Stoltzfus et al got it right in their Science rebuttal). Intron sizes aren't exponential either (saw a presentation on this last week) -- they looked sort of normal-ish to me.
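For comparison, here is what a purely random model gives for open reading frame lengths, which is where an exponential-looking (strictly, geometric) distribution comes from. This is a minimal sketch that assumes equal, independent base frequencies, so that 3 of the 64 triplets are stops. It is not Senapathy's calculation and says nothing about observed exon sizes.

```python
P_STOP = 3.0 / 64.0  # three stop triplets out of 64, assuming equal base frequencies


def prob_open_frame(n_codons, p_stop=P_STOP):
    """Probability that n_codons consecutive random triplets contain no stop."""
    return (1.0 - p_stop) ** n_codons


def mean_triplets_to_first_stop(p_stop=P_STOP):
    """Mean number of triplets read up to and including the first stop."""
    return 1.0 / p_stop


if __name__ == "__main__":
    print("mean run length (triplets):", mean_triplets_to_first_stop())
    for nts in (150, 300, 600):
        print(nts, "nt open:", prob_open_frame(nts // 3))
```

Under this model a frame is interrupted about every 21 triplets on average, and the chance of 600 nts (200 triplets) staying open by accident is well under one in ten thousand.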

Senapathy's calculations are hopelessly naive; the real calculation is much more difficult.

Keith: Again, the real calculation would have to consider the probability of splicing signals -- i.e., what does the distribution of ORF lengths look like after randomly-transcribing and then splicing the mRNAs. The only way I could do it is by simulation, which would be a bit more work than I'm willing to do. Nevertheless, we can make an intelligent prediction of the result (see below).

... this is utter hogwash. All he is proving is that you can find the sequence in there if you know what you are looking for

JM: No, he's showing that you can find any sequence (of specified limited length) in a large, given amount of random DNA.

Keith: But, in general, once you blindly transcribe random sequence and splice it at the randomly occurring splice sites, you will basically find it looks like the DNA you started with in terms of the trinucleotide (codon) frequencies.

JM: Yes, so what? I don't think Senapathy is saying otherwise, is he? Where?

Keith: Because he is saying that in a biochemical system, you can find genes in random DNA if you splice, but not if you don't splice. In other words, the splicing process somehow adds information content. But it CAN'T! Because a randomly-transcribed+spliced sequence pool has the same trinucleotide composition as the unspliced starting sequence, the splicing operation has done NOTHING to the probability of finding a long ORF. This is why all Senapathy's calculations are just smoke.
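A toy simulation along these lines is easy to sketch. The version below is only an illustration under strong assumptions: bases are uniform and independent, the donor and acceptor motifs (`GTAAGT` and `CAG`) are placeholders rather than real splice-site models, and only frame-0 stop frequency is compared. It is not the full ORF-length calculation described in the thread.

```python
import random

STOP_TRIPLETS = {"TAA", "TAG", "TGA"}


def random_dna(n, rng):
    """Return n bases drawn uniformly and independently from A, C, G, T."""
    return "".join(rng.choices("ACGT", k=n))


def toy_splice(seq, donor="GTAAGT", acceptor="CAG"):
    """Remove every stretch from a donor motif through the next acceptor motif.

    Both motifs are placeholders chosen for this sketch, and both are
    removed together with the sequence between them.
    """
    kept = []
    pos = 0
    while True:
        d = seq.find(donor, pos)
        a = seq.find(acceptor, d + len(donor)) if d != -1 else -1
        if d == -1 or a == -1:
            kept.append(seq[pos:])  # no complete donor/acceptor pair left
            return "".join(kept)
        kept.append(seq[pos:d])
        pos = a + len(acceptor)


def stop_fraction(seq):
    """Fraction of frame-0 triplets that are stop triplets."""
    triplets = [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]
    return sum(t in STOP_TRIPLETS for t in triplets) / len(triplets)


if __name__ == "__main__":
    rng = random.Random(0)
    dna = random_dna(300_000, rng)
    print("unspliced:", stop_fraction(dna))
    print("spliced:  ", stop_fraction(toy_splice(dna)))
```

With signals that are unrelated to stop triplets, both fractions should sit close to 3/64, which is the sense in which blind splicing of random sequence leaves the triplet statistics essentially where they started.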

JM: I don't read it that way. It seems to me that Senapathy's random DNA looks like eukaryote DNA. For example, in figure 7.4 (page 236) he writes: "The only way a gene longer than 600 nts could originate was to select some short reading frames and splice them together ... by editing out the intervening regions containing many stop codons. Such a splicing resulted in a long reading frame which could then code for a long protein. In today's biology, the short coding pieces which were spliced together are called exons, and the intervening pieces, the introns." That is, he is saying the short pieces and RFs (before splicing) are the exons and the other stuff makes up the introns.

Where does he say you have to splice before you make the eukaryote gene (complete with introns)? The transcription and splicing is being done after the random hunk of DNA is put into a genome.

Keith: ... previously you were quite adamant that the proper way to do such a calculation is to consider only the observed sequences, not the complete spectrum of potentially interchangeable sequences. Have you changed your mind?

JM: No -- we've changed the subject here. Our previous discussion was about point mutations and getting new genes therefrom. This is about finding any gene in a bunch of random DNA. I have not rejected your objections to the point mutation logic, but we reached an impasse there -- ultimately we both said that it didn't really matter (because, from your point of view Senapathy's model is wrong, and from my point of view there is no model). So, forget the point mutation discussion because it does not apply in any way to the main part of Senapathy's theory regarding the DNA in the pond.

Keith: Jeff, you have missed the point. When we discussed point mutations, you argued (and Senapathy used) that the correct question to ask was what was the probability of evolution arriving at the observed sequence, not the possibility of drawing any one of the possible isofunctional sequences. I am now suggesting that you (and Senapathy) remain consistent -- you must calculate the probability under Senapathy's model of drawing each of the observed genomes, not of drawing any one of the possible isofunctional genomes.

The underlying statistical logic is the same in both arguments, but Senapathy chooses the one to fit his purposes (and you have gone on ala-lemming). In other words, to remain consistent you & Dr. S. must calculate the probability of finding all of the current genomes in the soup. Hint: for the human genome it's 4^[-1*(10^6)]; repeat for all remaining genomes, multiplying the probabilities.

JM: He has done this. With the longest exon specified, the length of random DNA is computed, and within that DNA will probably be found all possible exons of that length. In the word example, all 6-letter words will likely be found in the 3 billion random character sequence.

Keith: So how do you explain all those big exons (there are many greater than 400 nts, as I have posted -- as big as 7Kb as I recall)?

JM: Senapathy uses 600 nts as a typical longest value, but agrees there are some that are longer. What's wrong with that? If his average longest exon frequencies are wrong, what are the correct numbers?

Keith: As I posted before, there are many longer exons. Calculate under Senapathy's model the number of exons you would expect to find greater than 600 nts in length.

JM: Note that most genes have longest exons of only 100-150 nts; total DNA available in the pond = 10^30 to 10^35 nts.

Keith: (reprise) As your calculation shows, Senapathy's pond contains 10^5-10^10 kilograms of high molecular weight, double-stranded DNA. Biological systems are quite capable of generating this; a serious challenge for any abiogenesis scheme is generating the biomolecules (one was just published in Nature). Senapathy says "no problem" -- and then assumes it will be polymerized, double-stranded, and high-mw (or else his calculations croak from "edge effects" -- you can't run a long gene into DNA which doesn't exist). Furthermore, this DNA is being replicated, transcribed, and translated.

  1. WHERE DID ALL THAT DNA COME FROM?
  2. How is it all so damn long? Senapathy's calculations assume the DNA as one long strand, or at least each strand is much, much longer than a eukaryotic gene. As I have pointed out before, maintaining DNA of such length is a challenge, as DNA is not structurally very strong and will easily break.
  3. Where did the transcription machinery and splicing machinery come from?

JM: Random chance, just like everything else. Once things began to work, the machinery was replicated (more often than the stuff that didn't work).

I don't see why it's necessary that the random DNA all be in one long piece. If it was churning about and various long pieces were formed, even briefly, then broken and formed in another sequence, wouldn't that work, too?

Keith: (reprise) You've missed the point -- entirely. English words don't have phasing; mRNA translation does. There is also no real genetic equivalent to spaces -- splice sites are made of the same 4 letters, and their interpretation depends on context (i.e., an "end-splice" signal is irrelevant unless it follows a "begin-splice" signal). So the problem is that when you hit the next random splicing signal, odds are your translation will come to a halt.

JM: OK, it halts. Keep going and won't it eventually restart?

Keith: Nope. Not usually. In bacterial systems, it does frequently restart if there is a start sequence nearby -- but it starts a new peptide chain!! (stop codons are really "stop translation and release peptide" codons) Very rarely will eukaryotic ribosomes restart, and again, it will be a separate protein.

And anyway, this doesn't really matter. Remember, Senapathy is claiming (except where he needs it) that evolution is impossible -- i.e., under his model each genome looks almost exactly like it did the day it emerged from the soup.

And again, Senapathy is explicitly claiming that splicing is the route to long ORFs. What you are attempting to do is find a way around this -- not disputing that Senapathy is wrong. Senapathy is claiming that splicing builds big ORFs, and he's just plain wrong.

JM: If you think Senapathy is wrong, then please correct that part of his theory and change the numbers and recompute the amount of DNA needed. If you are correct, the number will be huge and the DNA unobtainable. That would be a lot of work and it may not be reasonable for you to do the math, but that is what you must do to show that the amount of DNA is not sufficient to satisfy Senapathy's theory.

Keith: I don't need to -- Senapathy has already done it for us, but mislabeled it. Because the spliced sequences look just like the unspliced sequences at the level of ORFs, we can use his calculation: 10^120 nucleotides (to find 200nt ORFs at high frequency).
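For scale, 10^120 is roughly 4^200, the number of distinct sequences of 200 nucleotides. The sketch below shows that arithmetic. Whether the book arrives at the figure this way is an assumption on the part of this example, and the function name is illustrative only.

```python
import math


def log10_sequence_space(length_nt, alphabet_size=4):
    """log10 of alphabet_size ** length_nt.

    This is the number of distinct sequences of the given length, and so
    roughly the number of random positions needed to expect one exact
    match to a single specified sequence.
    """
    return length_nt * math.log10(alphabet_size)


if __name__ == "__main__":
    for nts in (100, 150, 200, 600):
        print(nts, "nt -> about 10^%.1f" % log10_sequence_space(nts))
```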

JM: Just saying he is wrong (more precisely, you said: "Senapathy's book is a grossly flawed exercise in self-delusion. Also, as a scientific theory Senapathy grande is utterly, absolutely worthless") is not a very convincing argument. Why can't you quantify your arguments as he has done?

Keith: (reprise) And the point is, he has overestimated [the odds of finding eukaryote genes] grossly. He has led you down the garden path by equating splice signals with stop codons, when in reality what little resemblance there is is probably coincidental.

JM: How should it work? What are your numbers on those odds?

Keith: See above. The point is that his numbers are absolutely meaningless. All those impressive calculations -- irrelevant. Do you start to understand my frustration with this issue? Senapathy snows the readers with all this stuff, when it is completely pointless.

(reprise) Pray tell what is applying the selection? According to Senapathy's model, no selection occurs until the whole mess is assembled into a "seed cell" (itself a horrifically-flawed concept at odds with much established fact).

JM: What is the established fact that refutes Senapathy's seed cells?

Keith: I've posted this before.

    Many metazoans ("multicellular animals") have developmental schemes which require the asymmetric localization of proteins and mRNAs in the ovum. These patterns are laid down by cells in the mother's body. This solves the problem known as "symmetry-breaking" -- how can an apparently symmetric egg generate an asymmetric organism.

Senapathy's seed cells would have no such external pattern to impose asymmetric distributions of proteins and RNAs. Furthermore, if a seed cell could develop without them there would be no reason to expect the current requirements for them.

Again, Senapathy's pond could not generate such controlled heterogeneity, and we would not expect it to occur if Senapathy's pond-mammals could emerge without it.

(reprise) No, what is most important is that it is unlikely that once you find your gene, that it will have the appropriate constellation of regulatory sites to be expressed in a useful manner.

JM: Isn't this built into Senapathy's DNA? That is, there is a certain probability that you will find the splice signals and so on, and in the proper phase, in the random DNA. If Senapathy is wrong and if you think he's ignored this in his numbers, then please offer some alternative numbers, showing what the result would be if Senapathy had done it "right."

Keith: Again, Senapathy is so good with equations because he picks trivial ones. Doing a good transcriptional signal prediction is HARD -- you must make a lot of assumptions about probabilities (some of these things have very weird locational properties -- still not well understood). I don't have the numbers, and so I won't make pretty-yet-meaningless equations. But, given that there are probably 10^4-10^5 different transcriptional patterns in a human, Senapathy has underestimated things by at least that factor.

(reprise) Again, how is selection acting within the magical pool, when the functions of the genes can't be tested until they exit the pool?

JM: Selection of viable genes is happening outside the pond. Replication is happening inside and outside.

Keith: Selection must be coupled with replication in order for this process (generally called Darwinian selection) to have any effect.

(reprise) All abiogenesis mixtures are suicidal (hint: which is simpler to form, a fully independent organism able to meet all its needs or a free-loader which slurps the soup?)

  1. It takes many genes to make complex biomolecules.
  2. The genes to utilize complex biomolecules are common to all life.
  3. 1 & 2 imply that there are fewer required genes for building a scavenger vs. a synthesizer.
  4. The probability of an organism emerging from a pond is proportional to the number of required genes.
  5. 3 + 4 imply that scavengers will emerge more frequently than synthesizers.
  6. A single mutation in a synthetic pathway can knock it out.
  7. Deleterious mutations are frequent.
  8. 6+7 imply that synthesizers will frequently mutate to scavengers.

Ergo, scavengers will emerge from the soup. Such scavengers will devour the soup, and while doing so kick out enzymes which degrade the components of the soup (this is a fact of life -- dipping your fingertips into Senapathy's pool would be genomicide on a large scale). Such scavengers would consume the soup.

Conclusion: no abiogenetic soup can long survive abiogenesis.

Jeff, I am wearing out. Let me put it this way -- what does Senapathy's theory NOT predict? And again, Senapathy's retrodiction/accommodation of homology is based on his genome recycling theory, which I have pointed out is not compatible with the known properties of DNA. I have also pointed out that Senapathy's soup cannot coexist with decomposers, and such decomposers have been around for millennia.

I try to stay calm, but his book is just so maddening! The reason it looks good (in the book) compared to evolution is that evolutionary theory is real science with all of its warts and shortcomings exposed, debated, and analyzed. Senapathy presents a glowing picture, failing to present one flaw in the theory. I (and others) have presented many, and they are glaring. Once you clear away all the false statements and "spherical cow" assumptions, there isn't much left of the book.

Put in a fair fight with modern evolutionary theory, Senapathy's theory just doesn't offer any promise.


Exons, Introns, Codons, & their equivalents


Three common technical terms in molecular genetics, exon, intron, and codon, have specific technical definitions, but are often misused in hurried or shorthand presentations. The main thing to remember is that exons and introns are features of DNA, whereas codons are features of RNA. Homologous sequences in the other type of nucleic acid need to be called something else; otherwise there is a danger that the roles of DNA and RNA in the Central Dogma ("DNA makes RNA makes Protein") will be confused.

By definition, exons are the sequences in a protein-coding gene region of a double-stranded DNA molecule (dsDNA) that are expressed as protein, and introns are the intervening sequences that are not so expressed. The exons and introns are typically shown as the single-stranded sequences of the Sense Strand of the dsDNA, written 5'-3', left to right.

Transcription of the complementary Template Strand produces a heterogeneous nuclear RNA (hnRNA) that is identical (co-linear) in 5'-3' orientation and base sequence to the DNA Sense Strand, with the substitution of U for T. The RNA sequences equivalent to the DNA exons and introns are sometimes themselves referred to as "exons" and "introns"; however, this is technically incorrect and also confuses their functional role in transcription and translation with exons and introns as gene sequences in DNA. The RNA sequences equivalent to DNA exons and introns can be referred to as "exon transcripts" and "intron transcripts," or "equivalents," respectively.
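The relationship between the two DNA strands and the transcript can be checked with a few lines of code. This is a minimal sketch: the function names are illustrative, both strands are passed in written 5'-3', and the five-base sequence TCATG is used purely as an example.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(dna_5to3):
    """Return the opposite DNA strand, also written 5'-3'."""
    return "".join(COMPLEMENT[base] for base in reversed(dna_5to3))


def transcribe_template(template_5to3):
    """RNA (5'-3') produced by transcribing a Template Strand given 5'-3'."""
    return reverse_complement(template_5to3).replace("T", "U")


def sense_to_rna(sense_5to3):
    """RNA equivalent of the Sense Strand: same sequence with U for T."""
    return sense_5to3.replace("T", "U")


if __name__ == "__main__":
    sense = "TCATG"
    template = reverse_complement(sense)
    print("sense:   ", sense)
    print("template:", template)
    print("RNA:     ", transcribe_template(template))
```

Transcribing the Template Strand and simply substituting U for T in the Sense Strand give the same RNA, which is why gene sequences are conventionally quoted from the Sense Strand.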

Processing of the hnRNA to mRNA involves excision ('splicing out') of the intron transcripts and ligation of the remaining exon transcripts. Once the final mRNA is formed, translation is the process of reading (as amino acids) a series of three-base sequences called codons. Codons are read according to the Genetic Code, which is an RNA code. Because the mRNA region is equivalent to the DNA exons, the same series can be identified in the Sense Strand (substituting T for U). The three-base DNA motifs are sometimes called "codons"; however, this is again technically incorrect and confuses the information content of Genes with the function of RNA in the Genetic Code. The DNA equivalents to codons can be referred to as 'triplets.'
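As an illustration of the processing step, the sketch below joins exon transcripts out of an hnRNA string and then reads the result three bases at a time. The coordinates and the short sequence are invented for the example, and exon positions are simply supplied by hand; nothing here attempts to recognise splice sites.

```python
def splice_hnrna(hnrna, exon_spans):
    """Join the exon transcripts given as (start, end) half-open spans.

    Everything outside the spans is treated as intron transcript and dropped.
    """
    return "".join(hnrna[start:end] for start, end in exon_spans)


def codons(mrna):
    """Split an mRNA into consecutive three-base codons from position 0."""
    return [mrna[i:i + 3] for i in range(0, len(mrna) - 2, 3)]


if __name__ == "__main__":
    # Hypothetical transcript: exon transcript, intron transcript, exon transcript.
    hnrna = "AUGGCC" + "GUAAGUCCCAG" + "UAA"
    mrna = splice_hnrna(hnrna, [(0, 6), (17, 20)])
    print(mrna)
    print(codons(mrna))
```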

In bioinformatics, the 64 triplets are sometimes presented as a "translation table" that can be used directly with the DNA Sense Strand sequence to infer the protein sequence. This is practical, except that "translation" here means 'extraction of coded information', which is not the same as the molecular process of mRNA translation.
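Such a table is small enough to build by hand. The sketch below constructs the standard genetic code as a lookup on DNA triplets (bases ordered T, C, A, G) and reads a Sense Strand directly. The function name and the example sequence are illustrative, and the code assumes the reading frame starts at the first base.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, one letter per triplet in TTT, TTC, TTA, TTG, TCT, ... order.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
TRIPLET_TABLE = {
    "".join(triplet): aa
    for triplet, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_sense_strand(sense_5to3):
    """Infer the protein from a DNA Sense Strand, stopping at the first stop triplet."""
    protein = []
    for i in range(0, len(sense_5to3) - 2, 3):
        aa = TRIPLET_TABLE[sense_5to3[i:i + 3]]
        if aa == "*":  # stop triplet
            break
        protein.append(aa)
    return "".join(protein)


if __name__ == "__main__":
    print(translate_sense_strand("ATGGCCTGGTAA"))
```

This is a table lookup on DNA triplets, not a model of what the ribosome does with mRNA codons.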


Introns, Exons, and So-ons (Part II)

From Keith Robison: Okay Jeff, I think you are close to understanding the argument, but it is still eluding you.

    It is improbable to find ORFs in random sequence. "The only way a gene longer than 600 nts could originate was to select some short reading frames and splice them together by editing out the intervening regions containing many stop codons."

Do you understand? He is claiming that the splicing process enables the formation of long ORF-bearing sequences. The long ORFs are in the spliced mRNA, not in the DNA. But because the initial sequence is random, the splicing signals will be randomly distributed. And because the splicing signals are much bigger than translational stop signals, and unrelated to them, the output spliced mRNA sequence will look statistically like the input random DNA sequence. So there must be another source of information in order for this to work.

In brief, your chance of finding a long ORF in the spliced mRNA transcribed from random DNA sequence is identical to the chance of finding a long ORF in the unspliced random DNA.

From Wesley R. Elsberry: I'm intrigued. Why do you think the timing of the cut & paste job makes a difference, such that Robison's point no longer applies?

JM: It is important to our discussion because the random DNA looks like eukaryote DNA. If the splicing was done in the pond, before the seed cell was formed, then the genes would not have had exons and introns. Dr. Senapathy's theory and introns-early are very closely related.

From Keith Robison: (reprise) But because the initial sequence is random, the splicing signals will be randomly distributed. And because the splicing signals are much bigger than translational stop signals, and unrelated to them, the output spliced mRNA sequence will look statistically like the input random DNA sequence. So there must be another source of information in order for this to work.

JM: So, if the stop codons that would end a gene are not related to the splicing signals that start and end an exon or intron, then, after splicing, there would still be stop codons all over the place in the exons (and so it really wouldn't be a gene). I believe that summarizes your point, so you have no doubt now that I understand your point.

(reprise) In brief, your chance of finding a long ORF in the spliced mRNA transcribed from random DNA sequence is identical to the chance of finding a long ORF in the unspliced random DNA.

JM: All this assumes that the stop codons and the splicing signals are not related to each other. However, starting on page 244, Senapathy explains that the splicing signals are related to the stop codons and that the splicing mechanism must have come about through a selection process so as to achieve this relationship. He writes: "This system [of distinguishing between exons and introns] must have been primarily able to distinguish between what is a reading frame and what is a stop codon." Continuing on page 245 he shows that stop codons are correlated with the splice sites and that "the mechanism that identified genes consecutively selected its successive exons by looking for stop codons while reading a random sequence from 5' to 3' ... the splice junction sequences which contain these stop codons must have originated due to these reasons, and serve as molecular signals for the exon-splicing process."

This may sound like he's imparting an intelligence to the splicing process, but that is not so, just as there is no intelligence behind the putative workings of natural selection. Senapathy is saying: (1) we see long reading frames in life, (2) it is apparently necessary to have long reading frames for life (at least life as we know it), (3) the splicing mechanism that works must be one that results in long reading frames, and (4) this is confirmed by finding a correlation between the locations of stop codons and the "resulting" splice signals. If this particular mechanism (or some other viable one) had not come about, we wouldn't be here to ponder it.

Keith: Senapathy is just plain wrong. For a careful analysis of splicing signals, see:

  • Stephens & Schneider, J Mol Biol, 228: 1124-1136 (1992)
  • Particularly the sequence logo ftp://ftp.ncifcrf.gov/pub/delila/SequenceLogoSculpture.ps
  • Sequence logos are explained in http://www-lmmb.ncifcrf.gov/

Looking at the logos tells us several things:

    The donor ("start-splice") signal has the consensus

Yes, you can find both TAA and TGA stops here, but of course only about 50% of the time. Furthermore, the stop would lie in phase 1 (between the first and second bases of a codon), and there is a slight excess of phase 0 introns. So, for the majority of the data (phase 0 + phase 2 introns > 50% of all introns), this is a poor explanation.

with C almost equiprobable -- but C predominating. Again, the resemblance to a stop codon is tenuous.

In any case, this is only dealing with the probability of finding an ORF. Senapathy's fantastic statistics claim you are likely to find isocoding sequence for known proteins. But, since the spliced random sequence has the same information content as the unspliced random sequence, the splicing process has gained nothing. We can look at known proteins and calculate their information content, which can thereby be converted to the probability of finding them in random sequence. This has been done quite nicely by Hubert Yockey, and the probability is vanishingly small.

(Side note: Yockey's book is doubtful of all origin of life scenarios on similar grounds).

From Don Cates: What does Dr. S say about the fact that differences in redundant bases in codons mimic quite well the morphological relationships across many species? E.g., take the code for some almost universally used enzyme. There are many different base sequences that can code for the enzyme. It happens that the closer two species (or sub-species or even individuals) are evolutionarily, the more the sequences are alike.

JM: Dr. Senapathy spends a lot of time talking about codon degeneracy, mostly in terms of that redundancy making the probability of finding genes for particular proteins more likely. However, as to using these redundancies when looking at similar genomes he writes (at page 434):

"Evolutionary geneticists deal with an inherent problem when they analyze protein similarities looking for assumed evolutionary relationships. They start with a prior, strongly-rooted notion of evolution. Therefore, according to them, those proteins with functional similarities have evolved from one another. Consequently, they expect the proteins to have structural similarities and sequence similarity. So if they find sequence similarity between two functionally-similar proteins or genes, they believe that it is a direct proof for Darwin's theory of evolution."

"Because evolutionists expect two proteins which are functionally similar to be evolutionarily related, they look for sequence similarity even before one knows whether these proteins have sequence similarity. When a sequence similarity is found -- which is expected simply because of the functional similarity even without evolutionary connection -- they confidently provide it as evidence for evolution having occurred. On the other hand, if there is little or less significant sequence similarity, they try to bend the methods of aligning or searching for similarity of sequences in order to "improve" the similarity."

And at page 438: "In analyzing the coding sequence of a given gene found in many organisms, there exists a phenomenon concerning the variations of codons. If we take one gene and analyze its coding sequence in many different organisms, we naturally find sequence variations. [snip] Usually there are three or four codons, with the same first two bases but different third bases, that code for the same amino acid. As a result, if we analyze the frequency of the nucleotide differences at the three possible codon positions in the sequence of a gene from many different organisms, they vary most at the third codon position, less at the second and the first. ... this phenomenon can arise when organisms were independently born -- by mutational changes of the same gene in each organism without altering the basic function of the protein ... or if two gene sequences coding for functionally the same protein arise independently of each other. But evolutionists believe that this phenomenon is due to the evolution of organisms from one another."

Don: This is the stuff I was looking for. Please note that, as far as I can tell, Dr. S's theory would predict that the distribution of differences in the third base of these codons would be random across the different "independently born" organisms. However, this is not what is observed. Organisms that are considered to be close evolutionarily are more likely to have a higher proportion of same "third bases".
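The tally Don is describing can be sketched as follows: given two aligned coding sequences, count the differences that fall at the first, second and third codon positions. This is a minimal illustration only; the two sequences are invented for the example (they are not real genes) and the function name is hypothetical.

```python
def differences_by_codon_position(seq_a: str, seq_b: str) -> list[int]:
    """Count mismatches at codon positions 1, 2 and 3 of two aligned,
    in-frame coding sequences of equal length."""
    if len(seq_a) != len(seq_b) or len(seq_a) % 3 != 0:
        raise ValueError("sequences must be aligned, in frame, and equal in length")
    counts = [0, 0, 0]
    for index, (base_a, base_b) in enumerate(zip(seq_a, seq_b)):
        if base_a != base_b:
            counts[index % 3] += 1
    return counts

# Invented example: both sequences encode Ala-Glu-Lys-Leu.
gene_x = "GCTGAAAAACTG"
gene_y = "GCCGAGAAGTTG"
print(differences_by_codon_position(gene_x, gene_y))  # [1, 0, 3]
```

Don's argument is about how such tallies compare across many pairs of organisms: pairs regarded as close relatives tend to share more third bases than distant pairs do.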

"... if two gene sequences coding for functionally the same protein arise independently of each other. But evolutionists believe that this phenomenon is due to the evolution of organisms from one another."

Don: Again, the point is not that these similarities exist. They would also exist in "special creation" if the creator used the same basic blueprints for all its creations. What is important is the pattern of the differences across different organisms. This pattern is completely consistent with evolution but requires some sort of special pleading for both Dr. S and creationists.

Do you see why I think that this information poses a problem for Dr. S (and creationists)?

JM: I think the special pleading is on your side. Although Senapathy did not use the word "pattern" (because he would say there is no pattern), is it not true that the pattern you mentioned has been the primary foundation for the evolutionary tree? If so, then you cannot argue that the evidence supports evolution when it was that evidence that was used to create the tree!

Don: Arlin Stoltzfus countered this argument quite succinctly a long while back. The important "pattern" we see is that when we superimpose trees generated from different data (e.g. different genes, different morphological features), they are nearly always congruent. This is what Senapathy's "independent birth" theory cannot explain, except by the "special pleading" of genome reuse.

From Keith Robison: ... immutable you say, I don't think that you mean immutable, since genomes are clearly shown to be "plastic" in many ways.

JM: Here, the term "immutable" means that no new genes could come about.

Keith: And again, Senapathy bats .000 here. We know of examples of new genes arising (e.g. jingwei), and know of many more mechanisms which could form genes. Note that Senapathy's scenario must invoke a lot of some of these mechanisms in order to explain away some unpleasant facts.

For example, bacterial genomes are mostly intron-free and some eukaryotic microbial genomes are either intron-free or intron-poor (as are organellar genomes). Senapathy must invoke large amounts of intron-loss through exon-fusion. But there's no particular reason two exons of the same transcription unit must be fused -- fusions could just as easily occur between unrelated genes. Each such fusion is potentially a new gene, with new properties.

And so the trend continues. Senapathy's assertions are almost universally either contrary to known data or require implausible assumptions.

JM: I cannot find any discussion in The Book about how the introns were deleted to form prokaryotes. Did I miss that? If not, why do you assume there is only one method to remove introns? I guess my problem here is: are you assuming a particular mechanism to remove the introns, and why must that be the method Dr. Senapathy would have to use when he does not even discuss this?

Keith: I'll have to dig through it, but I believe it's there. In any case, it's mostly not a question of mechanism. If Senapathy is right, then somehow all those introns must have been lost, and that alone represents an enormous degree of evolution.

There are basically two ways of losing an intron. One, recombination between the genome and a reverse-transcribed mRNA, can potentially "cleanly" excise introns. The other possibility is a genomic deletion that excises the intron.

Note that both mechanisms, within the known properties of genomes, are likely to lead to some degree of novel gene formation. While recombination with a reverse-transcribed mRNA would generally tend to cleanly erase introns, the presence of repetitive sequences in the mRNA (not unheard of) could cause recombination elsewhere in the genome, leading to new, chimaeric genes. Similarly, deletions are likely to cause fusions between adjacent genes. Either way, at some frequency new genes will be acquired by the genome and made available for evolution.

From Ralph M Bernstein: There aren't too many theories on how introns were lost, that's why. Can you and Senapathy propose another method? The best one that I know of is WF Doolittle's "genome streamlining" -- very simply: because of faster replication times and less need for the regulatory aspects of introns, they were "streamlined" out.

From Keith Robison: It is interesting to ask, even if Senapathy could get the ORF-probability calculations right, what the probability is of finding a particular gene in a Senapathian pond -- is it anywhere in the ballpark of Senapathy's calculations?

In his book Information Theory and Molecular Biology, Hubert Yockey calculates the information content of the protein cytochrome c. That is, based on an alignment of many cytochrome c's, we can estimate the degree of plasticity allowed -- how much change can the protein tolerate and still function as cytochrome c. The information content is directly convertible to the probability of finding a cytochrome c sequence at random from an ORF of similar length.

iso-1-cytochrome c has an information content of 373.6 bits. Therefore, the probability of finding a cytochrome c at random is

Real data is not kind to Dr. S.
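The conversion Keith refers to can be sketched as below. The assumption here, which is not spelled out in the post, is the standard relation p = 2^(-I) for an information content of I bits; the function names are made up for this example, and only the 373.6-bit figure comes from the post.

```python
import math

def bits_to_probability(info_bits: float) -> float:
    """Probability of hitting a message with the given information content
    by chance, assuming p = 2 ** (-info_bits)."""
    return 2.0 ** (-info_bits)

def bits_to_log10_probability(info_bits: float) -> float:
    """Same quantity as a base-10 exponent, which is easier to read."""
    return -info_bits * math.log10(2)

CYTOCHROME_C_BITS = 373.6  # figure quoted from Yockey in the post
print(f"p = {bits_to_probability(CYTOCHROME_C_BITS):.1e}")
print(f"log10(p) = {bits_to_log10_probability(CYTOCHROME_C_BITS):.1f}")
```

Under that assumption the probability is on the order of 10^-113 to 10^-112, consistent with the ">10^100" figure Keith uses later in the thread.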

JM: By "ORF," do you mean a long, open reading frame of a gene (w/o introns) or just the reading frame of an exon? If you mean of a gene, then your calculation has nothing to do with the probability of finding a part of that gene, an exon, which is what Dr. S is computing. If you mean an exon, then the probability calculation (chance of finding a given exon in a run of random DNA) is straightforward and I don't see how the esoteric use of information content is helpful -- how does it apply?

Keith: This calculation is estimating the probability of finding a cytochrome c once you have generated a translatable mRNA.

Jeff, I'm surprised at you. You are always calling for rigorous estimates. That is exactly what the information theory approach tries to be -- a rigorous estimate of the probability of finding a functional cytochrome c sequence in a mountain of random peptide sequence. Senapathy can splice and dice all he wants -- but unless you believe the splicing process can generate >10^100 possible messages we ain't going to see a cytochrome c (which would be a pretty good trick with 10^30 nucleotides!).
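Keith's comparison of 10^30 nucleotides against more than 10^100 needed messages can be put as an expected count. The sketch below works in base-10 logarithms and makes a deliberately generous assumption that is not in the post: every one of the 10^30 nucleotides starts one independent candidate message. The function name is hypothetical.

```python
import math

INFO_BITS = 373.6     # cytochrome c information content quoted in the thread
LOG10_TRIALS = 30.0   # generous assumption: one candidate per nucleotide in 10^30 nt

def log10_expected_hits(info_bits: float, log10_trials: float) -> float:
    """log10 of (number of trials) x (per-trial probability 2 ** -info_bits)."""
    return log10_trials - info_bits * math.log10(2)

exponent = log10_expected_hits(INFO_BITS, LOG10_TRIALS)
print(f"expected cytochrome c sequences ~ 10^{exponent:.1f}")
```

Even with that generous trial count, the expected number stays more than 80 orders of magnitude below one, which is the gap Keith is pointing at.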

From D.R. Forsdyke: A new introns-early theory is presented in the September and November issues of Molecular Biology and Evolution (volume 12, 949-958 for the September issue), in a paper entitled "A stem-loop 'kissing' model for the origin of introns and recombination".

About a year ago Nature rejected the following letter on the topic which might be of interest to readers of this newsgroup.

ALTERNATIVE INTRONS-EARLY THEORY

SIR - In his News & Views item entitled "The uncertain origin of introns"(1), Laurence Hurst presents some of the arguments for "introns early" (the Gilbert school(2)) and "introns late" (the Stoltzfus school(3)). Both schools seem not to have noticed that introns interrupt both coding and non-coding parts of genes(4). It has long been known that genes for rRNAs and tRNAs contain interruptions, but these may be special cases. Recently, however, "mRNAs" have been discovered which have no protein product. The corresponding genes look like most protein-encoding genes, and possess multiple introns(5). Thus, introns interrupt genetic information, not just protein-encoding information. It is not too surprising then, that it is difficult to associate exons with domains of protein structure or function(2,3). It does not follow that this disposes of the introns early viewpoint. There may be other exon theories of genes, as well as "the" exon theory of genes (i.e. "the" introns early theory).

One alternative exon (introns early) theory can be derived from the growing evidence for involvement of stem-loop structures in recombination(6-12), a process which should have arisen early in evolution. In the early "RNA world"(13) it is likely that exchange of segments between prototypic replicators would have been advantageous(14). Thus, if it were possible for recombination to have arisen early, it would have done so. Mutations which favour recombination would have affected either the enzymes (ribozymes) involved in recombination, or their substrate, RNA itself (hence stem-loops). Eventually the RNA world gave way to the DNA world, but stem-loop potential remained. Consistent with this, stem-loop potential is abundant and widely dispersed in modern genomes(12).

The basic postulate of the proposed alternative exon theory of genes is that stem-loop potential was widespread in genomes from an early stage. Information for new functions as they arose had to compete with the information for the stem-loop-forming function (i.e. complementary bases in the stems). In the case of protein-encoding functions the conflict was managed in three ways. First, synonymous codons were used so that a sequence could at the same time both optimize its folding propensity and encode a protein. If this failed, then conservative amino acid exchanges were accepted to widen the range of codon choice without impairing protein function. Finally, if these failed, the protein was permitted only to evolve in segments interrupted by regions of high stem-loop potential. Remarkably, traces of this primitive arrangement can be discerned in some modern genes(12). In the compact genome of C. elegans, stem-loops are abundant and 43% of these occur in introns, which represent only 20% of the genome(15).

  1. Hurst, L.D. Nature 371, 381-382 (1994).
  2. Gilbert, W. & Glynias, M. Gene 135, 137-144 (1994).
  3. Stoltzfus et al. Science 265, 202-207 (1994).
  4. Hawkins, J.D. Nucleic Acids Res. 16, 9853-9905 (1988).
  5. Pfeifer, K. & Tilghman, S.M. Genes Devel. 8, 1867-1874 (1994).
  6. Sobell, H.M. Proc. natn. Acad. Sci. USA 69, 2483-2487 (1972).
  7. Wagner, R.E. & Radman, M. Proc. natn. Acad. Sci. USA 72, 3619-3622 (1975).
  8. Doyle, G.G. J. Theor. Biol. 70, 171-184 (1978).
  9. Kleckner, N. & Weiner, B.M. Cold Spring Harbor Symp. Quant. Biol. 58, 553-565 (1993).
  10. Kleckner, N., Padmore, R. & Bishop, D.K. Cold Spring Harbor Symp. Quant. Biol. 56, 729-743 (1991).
  11. Reed et al. J. Mol. Evol. 38, 352-362 (1994).
  12. Forsdyke, D.R. FASEB.J. 8, 1395A (1994).
  13. Joyce, G. F. & Orgel, L. E. The RNA World, 1-25 (Cold Spring Harbor Laboratory Press, New York, 1993).
  14. Bernstein, C. & Bernstein, H. Aging, Sex and DNA Repair, (Academic Press, San Diego, 1991).
  15. Wilson et al. Nature 368, 32-38 (1994).

If you look at the current phylogenetic distribution of spliceosomal introns, they are restricted to eukaryotic genomes. Currently the best estimate of global phylogeny is a tree where the root lies between the eubacteria on the one hand, and an archaebacterial/eukaryotic clade on the other hand. Both eubacteria and archaebacteria lack ANY spliceosomal introns. So if introns were present in the common ancestor of all these lineages they must have been COMPLETELY extinguished in both eubacteria and archaebacteria. If one then considers what is known about eukaryotic phylogeny, and one considers information about the frequency of introns in eukaryotic lineages, it becomes clear that high intron density (more than 4/kilobase) is restricted to recently evolved clades such as animals, plants and SOME fungi. The deepest eukaryotic lineages -- protists like Giardia, Trichomonas, Trypanosomes, Entamoebids and Heteroloboseans -- either completely lack introns (as far as we can tell now) or have them at very low densities. Thus, multiple independent outgroup lineages to animals, plants and fungi appear to have few if any introns. It is likely that the common ancestor of all eukaryotes, if it had introns, had very very few (much less than 1 per kilobase of mRNA). The alternative "introns early" interpretation is that introns keep on getting cataclysmically lost multiple independent times in evolution, yet are mysteriously retained in the common ancestors of all of the eukaryotic lineages. This is just not very parsimonious. We would not wish to argue that fingernails were ancestral to all life simply because some vertebrates have them -- I suggest that we shouldn't argue that high intron density is ancestral to all life simply because some recently evolved eukaryotic clades have it.

The problem with any introns early theory which concerns itself with spliceosomal introns is that the phylogenetic evidence suggests that they are NOT ancient. This doesn't mean that stem-loops couldn't have played an important role in the RNA world -- it's just that they likely never turned into spliceosomal introns.

From Arlin Stoltzfus: (quoting D.R. Forsdyke): "Thus, introns interrupt genetic information, not just protein-encoding information. It is not too surprising then, that it is difficult to associate exons with domains of protein structure or function(2,3)."

Arlin: Uh, it was surprising to the many who believed that protein genes evolved originally by combinatorial assembly of exons, each exon contributing to some discrete structural or functional feature of the protein. This view, which was called the "introns-early" view until about a year ago (:->), was presented as almost-established-fact in several textbooks from the 1980's.

Quoting Forsdyke: "It does not follow that this [lack of correspondence] disposes of the introns early viewpoint."

Arlin: This is not what was argued in ref. 2. Instead, it was argued that the weight of phylogenetic evidence (among other things) strongly favored a recent origin of spliceosomal introns, given that they are found only in some eukaryotes. To propose that spliceosomal introns as a family are ancient is like proposing that meiosis or mitochondria or microtubules are ancient. No one would even consider such a view unless there were some compelling logical or empirical grounds to doubt the clear (phylogenetic) evidence that these are derived characters. In the case of introns, it was felt that there really was specific evidence -- namely a general exon-protein correspondence -- that could only be accommodated by an introns-early view. The point of ref. 2 was that the absence of any reliable evidence for such a correspondence, though it does not constitute proof, deprives the introns-early view of its only evidentiary argument.

Quoting Ralph Bernstein: "I think the point of this was to shore up the idea of introns early. The 'kissing-loop' idea is a really strong support of this concept."

Arlin: I fail to see how this "shores up" the introns-early position. It has been adequately demonstrated by Forsdyke and others that phylogenetically widely dispersed genomes have a statistical excess of inversed-repeat sequences over random expectations, even when local base composition is taken into account. This includes organisms with and without introns.

In organisms with introns, Forsdyke suggests on arguable grounds that inversed-repeats are more common in the introns than the exons. This is interpreted, again arguably, to mean that the introns were always there, and (again arguably) that they exist for the sake of containing the inversed-repeat sequences so as to stimulate recombination.

The conclusion that the excess is due to selection is arguable because the alternative that the inversed-repeats arise (whether in introns, in exons, or in intronless bacterial genomes) due to mutational biases is simply never addressed. Instead, it is assumed that all deviations from randomness must arise from selection subsequent to mutation.

The suggestion that the introns are ancient is gratuitous. It is clear from phylogenetic comparisons of intron-containing genes that most intron positions are recent acquisitions. The minor conclusion that inversed-repeats are more common in introns than in exons is also arguable because Forsdyke (see his paper in the most recent Mol. Biol. Evol.) must exclude long exons in order to support this conclusion statistically. His rationale is that the long exons are long because they were able to evolve the requisite inversed-repeat sequences without including introns. The problem with this sub-division of the dataset is that, if one holds to the strict no-insertion introns-early position, there is no such thing as an ancestral long exon. When homologues of a gene are sequenced from many different organisms, large numbers of different intron positions are found (e.g., 45 in GAPDH, 24 in TPI, probably 70 in tubulin, 40 in actin, ca. 20 in SOD, etc. -- with more introns being found every month in newly sequenced genes). A "long exon" in one organism is broken up many times by the intron positions found in homologues, whereas Forsdyke's view would imply that a long inversed-repeat-containing exon in one organism represents an ancestral state that does not need to be broken up by introns. If one does not hold to the strict introns-early position, and instead allows that all or most intron positions have arisen recently (as is obvious from the data), then the inversed repeats may simply have arisen recently in introns, more commonly than in exons. And finally, if a long exon arose by deletion of intervening introns, yet was still able to evolve inversed-repeats, then this suggests yet again that the inversed repeats do not have to be ancient. So, any way one looks at it, one must allow that inversed repeats can arise recently in introns and exons, so that there is no need to propose additionally that the specific pattern of inversed repeats is ancient.

More importantly, the likelihood that all or nearly all spliceosomal intron positions are recent (i.e., subsequent to eukaryotes) in origin in no way contradicts Forsdyke's major suggestion that the inversed-repeats exist to stimulate recombination. If there is indeed selection favoring the genesis of inversed-repeats for the sake of recombinational pairing, then such repeats will arise in introns, exons, intronless bacterial genes, in inter-genic spacers, and in repeat DNA (wouldn't this be the best way to do it -- have a self-replicating repetitive family bearing inverted repeats, that could spread throughout the genome?). Again, as Forsdyke argues in his recent paper, if constraints on sequences are lower in introns than exons, the inversed repeats will be more likely to arise and be maintained there, rather than in the exons.

Although the "kissing" theory doesn't shed light on the origin of introns, it does point to a general feature of genomes (with or without introns) that requires an explanation, probably a very interesting one.

JM: He has done computer simulations with random DNA, and he reviews this work on pages 273-288. It did not involve a full-length run of DNA (10^30 nts), but enough simulated DNA was used to search for genes and other things.

From Dave Oldridge: Nope. Computer simulation just won't cut it here. His simulation assumes too much of the theory is true to be a true test. Computer emulations can sometimes help us disprove a theory or support it, but I want to see physical (in this case biological) tests.

All that happened was that a program that Senapathy wrote (or had written) behaved according to his expectations. He may (I won't concede that without seeing the whole program) have shown that his thesis is possible; he has not yet shown that it is probable.

And recent work on self-replicating molecules points in a somewhat different direction. I can't remember the exact reference, but I'm sure someone will come up with it. In the past year I read in Scientific American of an experiment with some very simple self-replicating molecules that showed that, even at this level, mutation and selection can occur. It begins to seem quite likely that DNA itself is the product of an evolution.

From Keith Robison: (quoting JM) However, Senapathy provides much detail, based on his own research over many years, for the most important parts of the theory -- the formation of genes from random DNA.

Keith: Jeff, you have never answered my arguments from information theory about how absurd Senapathy's theory of gene formation is. To wit: the probability of assembling a gene is not aided by splicing, and in any case the amount of information needed to build a modern organism rules out finding a working genome in Senapathy's pond.

JM: Well, I cannot say that I accept your information theory argument as being relevant to this. I've got Yockey's book from the library (it's sitting right here by my feet, as you say), and if you tell me exactly where to look for any parts that are relevant to finding split genes, then maybe I can get past that barrier. Anyway, Yockey doesn't like the pond idea AT ALL -- so how does he figure things got started?

Keith: Jeff, it does not matter if the genes are split or not. In order for a genetic message to be readable, there must be a way of decoding the message without knowing the message in advance. Senapathy's stuff works only because he is looking for known messages -- he has provided no mechanism for decoding the random sequence into intelligible messages. We've been over all this before -- stop codons are not splice sites!

JM: OK, I can forget about the stop codons. But eukaryote genes are random (according to Dr. Senapathy's research), and the splice signals are not chosen by Senapathy -- they are the result of chemical processes that just happen to work correctly, hence we are alive to examine them. If there were a different set of splice signals, we would be contemplating those instead. You once said I had proved you did not exist. It seems to me you are now proving that life does not exist. It does not matter who says it is improbable to find genes in DNA -- they are there, and the eukaryote DNA they are found in is random. I must be missing something in your argument. Could you try again, or refer me directly to Yockey's discussion about this?

JM: (continuing) I cannot say that I accept your information theory argument as being relevant to this. I've got Yockey's book from the library (it's sitting right here by my feet, as you say), and if you tell me exactly where to look for any parts that are relevant to finding split genes, then maybe I can get past that barrier.

Keith: The real beauty of the IT approach in this situation is that splicing is irrelevant. The IT estimate is the probability of finding such a pattern at random after any arbitrary sequence of deterministic transformations. I.e., it does not matter if you invert the whole sequence, translate them according to a table, etc., so long as you follow deterministic rules (this is the whole reason Shannon invented the theory after all -- to predict the behavior of messages under compression, encryption, etc.).

So Senapathy's theory of the emergence of modern genes and genomes from random soup is completely preposterous from a statistical standpoint.

JM: As for the odds of finding an entire organism, then I will "reuse" your argument that the odds of you or me existing are so great that we cannot exist. OK, so that doesn't strictly apply -- I just couldn't help myself. How about this: the state of going from non-life to life could have been multi-step. That is, you allow "selection" to operate in evolution (hence you can ADD the probabilities of each step), and there very well could have been some form of "selection" in the pond (although not at a living level) that caused the results of chemical processes in the pond to migrate from pure garbage to viability. That is, the results of certain chemical processes could have contributed to those processes remaining around. As I've said before, this part of the new theory may require your imagination, and you refuse to let yourself think along these lines. (This has nothing to do with Darwin. -- Keith: how do YOU explain the origins of life?)

Keith: Good question -- I don't really try to. Clearly there was a transition from clearly-not-life to clearly-life, probably going through a long phase of somewhere-in-between. However, that early clearly-life stage was much simpler than the common ancestor of all living things (CAoALT), and the CAoALT was a descendant.

From Keith Robison: (via email) What I was saying is not that life is improbable (we'll get to that), but that the Information Theory says that Senapathy's scenario is improbable -- modern genes will not spring in toto from as small a pool of random sequence as he posits (I hope you did get a good laugh out of his claim that the probability of finding 1 gene = the probability of finding all genes).

JM: I know he discussed the probabilities of finding one gene and of finding any gene, but I don't recall that he equated "one to all." Where was that? Can you give me a page, date, or other reference?

Keith: Page 288 -- you can't miss it :-)

JM: Once again, I have no trouble finding an explanation. Regarding: "The probability for finding millions of genes is the same as the probability for finding one gene." Just read a little further. In the last paragraph on the following page, you will find: "if one typical gene could probabilistically occur in the USP, then almost any gene for any particular biochemical function ... would occur in the USP." So, since he can find one "typical" gene in the USP and since there are millions of other genes with similar characteristics to that typical gene (needed for the "multitudes of unique biochemical functions"), he can find any of those other genes ("almost any gene for any particular biochemical function"), all millions of them, in the same USP each with the same probability. He is NOT saying P(1) = 10^6 * P(1), as you alleged. You saw a patently ridiculous statement there -- because that was what you were looking for. I agree that his statement on page 288 is confusing and it could have been worded more clearly, but you refused to see his true meaning, and instead were complaining about style.

Keith: No, it's typical Senapathian sloppiness. Think about how you would really go about calculating the probability of finding every known gene given the probability of finding one gene. This is just simple statistics (a place where Dr. S. tends to slip up frequently).

JM: OK, I have been thinking.

1. On page 288, he is not talking about "every known gene." He is clearly discussing any one gene of a group of millions of genes that are similar to a "typical" gene. That was the subject of my previous message.

2. Given any gene ("g") that is similar in length and exon/intron makeup to the given "typical" gene "t" (specifically, it has no exons that are longer than those found in the typical gene), then I would compute the probability of finding that gene as follows:

That's what he's doing on page 288.

3. The probability of finding every one of those known genes is:

where "n" is the number of known genes. But I don't see what sense that makes. For a certain seed cell, I only have to find the 20,000 or so particular genes needed. So:

Even this is like asking me to compute the probability of "Keith" or "Jeff" being conceived. You taught me that doing that is senseless, and this is, too, for the same reason. So let's back up and look at the problem.

Please read again page 287 which leads up to his "millions of genes" section. He is not computing probabilities there, he is computing the amount of DNA needed to find (with a very high likelihood) a typical gene and also find one with reasonable intron lengths. Once he computes how much DNA is needed, then you will find ANY and ALL such typical (or shorter) genes in that length of DNA.

This is analogous to computing how long of a string of random letters would be needed to find a certain three-letter word ("NOT") with a very high likelihood (you need the expected mean length times six -- about 10^5 characters), and then saying you can find ANY three-letter word ("AAA" through "ZZZ") in that random string. (Reference: pages 223-225 and Chapter 7 footnotes 9 & 10.) There is NOTHING wrong with that logic or that math. He computes the amount of DNA needed using the same math as for the three-letter word example. Finding any long (600 nt) exon seems ridiculous, but, given enough random DNA, it is not.
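The arithmetic behind the three-letter-word analogy can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a 26-letter alphabet, independent and uniformly random letters, and a Poisson-style approximation in which the chance of missing a given word in a string of length L is about exp(-L / 26^3). The function names are made up for this sketch and the approximation is not taken from the book.

```python
import math

ALPHABET_SIZE = 26   # letters A-Z (assumption of this sketch)
WORD_LENGTH = 3      # e.g. "NOT"

def mean_length(alphabet_size: int, word_length: int) -> int:
    """Number of distinct words, used as the expected mean spacing between hits."""
    return alphabet_size ** word_length

def prob_found(string_length: float, alphabet_size: int, word_length: int) -> float:
    """Approximate chance that one given word occurs at least once."""
    return 1.0 - math.exp(-string_length / mean_length(alphabet_size, word_length))

mean = mean_length(ALPHABET_SIZE, WORD_LENGTH)           # 26**3
needed = 6 * mean                                        # "mean length times six"
p_one = prob_found(needed, ALPHABET_SIZE, WORD_LENGTH)   # 1 - exp(-6)
print(mean, needed, round(p_one, 4))
```

Six times the mean length comes to 105,456 characters, which matches the "about 10^5" figure quoted above, and under these assumptions the chance of finding one chosen word in a string of that length is 1 - exp(-6).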

BTW, not all 600 nt runs would be a valid exon, so not all combinations of 600 nt will ever need to be found. That is what you mean, I assume, by "every known" gene -- only the valid, known ones.

Although you point out that there are a few exons longer than 600 nt, most are also much shorter than 600 nt. Why there are a few longer ones is a good question, but that one needs to be answered in a different discussion (there are plenty of exceptional cases allowed to evolution, too). For the time being, I think Senapathy is being generous by using 600 nt so often rather than a smaller value.

Now, do you want me to find an entire genome in the random DNA? OK, I still can, although with a lower degree of certainty. But we're dealing with likelihoods so close to one that even raising them to the 20,000 power (the number of genes in an average genome) does not hurt much, and you can use more DNA very easily because there is plenty more DNA available. The probabilities of finding all genes needed for one genome will be smaller, by the 2x10^4 power, but look at the numbers on pages 287-288:
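How much raising a likelihood to the 20,000th power "hurts" depends on how close to one it is, and this can be checked numerically. The sketch below assumes each gene is found independently with the same probability. The per-gene likelihoods are hypothetical placeholders chosen for illustration; they are not the figures from pages 287-288, which are not reproduced in this exchange.

```python
GENES_PER_GENOME = 20_000  # figure quoted in the text above

def prob_all_found(p_single: float, n_genes: int) -> float:
    """Chance of finding all n_genes, assuming independence and equal p_single."""
    return p_single ** n_genes

# Hypothetical per-gene likelihoods, for illustration only.
for p_single in (0.9999, 0.999999, 0.999999999):
    print(p_single, prob_all_found(p_single, GENES_PER_GENOME))
```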

Keith: (reprise) No, it's typical Senapathian sloppiness.

JM: So, what problem do you find here? Can you be specific and show me the sloppiness on pages 287-288? I agree with you that the wording is confusing, but let's look at his meaning and his math, which I find very clear and understandable. You obviously don't, but I don't understand why, and I'd like to know.

Keith, this is deja vu -- we went through this process on point mutations, and you ended up agreeing that the Senapathy/Mattox math was correct, saying it was the model that must be bogus. If you recall, it was YOU who originally posted some sloppy math while saying Senapathy's math was wrong. I'd just like to resolve this one, too.

From Keith Robison: (via email) I shouldn't have done this -- neither of us seems to be very good at convincing the other, and we can just go round forever.

I think Senapathy's statement depends on how you define "every gene" -- and I would say he is clearly arguing that every known gene can be found in a pond of the size he is stating. Since neither of us can jump in his head, I don't see a real resolution.

JM: (reprise) He computes the amount of DNA needed using the same math as for the three-letter word example. Finding any long (600 nt) exon seems ridiculous, but, given enough random DNA, it is not.

Keith: True -- but you need a lot more than Senapathy's pond has. If I remember my calculations correctly, his pond contains every possible 50-mer or so -- but certainly only a small fraction of the 600-mers.
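The gap between 50-mers and 600-mers is easy to quantify, since there are 4^k distinct DNA sequences of length k. The Python sketch below only compares the sizes of the two sequence spaces; the pond size itself is not stated in this exchange, so no fraction is computed. The function names are illustrative.

```python
import math

def kmer_count(k: int, alphabet_size: int = 4) -> int:
    """Number of distinct sequences of length k (4 bases for DNA)."""
    return alphabet_size ** k

def log10_kmer_count(k: int, alphabet_size: int = 4) -> float:
    """Base-10 logarithm of kmer_count, convenient for very large k."""
    return k * math.log10(alphabet_size)

for k in (50, 600):
    # Print the order of magnitude of the number of possible k-mers.
    print(k, round(log10_kmer_count(k), 1))
```

The number of possible 50-mers is on the order of 10^30, while the number of possible 600-mers is on the order of 10^361.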

JM: (reprise) For the time being, I think Senapathy is being generous by using 600 nt so often rather than a smaller value.

Keith: Of course, the key point here is that the distribution of exon lengths looks nothing like what Senapathy claims (it's not a simple exponential dist). So there's a lot more of a problem than the long exons.

There really isn't an average genome, but rather they are stratified into a few size categories. There are many, MANY, genomes with more like 50,000-100,000 genes in them.

Okay, perhaps if S could get the calculation right for the probability of one gene, then maybe he's not completely off base. But, as I have said before, his calculation (p.287) is so grossly flawed as to make it meaningless.

  1. He neglects stop codons in the estimate.
  2. More importantly, he grossly underestimates the information content of a protein (as I've pointed out before).
  3. The optimization approach he describes at the bottom is completely out of left field -- it has no basis in biology.

Anyway, we could both go on at this forever. I think your time would be much more productively spent if you possibly did some of the following:


C4b-Binding Protein

Marcin Okrój, Anna M. Blom, in The Complement FactsBook, 2018

Protein Modules

C4BPα
Residues             Module                               Exon
1–48                 Signal sequence                      exon 2
49–109 (Uniprot)     CCP domain 1                         exon 3
111–172 (Uniprot)    CCP domain 2                         exon 4/5
173–236 (Uniprot)    CCP domain 3                         exon 6
237–296 (Uniprot)    CCP domain 4                         exon 7
297–362 (Uniprot)    CCP domain 5                         exon 8
363–424 (Uniprot)    CCP domain 6                         exon 9
425–482 (Uniprot)    CCP domain 7                         exon 10
483–540 (Uniprot)    CCP domain 8                         exon 11
541–597              C-terminal oligomerisation domain    exon 12

C4BPβ
Residues             Module                               Exon
1–17                 Signal sequence                      exon 1
21–78 (Uniprot)      CCP domain 1                         exon 1/2
79–136 (Uniprot)     CCP domain 2                         exon 3
137–193 (Uniprot)    CCP domain 3                         exon 4/5
193–252              C-terminal oligomerisation domain    exon 5/6

Biologists find invasive snails using new DNA-detection technique

Biologists led by the University of Iowa used a special technique called eDNA to discover an invasive species of tiny snails in streams in central Pennsylvania where the snails' presence had been unknown. The invasive New Zealand mud snail has spread to the Eastern Seaboard after arriving in the western United States decades ago. Credit: Edward Levri, Pennsylvania State University-Altoona

Invasive species, beware: Your days of hiding may be ending.

Biologists led by the University of Iowa discovered the presence of the invasive New Zealand mud snail by detecting their DNA in waters they were inhabiting incognito. The researchers employed a technique called environmental DNA (eDNA) to reveal the snails' existence, showing the method can be used to detect and control new, unknown incursions by the snail and other invasive species.

"eDNA has been used successfully with other aquatic organisms, but this is the first time it's been applied to detect a new invasive population of these snails, which are a destructive invasive species in fresh waters around the world," says Maurine Neiman, associate professor in the Department of Biology and the study's co-author. "eDNA can be used to find organisms at really early stages of invasion, so it can detect a population even when there are so few of the organisms that traditional methods would never find them."

The biologists traveled to central Pennsylvania seeking evidence of the presence of the mud snail, which for decades has been spreading in fresh waters in the continental United States, beginning in the Northwest, moving to the Great Lakes, and now migrating along the Eastern Seaboard. The tiny aquatic snails' population densities can balloon to more than 500,000 individuals in a square yard, covering the water bottom and crowding out native species.

The researchers collected samples from eight sites spread across six rivers in the Susquehanna River watershed, which feeds into Chesapeake Bay and the Mid-Atlantic watershed. Six of the sites had no reported cases of the mud snail, despite physical surveys, while the other two locations had not been studied.

The researchers used the eDNA technique to look for DNA the snails would leave as tracers in sloughed-off skin cells or bodily waste. They discovered the snails were there, after all: The eDNA results confirmed the mud snails were at one site where none had been detected previously, and were likely at low population levels at other sites as well.

"This study presents an important step forward in demonstrating that eDNA can be successfully applied to detect new P. antipodarum invasions and will allow us to more accurately track and potentially halt ongoing range expansion of this destructive invasive species," wrote James Woodell, a research support technician at University of Hawaii at Mānoa who performed the research while a master's student in biology at Iowa and is the study's corresponding author.

The eDNA technique was developed less than a decade ago. It has been used to ferret out invasive species, including fish, frogs, and crustaceans, in aquatic ecosystems. For this study, the biologists refined the filtering protocols from an existing eDNA sampling system for mud snail detection and tested it for the first time in the field.

The study, "Matching a snail's pace: Successful use of environmental DNA techniques to detect early stages of invasion by the destructive New Zealand mud snail," was published online on June 1 in the journal Biological Invasions.

