Posted by: Kay at Suicyte | September 6, 2007

Up to 95% of junk is just junk! (updated)

I have noticed that during the last few weeks, after a transient period of high activity, my blog readership is on its way down to the 5-6 hits per day that I know from the first two months. Thus, it is probably time to do something about it, which means: blog about a highly controversial topic. My pick for today is an old favorite of mine: junk DNA. Without any quotes. And, while we are at it, I’ll do alternative splicing, too.

In a way this post is related to Jonathan Eisen’s recent rant against “adaptionomics”, which he defines as a vice common among genomics researchers: the finding of something weird in the genome leads to the (unwarranted) conclusion that this particular feature must confer a selective advantage. I don’t have a fancy name for the vice I am referring to; maybe you can come up with something suitable. I am also talking about an unwarranted conclusion, starting from the observation that something exists and leading to the claim that it must be useful for something.

To avoid being too vague here, let us just focus on two examples. First, the observation that a lot of intronic and intergenic DNA is actually transcribed (-> pervasive transcription) is widely interpreted as proof, or at least as a strong hint, that these DNA regions are not just junk but probably do something important. The second example is alternative splicing. The fact that for many genes more than one transcript variant exists (arising from alternative use of exons) is widely interpreted as biologically very important and is even considered by many to be a major factor contributing to the complexity of the so-called “higher eukaryotes”.

Pervasive transcription

I have no problem admitting that there are some important things hidden within the introns and intergenic regions. However, unless I am provided with some solid piece of evidence, I assume this to be the exception rather than the rule. First of all, I like to start from the null hypothesis that anything existing is just random noise. In my opinion, the burden of proof is on those who claim a function or importance. In the case of non-coding DNA, most people will cite the observation that what we once considered to be junk DNA is actually transcribed. I find this argument rather unconvincing. A while ago we had only junk DNA; now we also have junk transcripts. So what? I can’t remember exactly what percentage of the human genome is claimed to be transcribed, but here is my alternative explanation for pervasive transcription.

First, there is no reason to be surprised by transcription in the introns. They must be transcribed to result in pre-mRNA, which is then spliced to mature mRNA. According to the traditional model, intronic remnants of pre-mRNA are rapidly degraded and only mature mRNA is exported to the cytoplasm. Nevertheless, if your detection method is sensitive enough, you can expect to find lots of intronic RNA – it must be present in the nucleus, but we also know that splicing fidelity (see below) and quality control in the export mechanism are far from perfect. Thus, it should not come as a big surprise to also detect intronic RNA in the cytoplasm.

Transcription in the intergenic regions is not so easy to explain, but there are at least two contributing mechanisms that do not require any functional importance of junk DNA. Several genes are known to have ridiculously long 3′ UTR regions. This property is well documented for only a few genes, but I have seen lots of examples during the long years when I had to manually analyze SAGE tag assignments. Even if a gene has a perfectly viable polyadenylation signal close to the stop codon, the fidelity of transcription termination is not perfect, and you frequently see alternative polyA sites at far downstream positions. My experience with analyzing this kind of data pushes me to the somewhat heretical suggestion that a large portion of intergenic transcription is due to very long 3′-UTR tails. I don’t think that these long tails are generally important – some of them might be, but in most cases it is just a polyA signal that has been missed.

In addition, there is another – more speculative – mechanism in place, which might contribute to measurable quantities of intergenic RNA: spurious promoters. Everybody who has tried to predict promoters from genomic sequences (without using homology data!) knows that this is not an easy task. Many of the recognition sites for transcription factors are too short to be significant. As a consequence, many consensus TF binding sites can be expected to occur far from established or predicted promoters. These spurious sites are thought to be inactive, as most meaningfully transcribed genes contain a combination of multiple TF binding sites in their promoter region. On the other hand, who guarantees that none of the spurious sites is able to initiate a certain level of spurious transcription? It might not reach the transcription levels of ‘proper’ genes, but may be sufficient to be detectable by tiling microarrays. To avoid any misunderstanding: I am not suggesting that the presence of a TF binding site far away from any known genes implies that there is something that needs to be transcribed. I am talking about purely random TF binding sites with consensus sequences that are so degenerate that you would expect thousands of copies throughout the genome.
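To put a rough number on that last claim, here is a back-of-the-envelope sketch. The motif and the uniform-base-composition model are my own assumptions for illustration, not data from any TF database:

```python
# Back-of-the-envelope: how often should a short, degenerate TF consensus
# occur by chance in a ~3 Gb genome? Assumes uniform base composition
# (25% each), which is a simplification; the motif below is hypothetical.
GENOME_LEN = 3_000_000_000

# IUPAC ambiguity codes -> number of bases matched at that position.
DEGENERACY = {"A": 1, "C": 1, "G": 1, "T": 1,
              "W": 2, "S": 2, "R": 2, "Y": 2, "K": 2, "M": 2, "N": 4}

def expected_hits(consensus, genome_len=GENOME_LEN):
    """Expected chance matches on both strands under the uniform model."""
    p = 1.0
    for base in consensus:
        p *= DEGENERACY[base] / 4.0
    return 2 * genome_len * p  # factor 2: both strands

# A TATA-box-like 7-mer with two degenerate (W = A/T) positions:
print(expected_hits("TATAWAW"))  # ~1.46 million chance matches
```

Even a 7 bp consensus with only two degenerate positions is expected by chance well over a million times in a 3 Gb genome, and many real TF sites are shorter or more degenerate than this.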

I don’t have solid data on the prevalence of either mechanism, but I would envisage that a combination of all mentioned ‘infidelity’ events could account for a majority of the pervasive transcription that has been reported – without requiring that any of it is ‘functional’ or important in any other way.

Alternative splicing

There are a few cases where alternative use of exons, leading to alternatively spliced transcripts, is known to be of crucial importance for an organism. A notable example is the neurexins, which exist in hundreds of different splice forms, each one having a different recognition specificity. The analysis of EST data, and more recently also of exon-array data, has led to the identification of a large body of alternative transcripts. Nowadays, alternative splicing is considered to be the rule rather than the exception. And obviously, the existence of alternative transcripts is often connected to claims of great biological importance.

Again, I am not convinced. Have you ever tried to predict a splicing pattern for a gene from the genomic sequence alone? Without resorting to homology information? Without using coding potential? I find this very hard to do (and I consider myself more intelligent than your average spliceosome). Unfortunately, the human splicing machinery has no access to the homologous mouse sequence, nor can it read ORF information or even calculate hidden Markov models. Even the best gene prediction programs (which do have access to these external resources) make lots of mistakes, and this is what I also expect from the biological splicing machinery. I consider a large proportion of what appears to be ‘alternative splicing’ to be mere ‘mis-splicing’ events. The idea here is that cells can tolerate a lot of mis-splicing, as long as enough of the proper transcript is made and none of the ‘alternative’ transcripts is harmful. Many of the mis-spliced transcripts will be short-lived anyway, due to NMD and other processes. In other cases, the encoded proteins might be short-lived, and in yet other cases the ‘alternative’ protein version just doesn’t hurt. In my alternative model, I would expect perfect splicing fidelity only in cases where the alternatives are clearly detrimental.

This discussion leaves one important question unanswered: how can we tell the meaningful alternative splicing events from the mis-splicing chaff? I am afraid this can be very difficult. It clearly helps if an alternative splicing mode is conserved over a broad range of organisms. However, this approach fails if a majority of all possible exon combinations are observable (we need a splicing HapMap!) – in this case, multi-species conservation for some of them is to be expected, even under the assumption of a purely random process.
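As a toy illustration of that caveat (all numbers invented; this is not a model of any real gene): if mis-splicing samples cassette-exon combinations at random in each of two species, a sizable number of isoforms will look ‘conserved’ purely by chance.

```python
import random

# Hypothetical scenario: a gene with n cassette exons has 2**n possible
# exon combinations. Each species independently "observes" a random
# subset of them (e.g. via EST sampling of mis-spliced transcripts).
def chance_shared(n_exons, observed_per_species, trials=1000, seed=1):
    """Average number of isoforms seen in BOTH species by chance alone."""
    rng = random.Random(seed)
    total = 2 ** n_exons
    shared = 0
    for _ in range(trials):
        a = set(rng.sample(range(total), observed_per_species))
        b = set(rng.sample(range(total), observed_per_species))
        shared += len(a & b)
    return shared / trials

# 8 cassette exons -> 256 combinations; 100 isoforms detected per species.
# The expected overlap is 100*100/256, i.e. roughly 39 "conserved" isoforms.
print(chance_shared(8, 100))
```

So two-species conservation of an individual isoform is only weak evidence when a large fraction of all combinations is observable at all.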


I have proposed a few rather boring explanations for observations that are generally considered to be exciting. The underlying assumption is that nature is not perfect, and a lot of naturally occurring biosynthetic processes have only the degree of fidelity that is absolutely required. Even in situations where high fidelity is crucial, it is not guaranteed that this fidelity resides in the biosynthetic process. We know from engineering (think silicon chips or LCD panels) that a 100% perfect product sometimes requires a prohibitively expensive production process. It is often preferable to go for 90% perfection and let a quality control step take care of the remaining 10%. I would guess that nature works like this, too. We already know some of the QC steps, with probably more waiting to be discovered. In any case, we should not be too surprised if we encounter something weird in the cell. Granted, it could be something interesting, but maybe it is just a piece of junk that hasn’t been removed yet by QC.


I have just found a blog posting at Genomicron that gives a nice historical overview of junk DNA and discusses some pros and cons. Highly recommended!


  1. I think it’s inevitable that there will be some transcription of just about everything in the genome. There’s an information cost to identifying which parts of the genome to transcribe and which not. Perfect identification would mean an infinite cost. Therefore there must not be perfect shutoff of all non-useful regions. Therefore the only interesting question is how close to perfect the identification is (“We’ve already established what you are, now we’re just haggling over the price”). The “correct” investment in transcription identification is the cost of aberrant transcription. Completely useless RNA may not cause a lot of problems if it floats around, so it may not be worth investing a huge amount of care in shutting it off.

    Another point — and I don’t know if this is formally true (though I would be interested in a formal treatment of it), but it makes intuitive sense to me — is that a two- (or more) tier screening may end up with equal accuracy at lower cost. My analogy here is screening for diseases — rather than have a super-accurate first-line test, you have a reasonably accurate first-line test, and if someone turns up suspicious on that, you invest the extra cost in a second, different test. One level of screening might be permitting transcriptional starts — another might be degrading nonsense RNA — a third might be translation — and so on.
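    To put rough numbers on this intuition (all invented for illustration; the sketch also assumes the cheap first-line test misses no true positives):

```python
# Expected-cost sketch of two-tier screening (invented numbers).
# Assumes the cheap first pass flags all true positives (sensitivity 1.0)
# plus a fraction of the negatives as false positives.
def one_tier_cost(n, cost_accurate):
    """Give everyone the accurate (expensive) test."""
    return n * cost_accurate

def two_tier_cost(n, prevalence, first_pass_fp_rate, cost_cheap, cost_accurate):
    """Cheap test for everyone; accurate test only for those flagged."""
    flagged = n * (prevalence + (1 - prevalence) * first_pass_fp_rate)
    return n * cost_cheap + flagged * cost_accurate

n = 100_000
print(one_tier_cost(n, cost_accurate=50))  # -> 5000000
print(two_tier_cost(n, prevalence=0.01, first_pass_fp_rate=0.05,
                    cost_cheap=2, cost_accurate=50))  # roughly 497,500
```

    With these made-up numbers the two-tier scheme reaches the same final call at about a tenth of the cost, which is exactly the “haggling over the price” point.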

    In other words, I would have been more surprised if nonsense DNA were not transcribed at all.

  2. … and I just re-read your final paragraph and realized you were saying the same thing as me, just more concisely. Sorry about that.

  3. Correct me if I’m wrong, but a huge chunk of microRNAs are encoded within intronic regions and processed from spliced products, which, regardless of the actual number, is very significant.

    Also, can you really call regions containing 3′ UTRs of genes “intragenic?”

  4. Ian, as you can see in my final paragraph, I totally agree with you!

    Nosugrefneb (Ben?): miRNAs are an interesting topic which I will cover in a separate post. I don’t know how many of them are intron-encoded, but I am sure that there are some. If I remember correctly, there were some reports that some genes have very short introns that can directly act as miRNAs. There are certainly also conventional miRNAs contained in intronic regions. In any case, I don’t see a problem here: I am not claiming that all introns and all intergenic regions are useless. There are several proven examples where this is clearly not the case, and I expect more to come. I am only opposing the current trend of “throwing out the baby with the bath water” by claiming that everything we once thought to be junk is now considered functional (-> the ENCODE project).

    And, no – one would not call a 3′-UTR region ‘intergenic’ (I guess this was your question). However, if a gene is assumed to have a short 3′-UTR, everything that follows is considered intergenic. If I am now claiming that the real UTR (or at least a variant UTR) is much longer, some of the formerly ‘intergenic’ region would now figure as a 3′-UTR. This is why I think that long 3′-UTRs can explain some of the observed transcription in intergenic regions.

  5. I’m not in the trade, so a more simple question: what role does redundancy play? I.e. excess genetic material that does the same as another piece but isn’t needed unless the other piece fails? Evolution, being a mindless thing, might happen because there are a lot of available sources of happenstance in the genetics.

  6. There are probably many analogies to this. Also in protein interactions there are likely many binding events that are “spurious”, i.e. that have a neutral or nearly neutral impact on fitness. As Ian says, it is inevitable that in any identification problem in the cell there will be mis-identifications. I would just add that some of this mis-identification can play a role in evolution by increasing the capacity to generate phenotypic diversity. In this sense, although it has no current role, a fraction of these spurious events can be useful for the species and therefore not junk :).
    See Wagner for a more detailed argument:
    Wagner, A. Robustness, Neutrality, and Evolvability. FEBS Lett. 579, 1772-1778 (2005)

  7. What role does redundancy play? I.e. excess genetic material that does the same as another piece but isn’t needed unless the other piece fails?

    Gene duplication is an important source of evolutionary change (there was just a paper about this, but I don’t have it at my fingertips and I have to get my kids ready for school in 3 minutes). However (and I think this applies to Pedro’s comment as well) I think it’s still “junk” for the individual at any one time. The fact that it’s useful for one’s great-to-the-nth offspring doesn’t mean it’s useful for you. Similarly, I don’t think it’s legitimate to look at the population/species rather than the individual, since the latter is either always, or almost always, the unit of selection.

    Obviously there are questions of definition. Remember that “junk” is not the technical term (do a PubMed search for “Junk DNA” and you’ll only find a handful of examples, most of which are deprecating the term) but rather is the press release and lazy-journalist term (again, mostly used to explain why it’s a bad term).

  8. Quick (pedantic!) note on terminology: “promoter” is different from “cis-regulatory module”. Most transcription factors in eukaryotic organisms do not directly promote transcription, but rather interact with cofactors that *then* promote transcription at a promoter. Promoters are associated with a transcription start site.

    Although your end point is entirely correct: there seem to be a *lot* of promoters and that’s probably the source of a lot of this transcription.

    Great blog post, btw!


  9. Kay, one of those 5-6 daily hits comes from me 🙂
    Coincidentally, I was just reading Gerstein et al.’s review “What is a gene, post-ENCODE? History and updated definition” [ ]!
    I am being reminded of my days in an IT company, where I was part of a team building a genome storage and analysis tool. We (somehow) made it work. Then the testers came up with bug lists and we started fixing them (patches), clients re(de)fined their requirements (more patches), we re(de)fined many functions (more patches)… in the end, it looked like a functional mess 🙂

  10. Animesh, interesting analogy. It seems that our intelligent designer (hah, suicyte goes ID!) also had to work for difficult customers, using unclear specifications that were changed several times. Most likely we are currently in beta test. 🙂

  11. I don’t know enough, but these sorts of assumptions are probably good enough for bacteria, assuming some sort of diffuse random medium where chaff just floats around.

    In higher eukaryotes, there is an estimate of as little as 30% water in the nucleus. That seems much more packed than one might expect. I mean, having even 10% of the DNA and proteins in the nucleus be random chaff seems a little too unorganized for my taste.

  12. […] Suicyte Note expresses an opinion that I share very much. It’s just an opinion, but one sorely overlooked. […]

  13. Kay, fantastic blog I really enjoyed reading this, great to see the community come together for a discussion.

    I have been scouring the literature regarding the now ‘hot topic’ of ncRNA, and especially anti-sense transcription in the genome, and must say that I believe there is more to all this than meets the eye. It might be the case that the transcription initiation machinery will take advantage of many DNA sequences to initiate some sort of transcription, whether weak or strong, depending on the availability of the DNA at any particular time. For example, once a gene is being actively transcribed, the first round is complete, and the chromatin has been remodelled to allow easy access of proteins to the DNA, any sequences that can be bound by the GTFs will be; and if there is a sufficient time span for initiation to occur and a stable elongation complex is generated, processive elongation will proceed. If the ‘cryptic’ elongation complexes are transcribing in the same direction as the promoter traffic, then cryptic transcripts will be generated, but evidence suggests that transcription from the gene will not be impeded. However, if transcription occurs on the anti-sense strand, as the literature shows must happen, then we can imagine that the interaction of the two elongation complexes (the ‘natural’ promoter-driven polymerase and the ‘cryptic’ antisense-transcribing polymerase) will interrupt transcription and thus gene expression. Perhaps this serves as a method to control expression from a particular coding region. So instead of RNAi-type interference from a nc antisense RNA, we may see a more mechanical disruption through an inability to transcribe past another polymerase, thus offering an indirect function for pervasive transcription.

    (Sorry if that was long-winded) and again I must congratulate you on this blog!
