I have noticed that during the last few weeks, after a transient period of high activity, my blog readership is on its way down to the 5-6 hits per day that I know from the first two months. Thus, it is probably time to do something about it, which means: blog about a highly controversial topic. My pick for today is an old favorite of mine: junk DNA. Without any quotes. And, while we are at it, I’ll do alternative splicing, too.
In a way this post is related to Jonathan Eisen’s recent rant against “adaptationomics”, which he defines as a vice common among genomics researchers: the finding of something weird in the genome leads to the (unwarranted) conclusion that this particular feature must confer a selective advantage. I don’t have a fancy name for the vice I am referring to; maybe you can come up with something suitable. I am also talking about an unwarranted conclusion, starting from the observation that something exists and leading to the claim that it must be useful for something.
To avoid being too vague here, let us just focus on two examples. First, the observation that a lot of intronic and intergenic DNA is actually transcribed (“pervasive transcription”) is widely interpreted as proof, or at least a strong hint, that these DNA regions are not just junk but probably do something important. The second example is alternative splicing. The fact that for many genes more than one transcript variant exists (arising from alternative use of exons) is widely interpreted as biologically very important and is even considered by many to be a major factor contributing to the complexity of the so-called “higher eukaryotes”.
I have no problem admitting that there are some important things hidden within the introns and intergenic regions. However, unless I am provided with some solid piece of evidence, I assume this to be the exception rather than the rule. First of all, I like to start from the null hypothesis that anything that exists is just random noise. In my opinion, the burden of proof is on those who claim a function or importance. In the case of non-coding DNA, most people will cite the observation that what we once considered to be junk DNA is actually transcribed. I find this argument rather unconvincing. A while ago we had only junk DNA; now we also have junk transcripts. So what? I can’t remember exactly what percentage of the human genome is claimed to be transcribed, but here is my alternative explanation for pervasive transcription.
First, there is no reason to be surprised by transcription in the introns. They must be transcribed as part of the pre-mRNA, which is then spliced into mature mRNA. According to the traditional model, intronic remnants of pre-mRNA are rapidly degraded and only mature mRNA is exported to the cytoplasm. Nevertheless, if your detection method is sensitive enough, you can expect to find lots of intronic RNA – it must be present in the nucleus, but we also know that splicing fidelity (see below) and quality control in the export mechanism are far from perfect. Thus, it should not come as a big surprise to detect intronic RNA in the cytoplasm as well.
Transcription in the intergenic regions is not so easy to explain, but there are at least two contributing mechanisms that do not require any functional importance of junk DNA. Several genes are known to have ridiculously long 3′ UTR regions. Only for a few genes is this property well documented, but I have seen lots of examples during the long years when I had to manually analyze SAGE tag assignments. Even if genes have a perfectly viable polyadenylation signal close to the stop codon, the fidelity of transcription termination is not perfect and you frequently see alternative polyA sites at far downstream positions. My experience with analyzing this kind of data pushes me to the somewhat heretical suggestion that a large portion of intergenic transcription is due to very long 3′-UTR tails. I don’t think that these long tails are generally important – some of them might be, but in most cases it is just a polyA signal that has been missed.
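Just to make the read-through argument a bit more concrete, here is a toy sketch (entirely made-up random sequence, uniform base composition assumed, and only the canonical AATAAA hexamer plus its common ATTAAA variant considered): a hexamer like AATAAA occurs by chance roughly once every 4 kb, so once transcription has read through a weak proximal signal, it keeps running into plausible-looking ‘alternative’ polyA sites further downstream.

```python
# Toy sketch: how many chance polyA signals lurk downstream of a gene?
# The sequence is random, standing in for real genomic DNA.
import re
import random

random.seed(1)
downstream = "".join(random.choice("ACGT") for _ in range(20000))  # 20 kb of 'downstream' sequence

hexamers = ("AATAAA", "ATTAAA")  # canonical polyA signal plus its most common variant
hits = sorted(m.start() for h in hexamers for m in re.finditer(h, downstream))

# Under uniform base composition each hexamer is expected about once per 4^6 = 4096 bp,
# so roughly ten chance 'signals' in 20 kb.
print(f"{len(hits)} chance polyA signals in 20 kb of downstream sequence")
print("first few positions:", hits[:5])
```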
In addition, there is another – more speculative – mechanism in place, which might contribute to measurable quantities of intergenic RNA: spurious promoters. Everybody who has tried to predict promoters from genomic sequences (without using homology data!) knows that this is not an easy task. Many of the recognition sites for transcription factors are too short to be significant. As a consequence, many consensus TF binding sites can be expected to occur far from established or predicted promoters. These spurious sites are thought to be inactive, as most meaningfully transcribed genes contain a combination of multiple TF binding sites in their promoter region. On the other hand, who guarantees that none of these spurious sites is able to initiate a certain level of spurious transcription? It might not reach the transcription levels of ‘proper’ genes, but may be sufficient to be detectable by tiling microarrays. To avoid any misunderstanding: I am not suggesting that the presence of a TF binding site far away from any known gene implies that there is something that needs to be transcribed. I am talking about purely random TF binding sites with consensus sequences so degenerate that you would expect thousands of copies throughout the genome.
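To give a feeling for the numbers, here is my own back-of-the-envelope sketch (uniform base composition assumed, one strand counted, and the two example consensi are merely illustrative AP-1-like and E-box-like patterns, not an analysis of any real promoter set): even a seven-base consensus with a single degenerate position is expected to occur hundreds of thousands of times in a 3 Gb genome by chance alone.

```python
# Back-of-the-envelope sketch: expected chance matches to a degenerate IUPAC
# consensus in ~3 Gb of sequence, assuming uniform base composition and one strand.
IUPAC_CHOICES = {"A": 1, "C": 1, "G": 1, "T": 1, "R": 2, "Y": 2, "S": 2, "W": 2,
                 "K": 2, "M": 2, "B": 3, "D": 3, "H": 3, "V": 3, "N": 4}

def expected_matches(consensus, genome_size=3.0e9):
    """Expected chance occurrences of an IUPAC consensus under uniform base frequencies."""
    p = 1.0
    for base in consensus:
        p *= IUPAC_CHOICES[base] / 4.0  # probability that a random base satisfies this position
    return p * genome_size

# AP-1-like 7-mer with one 2-fold degenerate position: ~3.7e5 expected chance hits
print(f"TGASTCA: {expected_matches('TGASTCA'):.2e}")
# E-box-like 6-mer with two fully degenerate positions: ~1.2e7 expected chance hits
print(f"CANNTG:  {expected_matches('CANNTG'):.2e}")
```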
I don’t have solid data on the prevalence of either mechanism, but I would envisage that a combination of all mentioned ‘infidelity’ events could account for a majority of the pervasive transcription that has been reported – without requiring that any of it is ‘functional’ or important in any other way.
There are a few cases where alternative use of exons, leading to alternatively spliced transcripts, is known to be of crucial importance for an organism. Notable examples are the neurexins, which exist in hundreds of different splice forms, each one having a different recognition specificity. The analysis of EST data, and more recently also of exon-array data, has led to the identification of a large body of alternative transcripts. Nowadays, alternative splicing is considered to be the rule rather than the exception. And obviously, the existence of alternative transcripts is often connected to claims of great biological importance.
Again, I am not convinced. Have you ever tried to predict a splicing pattern for a gene from the genomic sequence alone? Without resorting to homology information? Without using coding potential? I find this very hard to do (and I consider myself more intelligent than your average spliceosome). Unfortunately, the human splicing machinery has no access to the homologous mouse sequence, nor can it read ORF information or even calculate hidden Markov models. Even the best gene prediction programs (which do have access to these external resources) make lots of mistakes, and this is what I also expect from the biological splicing machinery. I consider a large proportion of what appears to be ‘alternative splicing’ to be mere ‘mis-splicing’ events. The idea here is that cells can tolerate a lot of mis-splicing, as long as enough of the proper transcript is made and none of the ‘alternative’ transcripts is harmful. Many of the mis-spliced transcripts will be short-lived anyway, due to NMD and other processes. In other cases, the encoded proteins might be short-lived, and in yet other cases the ‘alternative’ protein version just doesn’t hurt. In my alternative model, I would expect perfect splicing fidelity only in cases where the alternatives are clearly detrimental.
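To illustrate why I expect the splicing machinery to make mistakes, here is a minimal sketch (a random 100 kb ‘gene’ standing in for a real locus, uniform base composition assumed): under the minimal GT…AG consensus, practically every GT and AG dinucleotide is a candidate donor or acceptor, so the spliceosome faces thousands of decoys per gene against a handful of real splice sites.

```python
# Minimal sketch: how many candidate splice sites does a typical gene-sized
# region contain, compared with the few that are actually used?
import random

random.seed(2)
gene = "".join(random.choice("ACGT") for _ in range(100_000))  # stand-in for a real locus

donors = sum(1 for i in range(len(gene) - 1) if gene[i:i + 2] == "GT")
acceptors = sum(1 for i in range(len(gene) - 1) if gene[i:i + 2] == "AG")

print(f"candidate donor dinucleotides (GT): {donors}")        # roughly 100000/16 ≈ 6250
print(f"candidate acceptor dinucleotides (AG): {acceptors}")  # same order of magnitude
# A gene with ~10 exons actually uses only ~9 donors and ~9 acceptors; even a
# 99.9% rejection rate for the decoys would still let a few spurious sites through.
```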
This discussion leaves one important question unanswered: how can we tell the meaningful alternative splicing events from the mis-splicing chaff? I am afraid this can be very difficult. It clearly helps if an alternative splicing mode is conserved over a broad range of organisms. However, this approach fails if a majority of all possible exon combinations are observable (we need a splicing HapMap!) – in this case, conservation of some of them across multiple species is to be expected even under the assumption of a purely random process.
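To illustrate that last point, here is a quick sanity-check simulation (a toy model with made-up numbers – the exon count, detection probability and species number are all invented for illustration, not taken from any real dataset): if most exon combinations of a gene are detectable in any given species simply because splicing is noisy, a fair fraction of them will look ‘conserved’ across several species purely by chance.

```python
# Toy simulation: apparent multi-species 'conservation' of isoforms under a
# purely random model of noisy splicing.
import random

random.seed(3)
n_combinations = 64   # e.g. 6 independent cassette exons -> 2**6 possible isoforms
p_observed = 0.6      # assumed chance of detecting any given isoform in one species
n_species = 3

conserved = sum(
    1 for _ in range(n_combinations)
    if all(random.random() < p_observed for _ in range(n_species))
)

print(f"{conserved} of {n_combinations} isoforms appear 'conserved' across "
      f"{n_species} species by chance alone")
# Analytically: 64 * 0.6**3 ≈ 14 – conservation alone does not prove function here.
```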
I have proposed a few rather boring explanations for observations that are generally considered to be exciting. The underlying assumption is that nature is not perfect, and a lot of naturally occurring biosynthetic processes have only the degree of fidelity that is absolutely required. Even in situations where high fidelity is crucial, it is not guaranteed that this fidelity resides in the biosynthetic process itself. We know from engineering (think silicon chips or LCD panels) that a 100% perfect product sometimes requires a prohibitively expensive production process. It is often preferable to go for 90% perfection and let a quality control step take care of the remaining 10%. I would guess that nature works like this, too. We already know some of the QC steps, and probably more are waiting to be discovered. In any case, we should not be too surprised if we encounter something weird in the cell. Granted, it could be something interesting, but maybe it is just a piece of junk that hasn’t been removed yet by QC.
I have just found a blog posting at Genomicron that gives a nice historical overview of junk DNA and discusses some pros and cons. Highly recommended!