What is the main difference between a genome sequencing project and the structural genomics initiative? Well, one deals with sequences and the other with structures, but this is not what I have in mind. For me, the fundamental difference is that a genome sequencing project has a point where you can call it ‘finished’, while this is not really true for structural genomics (and a number of other large-scale efforts). A very important aspect here is that a project in the ‘finished’ state is very different from one that is 99% finished. Only a genuinely complete genome sequence allows conclusions about what is not present in the genome. For many applications, these conclusions can be just as important as knowing what is present.
This issue is obviously important if you are interested in biological pathways: for many organisms, in particular microbes, we know much more about their metabolic pathways from looking at the genome sequence than from directly studying the organism. In a strict sense, inferring pathways from a genome sequence is a mere prediction, but typically a very strong one, which is unlikely to be overturned by direct biochemical analysis. Everybody working in this field knows that for predicting the main metabolic reactions of an unknown organism, it is important to know not only which enzymes are there, but also which ones are not.
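To make the point about absent enzymes concrete, here is a minimal sketch of how a pathway call depends on genome completeness. Everything in it is hypothetical: the function name `pathway_status`, the enzyme labels, and the three-way outcome are my own illustration, not an established tool or annotation scheme.

```python
from typing import Iterable, List, Tuple


def pathway_status(
    required: Iterable[str],
    found: Iterable[str],
    genome_complete: bool,
) -> Tuple[str, List[str]]:
    """Classify a pathway from the enzymes detected in a genome.

    required: enzymes the pathway needs (hypothetical labels).
    found: enzymes detected in the genome sequence.
    genome_complete: True only for a genuinely finished genome.
    """
    missing = sorted(set(required) - set(found))
    if not missing:
        return "present", missing
    if genome_complete:
        # Nothing is hidden in unsequenced regions, so a missing
        # enzyme supports the conclusion that the pathway is absent.
        return "absent", missing
    # In a draft genome the enzyme may sit in a gap: no conclusion.
    return "undetermined", missing


# Hypothetical three-step pathway and a genome lacking one step.
pathway = ["enzyme_1", "enzyme_2", "enzyme_3"]
detected = ["enzyme_1", "enzyme_3", "enzyme_9"]

print(pathway_status(pathway, detected, genome_complete=True))
print(pathway_status(pathway, detected, genome_complete=False))
```

The same missing enzyme gives a usable negative result for the finished genome and nothing at all for the draft, which is exactly the difference between 100% and 99%.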
A second area where knowledge of absent genes can be crucial is close to my own research area: the prediction of the function of unknown gene products. One very promising approach compares the presence and absence of genes (or rather: homology groups) over a wide range of different organisms. The idea is to search for characterized genes that have the same phyletic distribution, i.e. are always present in organisms that also contain the gene of interest, but are absent in all the others. This conserved co-occurrence is interpreted as suggesting that the genes work in a common pathway, or at least a common biological process. This “phylogenetic profiling” method was first formally published in 1999, but had certainly been applied by many researchers (including myself) long before that date. The more organisms have their genomes completely sequenced (and the more diverse they are), the more useful this method will become. Thus, I am convinced that the best days of phylogenetic profiling are still to come. Obviously, incompletely sequenced genomes are almost useless for this application.
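The core of phylogenetic profiling can be sketched in a few lines. This is only a toy illustration under my own assumptions: the presence/absence table, the gene and genome names, and the helper `co_occurring` are invented, and real implementations work on homology groups and use more careful similarity measures than a plain mismatch count.

```python
from typing import Dict, List, Tuple

# Hypothetical presence/absence table: one 0/1 entry per genome,
# in the order given by GENOMES. All names are made up.
GENOMES = ["org_A", "org_B", "org_C", "org_D", "org_E"]
PROFILES: Dict[str, Tuple[int, ...]] = {
    "unknown_gene": (1, 0, 1, 1, 0),
    "enzyme_1": (1, 0, 1, 1, 0),
    "enzyme_2": (1, 1, 1, 1, 1),
    "enzyme_3": (0, 1, 0, 0, 1),
    "enzyme_4": (1, 0, 1, 0, 0),
}


def mismatches(a: Tuple[int, ...], b: Tuple[int, ...]) -> int:
    """Count genomes where one gene is present and the other absent."""
    return sum(x != y for x, y in zip(a, b))


def co_occurring(
    query: str,
    profiles: Dict[str, Tuple[int, ...]],
    max_mismatches: int = 0,
) -> List[Tuple[str, int]]:
    """Return genes whose phyletic profile matches that of `query`."""
    target = profiles[query]
    hits = []
    for gene, profile in profiles.items():
        if gene == query:
            continue
        distance = mismatches(target, profile)
        if distance <= max_mismatches:
            hits.append((gene, distance))
    # Best matches first, ties broken alphabetically.
    return sorted(hits, key=lambda hit: (hit[1], hit[0]))


print(co_occurring("unknown_gene", PROFILES))
print(co_occurring("unknown_gene", PROFILES, max_mismatches=1))
```

Note that every 0 in the table is treated as a reliable absence. In an incompletely sequenced genome a 0 may just mean "not sequenced yet", and a few such false absences are enough to break a perfect match.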
There are probably many more applications that make use of completely sequenced genomes. I haven’t even talked about applications of other complete high-throughput approaches; this will have to wait for more of those approaches to become available. It would certainly be interesting to analyze sets of complete protein interaction data, which would also give you reliable information on which proteins do not interact. I am not sure if this will ever be feasible, though.
Structural genomics, when applied to all organisms, is very different. It cannot be expected to ever reach a point of completion, at least not a reliable one. One could envisage a complete set of structures for selected organisms, though. But even in this case, it is hard to imagine what useful information can be gleaned from a complete structural complement (what do they call it? structureome?). I have seen papers in the context of structural genomics that talk about the percentage of different folds being used by different organisms. To me, this looks a bit like collecting stamps. Is it really useful to know that organism X doesn’t have a single instance of the beta-grasp fold? Well, maybe I am just ignorant and don’t understand the important questions in structural biology. So, if you have an idea, please let me know.
At the moment, I would rather support large-scale projects that really benefit from being complete, i.e. where 100% completeness is worth much more than twice the value of 50% completeness. Please don’t get me wrong: I am not saying that the structural genomics initiative is not useful! It has produced a lot of interesting structures, and even I (as a sequence- rather than a structure-person) have profited amply from the structural data that has been generated.
P.S. This topic has concerned me for a while, but this post was prompted by the recent claim that the Cyanidioschyzon merolae genome project has produced the first complete eukaryotic genome sequence. This rather surprising claim has also been discussed on Steven Salzberg’s blog.