Posted by: Kay at Suicyte | August 19, 2007

Why genome projects make more sense than structural genomics

What is the main difference between a genome sequencing project and the structural genomics initiative? Well, one deals with sequences and the other with structures, but this is not what I have in mind. For me, the fundamental difference is that a genome sequencing project has a point where you can call it ‘finished’, while this is not really true for structural genomics (and a number of other large-scale efforts). A very important aspect here is that a project in the ‘finished’ state is very different from one that is 99% finished. Only a genuinely complete genome sequence allows conclusions about what is not present in the genome. For many applications, these conclusions can be as important as knowing what is present.

This issue is obviously important if you are interested in biological pathways: for many organisms, in particular microbes, we know much more about the metabolic pathways used from looking at their genome sequence than from directly studying the organism. In a strict sense, inferring pathways from genome sequence is a mere prediction, but typically a very strong one, which is unlikely to be overturned by direct biochemical analysis. Everybody working in this field knows that for predicting the main metabolic reactions of an unknown organism, it is not only important to know which enzymes are there, but also which ones are not.

A second area where knowledge of absent genes can be crucial is close to my own research area: the prediction of the function of unknown gene products. One very promising approach compares the presence and absence of genes (or rather: homology groups) over a wide range of different organisms. The idea is to search for characterized genes that have the same phyletic distribution, i.e. are always present in organisms that also contain the gene of interest, but are absent in all the others. This conserved co-occurrence is interpreted as suggesting that the genes work in a common pathway, or at least a common biological process. This “phylogenetic profiling” method was first formally published in 1999, but had certainly been applied by many researchers (including myself) long before that date. The more organisms have their genomes completely sequenced (and the more diverse they are), the more useful this method will become. Thus, I am convinced that the best days of phylogenetic profiling are still to come. Obviously, incompletely sequenced genomes are almost useless for this application.
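To make the idea concrete, here is a minimal sketch of the comparison step in Python. Everything in it is invented for illustration: the genome names, the gene names, the presence/absence table, and the helper functions `profile`, `mismatches` and `co_occurring`. It assumes every genome in the table is complete, so that a missing entry really means "absent".

```python
# Toy presence/absence table: homology group -> set of genomes that contain it.
# All names and data are hypothetical and only serve to illustrate the idea.
GENOMES = ["org_A", "org_B", "org_C", "org_D", "org_E"]

PRESENCE = {
    "unknown_gene": {"org_A", "org_C", "org_D"},
    "char_gene_1": {"org_A", "org_C", "org_D"},  # same distribution as the query
    "char_gene_2": {"org_A", "org_B", "org_C", "org_D", "org_E"},  # ubiquitous
    "char_gene_3": {"org_B", "org_E"},  # exact complement of the query
    "char_gene_4": {"org_A", "org_C"},  # differs from the query in one genome
}


def profile(gene):
    """Return the phyletic profile of a gene as a tuple of booleans."""
    return tuple(genome in PRESENCE[gene] for genome in GENOMES)


def mismatches(p, q):
    """Count the genomes in which two profiles disagree (Hamming distance)."""
    return sum(a != b for a, b in zip(p, q))


def co_occurring(query, max_mismatches=0):
    """List genes whose profile differs from the query in at most
    max_mismatches genomes, best matches first."""
    query_profile = profile(query)
    hits = []
    for gene in PRESENCE:
        if gene == query:
            continue
        distance = mismatches(query_profile, profile(gene))
        if distance <= max_mismatches:
            hits.append((gene, distance))
    return sorted(hits, key=lambda hit: (hit[1], hit[0]))


if __name__ == "__main__":
    print(co_occurring("unknown_gene"))
    print(co_occurring("unknown_gene", max_mismatches=1))
```

Note that the absences carry as much weight as the presences: the ubiquitous `char_gene_2` is found wherever the query is found, but it is rejected because it also occurs where the query does not. With an unfinished genome, a "missing" gene could simply be unsequenced, and each such entry would have to be treated as unknown rather than absent.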

There are probably many more applications that make use of completely sequenced genomes. I haven’t even talked about applications of other complete high-throughput approaches – this will have to wait for more of those approaches to become available. It would certainly be interesting to analyze sets of complete protein interaction data, which would also give you reliable information on which proteins do not interact. I am not sure if this will ever be feasible, though.

Structural genomics, when applied to all organisms, is very different. It cannot be expected to ever reach a point of completion, at least not a reliable one. One could envisage a complete set of structures for selected organisms, though. But even in this case, it is hard to imagine what useful information can be gleaned from a complete structural complement (what do they call it? structureome?). I have seen papers in the context of structural genomics that talk about the percentage of different folds being used by different organisms. To me, this looks a bit like collecting stamps. Is it really useful to know that organism X doesn’t have a single instance of the beta-grasp fold? Well, maybe I am just ignorant and don’t understand the important questions in structural biology. So, if you have an idea, please let me know.

At the moment, I would rather support large-scale projects that really benefit from being complete, i.e. where 100% completeness is worth much more than twice the value of 50% completeness. Please, don’t get me wrong: I am not saying that the structural genomics initiative is not useful! It has produced a lot of interesting structures, and even I (as a sequence- rather than a structure-person) have profited amply from the structural data that has been generated.

P.S. This topic has concerned me for a while, but this post was prompted by the recent claim that the Cyanidioschyzon merolae genome project has produced the first complete eukaryotic genome sequence. This rather surprising claim has also been discussed in Steven Salzberg’s blog.



  1. I agree… Re structural genomics, have you seen



  2. I agree that complete sets are more informative than incomplete ones. Having worked with protein-interaction networks during my PhD, I know the difficulties of not having a handle on current coverage and not having proper negative sets. Still, there is a lot we can learn from incomplete sets. In the case of structure, we can learn a lot about the function of proteins from their structure.

    Structural genomics is also different from genome sequencing in the sense that you can infer structure from sequence by homology modelling. So full coverage for most species can be obtained by targeting sequences that are not similar to already solved structures. For my particular interest of predicting protein interactions, the problem is more that these programs aim to cover folds and not complexes. Since the same proteins can use different surfaces to interact with different partners, the folds are only the first step. We are not good enough at docking yet, so to know what the interaction surface between two folds looks like we still need to solve the complexes. (see papers by Aloy and Russell for more info)

  3. Pedro is right. You can tell a lot from just elucidating structures that essentially cover the missing structural space required to get you into the regime where you can start building structural models. Of course function is not determined by global structure, so there is still a lot that needs to be done towards improved sidechain prediction, and accurate prediction of molecular recognition properties. The PSI never intended to solve every structure, just get enough coverage to do the rest via homology modeling.

    Structural elucidation is also very hard. It’s a reason companies that were dedicated to structural elucidation either didn’t succeed or went a different route, e.g. co-crystallization of compounds with a particular protein rapidly for trying to identify better lead candidates. The protein expression part was always a huge bottleneck.

    So while genomic sequencing projects make a lot of sense as you state, I am not sure that structural genomics is that bad of an idea, although right now we need better modeling methods (homology modeling, protein interaction modeling, docking, etc).

  4. You can compare structural genomics to the high-throughput protein-protein interaction experiments: in both cases the data are somewhat incomplete and not very detailed (in p-p interaction experiments you don’t know the molecular details of the interaction, as Pedro pointed out, and you don’t even know which domain of a multidomain protein is involved in that interaction). But still, that doesn’t make either of them unnecessary or unusable.

    I’d like to qualify the goal of the PSI: they may state that they aim to sample the structural space well enough to solve the rest by homology modelling. What is not clearly said is that the sampling is limited to cytoplasmic proteins, or those which crystallize easily. Proteins that are difficult to crystallize (requiring non-standard conditions or other special treatment) are just skipped – for example, all beta-barrels are out of reach. And I don’t think they will approach membrane proteins anytime soon.

    In general, I think all high-throughput experiments should be treated equally – I’m very happy with more sequence data (including environmental data, which are extremely useful), more structural data, more interaction data, etc. Treating a protein only as a sequence, or a structure, or an interaction machine makes the view very limited, doesn’t it?

  5. First, thank you for your comments! From reading what you say, I see that I didn’t make myself clear enough. The idea of my post was not to say that structural genomics is bad or even useless. I tried to say that (I even used an exclamation mark!) but maybe it was too far down my post and nobody noticed.
    The main purpose of my post was to draw attention to the fact that in some large-scale efforts there can be a big difference between ‘almost complete’ and ‘complete’, because only the latter situation allows conclusions to be drawn that rely on the proven absence of particular features. I mentioned the PSI as one large-scale effort that would not really profit from ‘completion’. I see no point in a herculean effort to finish every single structure, because in this case, 100% is only 5% better than 95%. Again, structural genomics is useful, ok?
    Now, some specifics: Titus, I haven’t read the Petsko article, as I don’t have access to that journal. Knowing what Gregory writes most of the time, he probably complains about the PSI taking too much grant money away from small-scale structural work.
    Pedro, I am actually familiar with the Aloy/Russell work, as I am also interested in protein interactions. On another note, I am not really convinced that modelled structures will be all that useful, but time will show.
    Deepak, yes, I am roughly familiar with what the PSI wants to do. As stated above, I think it is useful but I am also convinced that it will not live up to all of the promises made at the outset. Again, time will tell.
    freesci, I understand that it is a pity that the PSI will not touch membrane proteins (particularly for those of us who are working with those proteins), but these large-scale projects live on numbers (of solved structures) and I can understand that they leave the difficult cases to the ‘experts’. This is similar to what the genome projects do: many of them also don’t care much about centromeric, telomeric and other gene-poor regions.

  6. Kay, I was probably swayed by the title 😉

    I know pretty well what PSI centers are aiming for (I worked in one of these for a couple of months). My point was that “complete” in the case of structural genomics is unreachable, since they do not even try to accomplish that – even in terms of sampling.

    Anyway, your post makes me wonder: what is missing in the “almost complete” genomes? Only gene-poor regions or is there anything else?

  7. As the says, it seems that genomes rich in repetitive sequence are difficult to complete. From a computational point of view, it is difficult to determine the order in which reads should be joined if their ends are similar. E.g., if shotgun sequencing yields AATGCGTAA, AACCCGCTAA and AAGTCGCGCTAA, and we use the first and last 2 nucleotides as overlaps, we cannot tell whether the original fragment from which these reads derive is AATGCGTAACCCGCTAAGTCGCGCTAA, AATGCGTAAGTCGCGCTAACCCGCTAA, AACCCGCTAAGTCGCGCTAATGCGTAA, AAGTCGCGCTAACCCGCTAATGCGTAA, AACCCGCTAATGCGTAAGTCGCGCTAA or AAGTCGCGCTAATGCGTAACCCGCTAA unless we do site-directed sequence walks.


  8. I’m tired of reading this guy’s Genome Biology articles. “Filling the fold catalog might be of interest to bioinformaticists, but why should they drive the science that others do?” Well, because research is data-driven and bioinformaticians are skilled in handling data, perhaps. He has another piece entitled Jumping the Shark in a similar vein – that anything “-omics” is stamp collecting, not hypothesis-driven science. In short, he’s very narrow-minded and ignorant about the power of data mining. And not as funny as he thinks. Like I want to read 5 paragraphs about classic US TV shows in a genomics editorial.

    Also “the stated aim of Structural Genomics is determination of the three-dimensional structures of all proteins” is simply not true. Many SG projects focus on subsets of proteins: those with no known structural homolog, or from a particular model organism/system, or with some biomedical relevance. I agree that SG completeness differs in character from sequence completeness but really, we’re talking about two entirely different things.

    There are also times with data when you have to take what you can get. The draft genome sequence of a bacterium isn’t much use for phylogeny, for instance, but can give you a good idea of its potential metabolic capabilities.

  9. Kay,

    I think at this point we can pretty much agree that the original goals for the PSI were overoptimistic. I think if they get 80% of the way there, one could consider things a success and, similar to the genome projects, that would only be the start.

  10. Come on Neil, I am sure you meet such guys everyday… so far myopia is a treatable condition but retinal damage is not.
