Posted by: Kay at Suicyte | June 26, 2007

More thoughts on GO

In the last few days, I have written two posts on the Gene Ontology project (GO): the first, rather polemical one on the problems I encountered when using GO, and the second making some suggestions on how to improve it.

Now, I am having second thoughts on whether it was a good idea to criticize a resource that takes so much effort to create and maintain, but nevertheless is free for all. It is clearly a kind of work that is useful for many scientists, and also something most people (including myself) wouldn’t want to do on their own. So after all, we should be thankful that this project exists. Nevertheless, after pondering this question for the last day, I think that some amount of constructive criticism is warranted.

For one thing, there is no shortage of reports on how great GO is, and how many problems in biology it is going to solve. To give you just a selection, there are:

There is another, maybe more serious issue: I have seen an increasing number of papers describing new tools for things such as clustering or function prediction that actually use the GO annotations as a gold standard for benchmarking their methods. This is something that makes my hair stand on end.

It seems that I am not the only one concerned about GO quality. There are also a number of papers (free download) that deal with evaluating GO and correcting errors:

In particular, the latter paper finds the annotation error rate to be in the range of 28-30%. To my surprise, the authors call this error rate “reasonably low” – maybe my expectations are just too high?

Finally, here is a paper that makes suggestions on how to use expert systems for creating GO annotations: An evaluation of GO annotation retrieval for BioCreAtIvE and GOA. It is more or less the opposite of what I would like to see, but I guess this will be the future of GO.

I have to say that I expected to see some more comments, maybe even to get roasted by infuriated GO fans. I am not sure if the lack of feedback means that people tend to agree, or if nobody is interested in GO. One interesting comment from Jacob Frelinger pointed me to an online article by Clay Shirky, entitled “Ontology is Overrated: Categories, Links, and Tags”. While this text is a pleasure to read and has many interesting facets, it generally focuses on a different class of problems with ontologies. I think that the concept of an ontology is well suited to GO’s purpose, but it should be handled in a more quality-controlled manner.

P.S. I will write an email to the GO helpdesk and point them to the relevant entries in this blog – maybe that will provoke the toxic comments I am expecting.



  1. This was not “Warnock’s dilemma” in my case, as I’ve never significantly used GO. However, I was looking for a good introduction to this subject, so thank you for your references 🙂

  2. Like Pierre, I have not used GO in my papers so far (though most of my work to date has been on microarray data analysis). But looking at your 3 articles, it seems like it has to ‘GO away’ and be replaced with a similar expert system of a different design (are there alternatives to these DAGs? Can we design a better ontology where each term has only one parent?) and human expert curation.

  3. That last paper with the “28-30%” error rate is disingenuous. They estimate a (huge) error rate for ISS (Inferred from Sequence/Structural Similarity) associations, but don’t bother to find a single example of an error, nor to discuss the classes of errors.

    Probably the biggest assumption is that only associations with the exact same GO term are considered a match – and since most sequence-based associations are, by necessity, going to map to a more general function (higher up the DAG), it is not surprising that they don’t match exactly.
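The exact-match pitfall this comment describes is easy to demonstrate. Below is a toy sketch (the mini-DAG and the code are invented for illustration, not taken from the real ontology): a sequence-based prediction that lands on an ancestor of the curated term is scored as an outright error under exact matching, even though it is merely less specific.

```python
# Toy illustration (invented mini-DAG): exact-term matching vs.
# ancestor-aware matching when benchmarking predicted annotations.

# hypothetical is_a parent relationships
PARENTS = {
    "prolactin receptor binding": {"cytokine activity"},
    "cytokine activity": {"receptor binding"},
    "receptor binding": {"binding"},
    "binding": set(),
}

def ancestors(term):
    """All terms reachable by walking is_a links upward."""
    seen = set()
    stack = [term]
    while stack:
        for parent in PARENTS.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def exact_match(predicted, curated):
    return predicted == curated

def ancestor_aware_match(predicted, curated):
    # A prediction naming the curated term or any of its ancestors
    # is a less specific call, not a wrong one.
    return predicted == curated or predicted in ancestors(curated)

curated = "prolactin receptor binding"
predicted = "receptor binding"  # a typical, more general sequence-based call
print(exact_match(predicted, curated))           # False: counted as an error
print(ancestor_aware_match(predicted, curated))  # True: merely less specific
```

An evaluation that credits ancestor hits (or uses a semantic-similarity score over the DAG) would separate genuinely wrong annotations from merely shallow ones.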

  4. This is probably more applicable to your previous post on how to improve GO, but I feel that I need to comment about the assumption you are clearly making on how genes get GO annotations.

    Let me preface this by saying that I can really only comment on the functional classification (although I think that’s probably the most complex part).

    Essentially, there are two totally separate issues regarding GO.

    First is the ontology. This is maintained by the GO consortium, and new terms get requested/suggested by labs that are doing genome annotations. The organization of the ontology proper is maintained by the people hired by GO, but they get a lot of feedback from the community. As I mentioned, it’s mainly adding terms, but they also get regular suggestions for reorganization or clarification of terms. Unfortunately, it’s _much_ easier to add a term than to change one, because people have probably already started to use the existing term, so all you can do is deprecate it and add a reference to the new term(s) that replace it. Likewise, if you move a term, you have to do the same thing, because one of the guiding rules behind GO, for better or worse, is strict subsumption (i.e. every parent term’s functions must apply to all child terms); if you change the parent of a term, you are basically changing its identity, and thus have to create a new term. So you see the problem with changing the tree. That said, when you find a problem (e.g. with prolactin receptor binding being too broad a category for cytokines), let them know.

    Now, with regard to annotated genes, I’m going to go out on a limb here (since I’m not positive I’m right) and argue that virtually all, if not all, of the genes annotated with a particular GO term come from external sources and were not assigned by anyone at GO. The quality of the annotation is therefore directly dependent on who did the annotation, and different groups have wildly different standards for how they annotate. The larger sequencing centers have the advantage that they generally have people trained in assigning GO terms, but, on the other hand, they generally aren’t doing any bench work to verify what they’ve got. This means that we (as people sequencing new genomes) generally have to rely on groups like UniProt or BRENDA to do the literature searches to find experimentally characterized genes that we can then use as exemplars for annotating our new genes.

    For example, with regard to the human SOCS2 that you mentioned in your first post, the prolactin receptor binding annotation comes from UniProt. In addition, the source is listed as NAS (Non-traceable Author Statement), which is not really a gold standard when it comes to attribution. What this means is that whoever at UniProt was annotating the gene decided that it really _ought_ to be annotated with prolactin receptor binding, but couldn’t find any article that explicitly stated this.

    This brings up another point. The GO evidence code is vital to knowing how much to trust an annotation. However much the GO folk try to avoid outright stating it, the evidence codes are (to a certain extent) hierarchical. More to the point, IDA, IPI, IGI, IMP, and IEP imply that the function was determined by actual lab work. I’d be very suspicious of anything that’s only annotated by ISS or IEA, which may very well mean that it’s just a BLAST hit to a known protein (or, worse yet, a BLAST hit to a protein that was annotated based on another BLAST hit). Take a look at the GO documentation on evidence codes for an explanation of them.
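The evidence-code triage this comment recommends can be written down in a few lines. The experimental codes below are the real GO codes listed in the comment; the annotation records themselves are invented for illustration.

```python
# Sketch of evidence-code triage. The codes are real GO evidence codes;
# the annotation records are made up for illustration.

EXPERIMENTAL = {"IDA", "IPI", "IGI", "IMP", "IEP"}  # imply actual lab work

annotations = [
    ("SOCS2", "prolactin receptor binding", "NAS"),        # untraceable author statement
    ("SOCS2", "intracellular signal transduction", "IEA"), # electronic only
    ("SOCS1", "kinase inhibitor activity", "IDA"),         # direct assay
]

def trusted(annots, allowed=EXPERIMENTAL):
    """Keep only associations whose evidence code is in the allowed set."""
    return [(gene, term) for gene, term, code in annots if code in allowed]

print(trusted(annotations))  # [('SOCS1', 'kinase inhibitor activity')]
```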

  5. Alex, you raise a number of issues that I would like to comment on.

    1) I totally agree with your point that it is very difficult to change an existing GO term, in particular when it comes to re-arranging the hierarchy. That makes it all the more important to get it right the first time. For example, you mention the “strict subsumption (i.e. every parent term’s functions must apply to all child terms)”. This subsumption has clearly been violated in a number of cases (e.g. “cytokine activity” is a parent of “prolactin receptor binding”, although there are many proteins that bind to the prolactin receptor but are not cytokines). Another example, which I did not mention in my blog but in a private mail to the GO folks, is “negative regulation of apoptosis” as a grandparent of “negative regulation of anti-apoptosis”. It is obvious that not all negative regulators of anti-apoptosis are also negative regulators of apoptosis.

    2) I am aware that GO annotation is run by various institutions, not necessarily those involved in ontology creation. I consider this a disadvantage, as it is an invitation to pass off responsibility for bugs. For example, let us assume that there is a large discrepancy between the human and the murine annotation with regard to what is labeled a chemokine. Even from the perspective of a naive user (who doesn’t know what a chemokine is), it is obvious that there is a problem. But who is responsible? Did the mouse annotators label too few genes? Did the human annotators label too many? Is the ontology definition too ambiguous for making clear-cut decisions? Many users cannot judge that. When the web page of the GO consortium talks about annotation, it says “we”, so I guess that a typical user sees everything GO-related as the responsibility of the consortium.

    3) I am aware of the evidence codes but did not even touch that subject. In my post, I was more concerned with flaws of principle, not just single instances. It is possible that SOCS2 does not bind to the prolactin receptor; I don’t want to argue about that. But even if there were plenty of evidence that it does, the classification as “prolactin receptor binding” would nevertheless be problematic, as it automatically implies the parent label “cytokine activity”. This is the problem I am after.
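The propagation problem in this last point is mechanical: under GO's true-path rule, a gene annotated to a term implicitly carries every ancestor term, so one questionable is_a edge mislabels every gene below it. A minimal sketch (the two-parent toy DAG is invented, but the disputed edge is the one discussed in the post):

```python
# Toy sketch of the true-path rule: an annotation to a specific term
# implicitly carries every ancestor term. The DAG below is invented
# for illustration; the cytokine-activity edge is the disputed one.

PARENTS = {
    "prolactin receptor binding": {"cytokine activity", "receptor binding"},
    "cytokine activity": set(),
    "receptor binding": set(),
}

def implied_terms(term):
    """The term itself plus everything it implies via is_a links."""
    out = {term}
    frontier = [term]
    while frontier:
        for parent in PARENTS.get(frontier.pop(), ()):
            if parent not in out:
                out.add(parent)
                frontier.append(parent)
    return out

# Annotating SOCS2 only with the specific term...
implied = implied_terms("prolactin receptor binding")
# ...silently labels it a cytokine as well.
print("cytokine activity" in implied)  # True
```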

  6. Alex –
    There are two primary sources of GO associations: one is the “MODs”, or model organism databases, such as WormBase, FlyBase, SGD, etc. They are responsible for ALL annotations to “their” organisms.
    The second source is a project called GOA-UniProt, which is responsible for annotating “everything else” (i.e., all sequences in UniProt). They have the bulk of the annotations, but these are primarily “IEA”, or uncurated electronic annotations. Typically these come from an InterPro-to-GO mapping via InterPro HMMs. They are highly accurate, but not very specific (i.e., great at mapping something to “protein kinase activity”, but not to some complicated process).

    They all interact closely with the GO consortium.

  7. The IEA and ISS associations are quite accurate – they are just not very specific. The ISS associations are actually curated results – that is to say, someone published a paper stating that gene A and gene B share the same function, based on sequence (or structural) similarity. IEAs are mostly from InterPro, so they are no better or worse than your typical HMM. The InterPro-to-GO mappings are curated by humans.

  8. Zenbitz –
    Thanks for your comments. Very much appreciated. The fact that GOA and the species-specific annotations have a big overlap in annotated genes offers a useful opportunity to compare annotation quality. As I mentioned in my post, this also extends to comparing annotations in closely related species, where orthology detection is not a big problem. I still maintain that major discrepancies between these annotations should be a reason for concern (and should spawn some action by the GO initiative).

    Including the GO–InterPro mappings might be a good idea, but for us (and many others doing gene set enrichment analyses) it is mostly useless or even detrimental. We routinely analyse enrichment not only of GO terms but also of InterPro domains and other properties; having all InterPro mappings also be part of GO makes matters more complicated. Also, we typically do not use all of InterPro, but only the component databases that are useful.
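For what it's worth, the gene set enrichment analyses mentioned in this comment usually boil down to a one-sided hypergeometric test per term. A self-contained sketch (all counts invented for illustration):

```python
# Sketch of a one-sided hypergeometric enrichment test for a single
# term; all counts are invented. Pure standard library.
from math import comb

def hypergeom_pvalue(k, n, K, N):
    """P(X >= k): k annotated genes in a hit list of n, with K of the
    N genes in the background carrying the term."""
    return sum(
        comb(K, i) * comb(N - K, n - i) for i in range(k, min(n, K) + 1)
    ) / comb(N, n)

# 8 of 20 hit genes carry the term; 50 of 1000 background genes do.
p = hypergeom_pvalue(8, 20, 50, 1000)
print(p < 0.001)  # True: far more hits than the ~1 expected by chance
```

In practice one tests many terms and corrects for multiple testing, and annotations are first propagated up the DAG, so every gene annotated to a child term also counts toward each ancestor term.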
