Please excuse the rude title of this post – it is crafted after a number of recent blog postings on the shortcomings of Wikipedia. I myself was a very late adopter of Wikipedia and had never expected much from this resource – whenever I use it now, I am pleasantly surprised how useful it is, even for most scientific topics. Obviously, you cannot believe everything you find in Wikipedia, and I am quite concerned about the growing number of young students who apparently learn biochemistry from Wikipedia rather than from textbooks – this is certainly not the way to go. However, for a quick look-up of the formula of orlistat (tetrahydrolipstatin, “alli”) and stuff like that, Wikipedia is much more convenient than any other resource I know of. And, as has been said before, nothing beats Wikipedia if you want to know the difference between a ‘mad scientist’ and an ‘evil genius’.
Now, let me focus on a very different kind of resource: Gene Ontology (GO). Like Wikipedia, GO is free and has the potential to be extremely useful (but for a very restricted audience). Unlike Wikipedia, it has been created and is maintained by a number of domain experts (who are paid for doing it). Nowadays, I use both GO and Wikipedia frequently, and fully appreciate the amount of work required for their creation. However, if I had to single out one resource that can drive me up the wall, this would certainly be GO, not Wikipedia.
For those of you who are not familiar with GO, here is a brief explanation. First, let the GO folks have their say:
The Gene Ontology (GO) project is a collaborative effort to address the need for consistent descriptions of gene products in different databases. The GO collaborators are developing three structured, controlled vocabularies (ontologies) that describe gene products in terms of their associated biological processes, cellular components and molecular functions in a species-independent manner.
To put it bluntly: The GO consortium provides a number of categories, which can be used for labeling the gene products of all species. The categories have meaningful names and are connected to each other in a type of hierarchical arrangement called a DAG (for directed acyclic graph). This means that each category can have multiple subcategories and can be part of multiple super-categories. It also means that when following such a path from super-category to category to subcategory, there is no danger of running in circles, i.e. reaching a point that you have visited before.
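To make the DAG idea a bit more concrete, here is a minimal Python sketch. The term names and the `PARENTS` mapping are a hand-made toy example and are not copied from the real ontology; the helper names `ancestors` and `is_acyclic` are likewise my own. The only point is that one category can have several super-categories, and that walking upwards never brings you back to where you started.

```python
# Toy example only: these parent links are made up for illustration
# and are not taken from the real GO files.
PARENTS = {
    "molecular function": [],
    "binding": ["molecular function"],
    "receptor binding": ["binding"],
    "cytokine activity": ["receptor binding"],
    "growth factor activity": ["receptor binding"],
    # A category with two super-categories, which a plain tree could not express.
    "hypothetical term X": ["cytokine activity", "growth factor activity"],
}


def ancestors(term, parents=PARENTS):
    """Return every super-category reachable from `term`."""
    seen = set()
    stack = list(parents[term])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen


def is_acyclic(parents=PARENTS):
    """True if no term is its own ancestor, i.e. the graph is a DAG."""
    return all(term not in ancestors(term, parents) for term in parents)
```

Calling `ancestors("hypothetical term X")` collects both parents plus everything above them, and `is_acyclic()` confirms that no path in this toy graph runs in circles.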
I have no doubt that the DAG structure of GO makes a lot of sense and is quite useful. Most of the categories and connections are perfectly sensible. In any case, it is not so much the structure and connection pattern of the GO categories that I am worried about, but rather the way the categories are used to label the actual gene products. This being said, there are a few categories that are less intuitive than others. Whoever thought it was a good idea to have one category called “negative regulation of apoptosis” and another one called “positive regulation of anti-apoptosis”? By the way, the former category contains 705 genes, the latter one 32 genes. So, even if my focus is on the errors in mapping the categories to genes, it should be kept in mind that providing illogical categories means calling for illogical annotations.
Let me now get to my main point, outlining why the existing GO-to-gene mappings leave room for improvement. One of the typical scenarios where GO comes in handy is having a list of genes (a few tens to a few hundred) – usually as the result of some large-scale experiment – combined with the task of finding some kind of common feature or biological theme hidden in the gene list. To be able to do this, you need a comprehensive set of GO annotations for most if not all genes in the list. A GO mapping database that covers only a few well-studied genes is of no help; what you need is good coverage. Reaching this coverage must have been the main driving force for all GO mapping projects, with the consequence that annotation quality is sacrificed for enhanced quantity. These problems are to be expected and – if kept at a certain level – could probably be tolerated. Here is one more clipping from the GO website:
Although we endeavour to make mappings as accurate as possible, we cannot guarantee that the mappings provided by the GO project are either complete or exact. This may be due to the absence of definitions from GO terms or from terms in some external systems; the GO ontologies and the external database may also have changed since the mapping was made. Please report any errors or suggest alternatives to the GO helpdesk.
Sounds fair enough. The only problem is that working with GO for several years has brought me to the brink of calling the GO helpdesk with the recommendation to hire new staff and start all over again. (Sorry guys, this had to be said; I am already feeling better now.) Let me try to proceed systematically. I can see a number of problematic areas.
- High coverage is claimed and, at first glance, seems to be provided. A closer inspection shows that high coverage is provided only for very broad categories, such as “regulation of biological process” and the like. A large proportion of the coverage obviously does not come from manual curation but rather from the blind application of sequence motif databases. An example: If you are working on collagens, you can be sure that each of them has a couple of GO categories assigned. A closer inspection shows that most of the collagens have the same GO annotations, although you know that they are doing very different things in biology. There are many more examples like this, where GO annotations are assigned to broad protein classes (kinases, phosphatases, ubiquitin ligases, etc.) and each protein that has one of those domains will inherit this GO label – and nothing else.
- If you pick a random category (within your field of expertise), you will see that up to 50% of the genes assigned to this category don’t belong there, and that a similar number of genes that should be there are missing. These errors point to a certain lack of biological understanding on the curator’s side.
- There are many cases where the manual annotators apparently don’t understand the meaning of the GO categories. Admittedly, categories like “negative regulation of anti-apoptosis” do pose a certain challenge.
- Finally, there are major inconsistencies in the way related genes are treated. One has the impression that there are a number of curators, one for genes starting with letters A-C, one for D-F and so on, who don’t talk to each other and have very different opinions on how GO should work.
Let me give you an example, or rather a series of examples, to make my point clearer. A colleague of mine was trying to use GO for assembling a list of human and mouse cytokines. He wasn’t happy with the result, and when I tried it myself, I could see why he was frustrated. It all started out quite promising, as there is a GO category called cytokine activity, which contains lots of genes. As with most other GO terms, there are also several sub-categories, each of which has a lot of associated genes. In a perfect world, the combination of genes in the parent category and those in the children (and grand-children) would constitute a complete set of cytokines. Unfortunately, this is far from being the case. Here are a few things that go wrong:
- When looking at the genes in the parent category “cytokine activity”, you see some interleukins (IL12, IL17, IL19) but not the others. You also see BMP4 and BMP5, but not the other bone morphogenetic proteins. Where are the other family members? The answers differ: for the interleukins, the other members are found in sub-categories, e.g. IL10 is found in a category “interleukin-10 receptor binding”, which is a child of the cytokine category. Apart from the question why there is no such category for IL12, IL17 and IL19, this arrangement is o.k. and all interleukins will be covered by combining the cytokine child categories. Matters are different for the BMPs. The other BMPs are nowhere to be found within the cytokine hierarchy, implying that according to GO, BMP4 and BMP5 are cytokines while BMP2, BMP3, etc. are not.
- Most of the subcategories of “cytokine activity” have names like “XXX-receptor binding” and contain a subgroup of cytokines that bind to this particular receptor. For many subfamilies, the coverage of these sub-categories is far from being complete. Look at the category fibroblast growth factor receptor binding, which contains two human genes: Fgf10 and FLJ00383. I don’t want to discuss whether ligands of the FGF receptor are cytokines, but I would really like to know why the GO annotators think that FLJ00383 (a lysosomal ATP synthase) binds to the FGF receptor, but the other FGFs like FGF1, FGF2 … do not.
- There is a whole bunch of problems of the third kind listed above (annotators misreading what a category means) in the various sub-categories of this example. One such category is called prolactin receptor binding and – being a child of the cytokine category – is intended to contain ligands of this particular receptor. As expected, this category contains the prolactin gene. Unexpectedly, it also contains SOCS2, a “suppressor of cytokine signaling”. In a way, this classification is not completely wrong, as SOCS2 probably binds to the prolactin receptor. However, it does so from the cytoplasmic side and confers ubiquitination to the receptor – SOCS2 is certainly not a cytokine. Nevertheless, being a member of this subfamily, it inherits the “cytokine activity” label from its parent node and will show up as a cytokine in all GO-based studies. This is no one-off phenomenon; other subfamilies contain “cytokines” as unlikely as Cdk5 (a cell cycle kinase), erbin (binds to the cytoplasmic part of ErbB2 receptors), ErbB2 (a receptor, part of a receptor dimer), PIK3R1 (a PI-3 kinase), Syntenin-1 (another cytoplasmic adaptor protein), Taxilin, TRIP6, TRAP1 (a mitochondrial HSP75), and many others.
- Several important cytokines are missing. As an example, only very few members of the TNF family are represented (TNF, RANKL, CD40L), while the other 17 (!) are missing from the list.
- Consistency between species is very poor. When looking at the mouse list, we learn that murine BMP2 and BMP3 are considered cytokines (which is not the case for the human orthologs). On the other hand, the mouse has only 10 GO-approved chemokines, while humans are reported to have 44 of them (including the complement factor C5!).
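The inheritance effect behind several of these points can be reproduced in a few lines of Python. This is only a sketch on toy data: the `CHILDREN` and `DIRECT` dictionaries contain just a handful of the annotations mentioned in this post (with `PRL` standing for the prolactin gene), and the function name `genes_for` is my own invention, not part of any GO tool.

```python
# Toy data assembled from the examples in this post; the real
# annotation files are far larger and may have changed since.
CHILDREN = {
    "cytokine activity": [
        "interleukin-10 receptor binding",
        "prolactin receptor binding",
    ],
    "interleukin-10 receptor binding": [],
    "prolactin receptor binding": [],
}

# Genes attached directly to each category (no inheritance applied yet).
DIRECT = {
    "cytokine activity": {"IL12", "IL17", "IL19", "BMP4", "BMP5"},
    "interleukin-10 receptor binding": {"IL10"},
    "prolactin receptor binding": {"PRL", "SOCS2"},
}


def genes_for(term):
    """Genes annotated to `term` directly or via any descendant category."""
    genes = set(DIRECT.get(term, set()))
    for child in CHILDREN.get(term, []):
        genes |= genes_for(child)
    return genes


cytokines = genes_for("cytokine activity")
print(sorted(cytokines))
print("SOCS2" in cytokines)  # swept in through its child category
print("BMP2" in cytokines)   # annotated nowhere in the hierarchy, so never recovered
```

Collecting the children does rescue IL10, but the same mechanism drags SOCS2 into the cytokine list, and no amount of traversal can recover a gene such as BMP2 that was never annotated anywhere below the parent.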
I could go on forever, but it is getting late. I feel that I should end this post with some comforting remarks, saying that despite these shortcomings, GO is still a fantastic resource. I am just not in the mood to do so. The only thing I can say is that I don’t know of any better available resource, and this is why I will keep on using GO, at least for the foreseeable future.