This is a follow-up to my recent post on the shortcomings of the current Gene Ontology (GO) database. As you might have noticed, my last contribution was written in a moment of frustration, caused by an attempt to use GO for something useful in the real (biological) world. Today, I have calmed down a bit, and would like to use this occasion to discuss a few possibilities for improving GO.
Here are some general factors that contribute to the problems with GO, as perceived by me and maybe others.
- GO might have been designed with something else in mind, differing from what GO is typically being used for (mainly annotation of gene lists, and gene set enrichment analysis). Many GO terms are virtually useless for this kind of application, although they might be very important for a more philosophical approach to biological ontologies. If you don’t know what I am talking about, go to Wikipedia and look up what an ontology is. You will find explanations like:
Ontology can be said to study conceptions of reality; and, for the sake of distinction, at least to the extent to which its counterpart, epistemology can be represented as being a search for answers to the questions “What do you know?” and “How do you know it?”, ontology can be represented as a search for an answer to the question “What are the knowable things?”.
People most likely to use GO, e.g. those in microarray analysis, don’t care a lot about ‘conceptions of reality’, but rather want to create fancy pie charts. Sorry for getting sarcastic again; I promise I will stop that now.
- As Eric Jain pointed out in a comment, part of the problem lies not with GO itself but with the assumptions that GO users have about GO’s purpose and quality.
- I don’t know how GO is actually managed. By the look of it, I would guess that those who make the categories and those who do the gene-mapping are different groups of people who don’t really talk to each other, at least not on a daily basis. In my previous post, I already addressed some of the issues resulting from different interpretations of GO terms.
- Finally, I think that the GO project has only limited access to real experts in the field of biological pathways. The problem here is that GO aims at covering all of biology, which clearly cannot be done by just a handful of experts – at least not in a thorough and comprehensive way.
So, what could be done to counteract those problems? This is not an easy task; otherwise it would have been solved by the GO consortium already. One should not forget that a large number of very smart people are already involved in the project. Here are just a few suggestions on my part, which might or might not be feasible. At least in my view they would help create a better GO, but who knows, maybe what I call ‘better’ would be considered ‘adulterated’ by others.
- The architecture of the categories should be simplified. Every single category should be scrutinized for relevance, necessity and lack of ambiguity. By this process, we should get rid of duplications like “negative regulation of apoptosis” and “positive regulation of anti-apoptosis”. Another candidate for deletion, or at least re-evaluation, would be the entire hierarchy “negative regulation of biological process”: is there really a common concept applying to genes that inhibit something? This hierarchy encompasses groups as disparate as “negative regulation of behavior” and “negative regulation of viral life cycle”. There are many more examples of terms that appear either unnecessary or redundant with others.
- The creation of categories should be done with gene-mapping in mind. The people who create two separate categories that have similar names, or describe similar concepts, should have a very clear idea which genes should go to what category.
- In a similar vein: the purpose of the categories should be documented in a way that allows the mapping people to judge which category is most suitable. If, for example, there turns out to be a good reason to keep inhibitors of apoptosis and activators of anti-apoptosis as separate groups (which I doubt), the corresponding documentation entries should say something like “this category A is meant to contain genes that do X and Y. Genes that do Z should not go here but rather go to category B”.
- The tools that are used by the term-to-gene mapping people should display these helpful documentation entries. They should also display, for each term assigned to a gene, the list of parent terms that are inherited in this process. This provision should help to avoid annotation traps like the one mentioned in my previous post: annotating SOCS2 as GO:0005148 “prolactin receptor binding” and not noticing that this automatically classifies SOCS2 as having GO:0005125 “cytokine activity”, because the latter is a parent term of the former.
- Ideally, the annotation of each gene should be done independently by two scientists. If there are differences, the people involved should discuss them and try to reach a consensus.
- Quality control should be enhanced. One important means is consistency checking between multiple species (e.g. human and mouse, or fly and mosquito). For many organism groups, orthology tables (like HomoloGene or Inparanoid) are available. Whenever orthologous genes are assigned to different GO categories, this should be seen as a reason for concern.
- It would also help to have a very dedicated person at the top of the project. I am thinking of someone like the Amos Bairoch of the early SwissProt days: somebody who cannot sleep well as long as there is a chance that errors remain in GO. (To be honest, I have no idea who is in charge of GO, so chances are that lack of dedication is not an issue.)
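Two of the suggestions above (showing the inherited parent terms, and checking orthologs for consistency) are mechanical enough to sketch in a few lines of Python. This is only a toy under stated assumptions: the `PARENTS` table is a hand-typed fragment containing the SOCS2 edge described above plus a link to the molecular function root, the mouse annotation is invented for the sake of the example, and the function names are my own, not part of any GO tool.

```python
from typing import Dict, Iterable, Set, Tuple

# Toy is_a fragment: term -> direct parents. The first edge is the one
# described in the post; the link to the root is added for illustration.
PARENTS: Dict[str, Set[str]] = {
    "GO:0005148": {"GO:0005125"},  # prolactin receptor binding -> cytokine activity
    "GO:0005125": {"GO:0003674"},  # cytokine activity -> molecular_function (root)
    "GO:0003674": set(),
}


def ancestors(term: str, parents: Dict[str, Set[str]]) -> Set[str]:
    """Return every term that is inherited by annotating a gene with `term`."""
    seen: Set[str] = set()
    stack = list(parents.get(term, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return seen


def implied_terms(direct: Iterable[str], parents: Dict[str, Set[str]]) -> Set[str]:
    """Direct annotations plus everything they imply through the hierarchy."""
    full = set(direct)
    for term in list(full):
        full |= ancestors(term, parents)
    return full


def ortholog_conflicts(
    annot_a: Dict[str, Set[str]],
    annot_b: Dict[str, Set[str]],
    orthologs: Iterable[Tuple[str, str]],
    parents: Dict[str, Set[str]],
) -> Dict[Tuple[str, str], Set[str]]:
    """For each ortholog pair, list terms that only one of the two genes carries."""
    conflicts: Dict[Tuple[str, str], Set[str]] = {}
    for gene_a, gene_b in orthologs:
        terms_a = implied_terms(annot_a.get(gene_a, set()), parents)
        terms_b = implied_terms(annot_b.get(gene_b, set()), parents)
        difference = terms_a ^ terms_b  # symmetric difference
        if difference:
            conflicts[(gene_a, gene_b)] = difference
    return conflicts


# Hypothetical annotations, made up for this example only.
human = {"SOCS2": {"GO:0005148"}}
mouse = {"Socs2": {"GO:0003674"}}
flagged = ortholog_conflicts(human, mouse, [("SOCS2", "Socs2")], PARENTS)
```

An annotation tool that showed the result of `ancestors("GO:0005148", PARENTS)` next to the chosen term would make the implied “cytokine activity” label hard to overlook, and anything that ends up in `flagged` would be a candidate for a second look by a curator rather than an automatic error.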
The previous suggestions are all intended to reduce the number of GO errors. Maybe the hardest problem is how to get a high coverage of gene annotations despite the high quality threshold that I am asking for. As I said before, this is probably not going to work without finding a way to get the real experts involved with the project. Paying world-class experts is probably not an option – we want to have a free resource after all. In my opinion, one way to go would be to have a GO-like project run by a renowned institution or publisher, and to find a way to make the permission to contribute to GO look like a great honor. Once this idea is firmly established in the minds of the scientific community, there will be no shortage of people who are eager to contribute. If the hosting organization is also running a range of review journals (Nature group et al.), this could be used for synergistic benefits, e.g. by requiring review authors to also provide their information in a GO-compatible format. Obviously, you would still require a number of paid ontology experts who can make sense of what the experts are writing – convincing a world-class biologist to use a controlled vocabulary is like asking Richard Stallman to use MS-Windows.