Posted by: Kay at Suicyte | April 6, 2008

Random Stuff, April 08

Here are just two interesting stories I read on other people's blogs:

First, Jake Young at Pure Pedantry blogs about a recent Cell paper by Sakaue-Sawano et al., who present a clever application of protein ubiquitination for visualizing the cell cycle stage of cells in vivo. Lars Juhl Jensen at Buried Treasure has also picked up this story, and those two blogs provide a lot of detail on the method, including a link to a nice video showing HeLa cells passing through three cell cycles. In brief, the authors of this paper exploit the fact that several protein ubiquitination systems are only active during particular phases of the cell cycle. On the one hand, there is the APC/Cyclosome system, which degrades target proteins only in late mitosis (APC means Anaphase Promoting Complex) and in G1 phase. On the other hand, the SCF-Skp2 system is mainly active in S and G2 phases. By coupling two different dyes to target proteins of APC/C and SCF-Skp2, respectively, it was possible to observe an oscillation between red and green colors as the cells go through the different cell cycle phases.
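The logic behind the oscillation can be written down as a toy lookup. This is only a sketch of the reasoning in the paragraph above, not the authors' method: the phase names, the `ACTIVE` table and the `visible_reporters` function are made up for illustration, and it assumes that a labeled target is visible whenever its degradation system is inactive, ignoring the time the label needs to accumulate or disappear. Which label is red and which is green is not stated here, so the sketch names labels by the system that degrades them.

```python
# Cell cycle phases, with mitosis split into early and late (an assumption
# made so that "late mitosis" from the text can be represented).
PHASES = ["G1", "S", "G2", "M_early", "M_late"]

# Phases in which each ubiquitination system degrades its targets,
# taken from the description above.
ACTIVE = {
    "APC/C": {"M_late", "G1"},
    "SCF-Skp2": {"S", "G2"},
}


def visible_reporters(phase):
    """Return the systems whose labeled target is NOT degraded in this phase."""
    if phase not in PHASES:
        raise ValueError(f"unknown phase: {phase}")
    return sorted(
        system for system, phases in ACTIVE.items() if phase not in phases
    )


if __name__ == "__main__":
    for phase in PHASES:
        print(phase, visible_reporters(phase))
```

Under these simplifications, only the label on the SCF-Skp2 target is visible in G1, and only the label on the APC/C target is visible in S and G2, which is the alternation between the two colors.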

On a very different note, Peter Murray-Rust blogs about the lack of data-mining access to the PubMed Central resource. He goes on to discuss whether – in the light of this shortcoming – PMC can still be considered an open access resource. I can understand his concern and would certainly welcome it if PMC and other open access scientific repositories could be used for automatic text mining efforts. However, what really struck me was the statement:

When George Bush signed the mandate he clearly envisaged that the information should be used for the benefit of human health…
…and this means text-mining.

I am not sure how serious Peter was when writing this. Not too serious, I hope. I cannot think of a single example where text-mining has ever made a major contribution to solving any real-life biomedical problem. Even if there are such examples, their number will be small. If we compare the health benefits from text mining efforts to those provided by real (human) scientists reading the literature, I have no doubt that the latter would prevail by a big margin.

There should be no doubt about it: it would clearly be a good thing to enable text-mining on PMC. However, describing the current situation of free access to PMC papers for scientists as useless without added text-mining capabilities appears to be, well, kind of biased.



  1. I must disagree. Humans cannot compete with machines, not at scale anyway. That’s the argument Mahalo etc. make with respect to Google. That we need humans is only a manifestation of how poor text mining algorithms are today (at least publicly available ones), of the lack of linked data, and of the inability to extract meaningful information.

    The goal of text mining is not to find things humans can, but to find things they might miss. Also the idea is to integrate mining papers with other information to highlight or flag something that a human can then go and examine.

    I do think that the point of view above is more relevant in distributed organizations like a pharma company, but I don’t think that the importance of text mining should be diminished. One could argue that if clinical trial results are published appropriately, there will be a huge impact from mining those publications, for potential adverse effects, etc. In addition, text mining is critical for making other business intelligence decisions and for creating content databases, which are very relevant.

    I could go on. One could make arguments from both sides, but in the grand scale of things, allowing machines to do a lot of work is essential.

  2. I think there are 2 issues here.

    (1) Should public, open data be amenable to text mining? Yes, of course.

    (2) Is text mining in its current state of any practical use?
    I am not an expert – there may be really good examples of which I’m not aware. One tool that works quite well is RLIMS-P, for phosphorylation information. Aside from that, I’d agree with Kay that practical examples of text mining are few and far between.

    However, we have to assume that data mining will continue to improve, either through better algorithms or better markup (semantic web). So making data available for that purpose is important.

  3. There are examples, but unfortunately they are the ones that work best in proprietary environments (driven either by NLP, or lots of ontologies). What’s missing is this linked data driven discovery that would even out the playing field a little.

  4. It would seem to me that Peter and a lot of other people at the meeting in Dagstuhl have missed the fact that you can download the entire Open Access subset of PubMed Central from their FTP service. So NCBI simply blocks robots because they want you to access the data in the right way.

    Regarding the usefulness of text mining, I think that it is very easy to underestimate its importance.

    1) Every researcher relies on information retrieval methods to find the papers in the first place. If you cannot find the relevant papers, you cannot build upon what others have done before you.

    2) I agree that the information extraction methods that were mentioned by others do not directly lead to new discoveries. However, they are frequently used to help curators make the databases that we depend on every day.

    3) There are actually cases where text data mining was used to make discoveries of direct medical relevance. The most famous examples are the links between Raynaud syndrome and fish oil and between migraine and magnesium deficiency.

  5. […] response to my concern about access to the full text in PubmedCentral the Blog Suicyte Notes questions the value of text-mining: I cannot think of a single example where text-mining has ever […]

  6. As you will have noticed, my posting was somewhat more provocative than usual. I am always sensitive to overhyped areas of bioinformatics, and besides systems biology, text-mining is a prime contender.
    Unlike systems biology, I am convinced that text-mining could become useful for the molecular biosciences in the intermediate term (i.e. during my lifetime). I should add that I am only talking about the molecular biosciences in the broadest sense here – I know that Peter is talking about chemistry, where the situation might be entirely different.
    Whenever I attend a bioinformatics meeting, last time ISM07 (see here and here), I hear lots of fancy talks about text mining with lots of promises. However, when I come back and have a look at the available text mining tools, or at databases that have been derived from text mining work, I never find anything that looks remotely useful. It is possible that Deepak is right and the useful tools are all proprietary. On the other hand, there are so many text mining efforts paid for with grant money (e.g. EC framework programs), and I really wonder why nothing more useful than iHOP has come out so far.

  7. I also wonder why there are so few tools. I suspect it’s because text mining, especially NLP, is hard, at least in the absence of linked data. That said, Peter’s chemistry background does lend itself to being more irritated. If you could mine InChIs etc. from papers, it would be a huge plus.
