This week, I found two strange but interesting papers, one from the area of bioinformatics, the other from gene silencing. The unifying topic is that in both published works something went wrong, but the results were nevertheless as good – or even better – than intended.
The first paper, published in the latest issue of Nature Biotechnology, is from Mark Styczynski et al. The title is “BLOSUM62 miscalculations improve search performance“. I must say that I was very surprised to see a paper like this. BLOSUM substitution matrices are widely used throughout bioinformatics. In particular the BLOSUM62 version is popular as it is used per default in NCBI’s protein BLAST applications. I am a heavy user of BLOSUM matrices myself; on my computer it is the BLOSUM45 matrix doing most of the protein comparison work. In case you wonder what I am talking about: When you want to compare protein sequences, it is not good enough to just score identical or non-identical residues – you need some measure of amino acid similarity. For this purpose, you (or rather your program) turns to a ‘similarity matrix’, a.k.a. ‘substitution matrix, which tabulates a similarity value for each possible amino acid comparison. Initially, a number of different similarity matrices have been tried. It soon turned out that the best performance was obtained from matrices that do not score some kind of physical similarity between amino acids, but rather the frequency of one amino acid being substituted by evolution for another one. For example, you often find Leucine replacing Isoleucine (resulting in a good score for the pair Leu:Ile) but you rarely find Leucine replacing Aspartate (resulting in a bad score for Leu:Asp). In 1992, Steven and Jorja Henikoff analysed a large number of amino acid substitutions in a database of sequence alignments at varying evolutionary distances. The result of this analysis was the BLOSUM series of substitution matrices, with the most popular BLOSUM62 being derived from sequence alignments with more than 62% sequence identity (if I remember correctly).
Now, 16 years later, Styczynski et al apparently found a bug in the program used by the Henikoffs to calculate the BLOSUM matrices. In their paper, the authors describe a re-calculated ‘correct’ version of the BLOSUM62 matrix and compare it with the ‘wrong’ Henikoff version for its performance in sequence comparison applications. Most surprisingly, they report that the original BLOSUM62 version performs better than the correct one!
Here is a quote from the Styczynski paper:
We find it interesting that the BLOSUM62 matrix is used every day (and more interesting still that its derivation is a common topic in computational biology classes), and yet we can find no previously published mention of any of the errors discussed here. We did find that some of the errors were fixed in later tangential work by the original authors, but the ‘correct’ matrices have never been published or adopted. We also note that the existence of statistically significant improvements due to (essentially random) software errors supports the notion that there is significant room for improvement in our understanding of protein evolution. Of course, software errors are quite common and nothing special; however, it is at least a curiosity that these errors stayed buried for so long and have been improving BLAST searches (ever so marginally) for the past 15 years.
Maybe someone should have a look at the BLAST sources and try to find the bug that makes BLAST perform so well.