Frequent readers of my blog will know that I am more into sequences than into structure. I don’t want to go into the details of why this is so; maybe another time. Here is the one-sentence version: My goal is the prediction of protein function, and – contrary to what is being reiterated in the literature – I am convinced that this is better done directly from sequence, paying close attention to evolution. Going via structure is a detour, if not a dead end. Ok, two sentences.
Occasionally, though, I have to do some structure analysis. Make no mistake, I use structures all the time. Nothing beats a structural superposition if you have to align two extremely divergent sequences. If there is a structure, that is. I also use structures (if available) as plausibility checks for my sequence-based predictions. And, from time to time, I have to deal with structures for other reasons.
Today was such a day. I was analyzing a sequence family (nothing exciting, it was contract work for a customer), with one member having a high-resolution X-ray structure. For one aspect of my analysis, I decided it was a good idea to select residues that are highly conserved but whose side chains point to the outside of the structure. The idea was that in real life these conserved surface residues might contact a potential interaction partner that is not present in the structure. This task looked easy enough, in particular as I have done similar things before. I could not remember how I had done it the last time, but at first I did not expect this to be a problem. There must be hundreds of programs out there that take a PDB file and give you a list of residues indicating their degree of surface exposure.
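For illustration, here is a minimal sketch of the selection step itself, assuming the two inputs are already available as per-residue tables. The function name, the dictionary layout and both cutoffs (0.8 for conservation, 30% for exposure) are hypothetical choices made for this example, not values from any particular tool.

```python
def conserved_surface_residues(conservation, exposure,
                               min_conservation=0.8, min_exposure=30.0):
    """Return residue numbers that are both conserved and surface exposed.

    conservation: dict mapping residue number -> score in [0, 1]
    exposure:     dict mapping residue number -> percent surface exposure
    Both cutoffs are illustrative assumptions and should be tuned per family.
    """
    selected = []
    for resnum in sorted(conservation):
        if resnum not in exposure:
            # Residue is absent from the structure (e.g. a disordered loop).
            continue
        if (conservation[resnum] >= min_conservation
                and exposure[resnum] >= min_exposure):
            selected.append(resnum)
    return selected


# Toy input: residue 13 has no coordinates, 12 is buried, 11 is variable.
conservation = {10: 0.95, 11: 0.40, 12: 0.90, 13: 0.99}
exposure = {10: 55.0, 11: 80.0, 12: 5.0}
print(conserved_surface_residues(conservation, exposure))
```

The hard part is not this filter but getting the exposure column in the first place, which is where the search below comes in.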
I turned to Google, but despite two hours spent at the computer, and using all the tricks in the book, the only promising hit I could find was a program called naccess. Plus lots of papers in NAR and Bioinformatics, either pointing to naccess or to some web pages talking about Error 404. This latter trend must have been blogged about before: it is great that journal articles describe useful software and web services, and that they also provide links. But why are a (perceived) 90% of these links broken after 2-3 years?
Anyway, if everybody is using naccess, why shouldn’t I? Well, there are a number of reasons. The first one is a note on the home page of naccess, saying “Industrial user/Profit-makers, please read this”. I hate it if a web page starts like this. It means that you have to pay for using the program or service. What’s more, those people (almost) never say “obtaining this software will cost you x $”. They always ask you to get in contact with some IP protection department, which involves talking to lawyers and signing some contract, which (of course) I am not allowed to do without talking to our lawyers first, who in turn find some conditions in the contract utterly unacceptable, while the other lawyers insist that this clause is essential, and so on, ad nauseam. If, after some discussion, we get to a point where the amount of money is actually mentioned, it typically exceeds my group’s annual budget (including hardware!) considerably. Ok, I must admit that I did not even try to buy naccess, so I have no idea about the special conditions applying here, but I do have related experience from previous occasions. Why do these departments always assume that people working for a company are swimming in money? Maybe some do, but I don’t.
This is not all. Even honorable academics (a.k.a. Jedi researchers) have a hard time getting their hands on naccess. As the download page says:
You will receive a compressed tar file via ftp containing everything you need. The tar file has been encrypted. You need to get a decryption key to decrypt the file. See later.
We ask users to complete a short Confidentiality agreement. Please print it out, sign it and return it to us via normal mail (Not email or fax please!).
You see, very convenient. When I read this page, it reminded me of other such examples from ancient times, before I succumbed to the dark side of science. It occurred to me that this kind of restriction only seems to be used for software dealing with structure analysis, never with sequence analysis. Did you have to sign a contract before using BLAST? FASTA? Do you have to pay $20,000 for using ClustalW (or T-Coffee, MUSCLE or MAFFT)? Is the use of Pfam or InterPro restricted to academics? Do you have to send something by paper mail to Sean Eddy before using HMMER? Of course not, what a ridiculous idea!
Structure analysis seems to be a different culture. Lots of restrictions wherever you look. There are free structure viewers (check the interesting posting on “Freelancing Science”), but if you read the fine print, almost all viewers capable of producing publication-quality output may not be used by company employees. (Ok, they may be used, but the $$$ and other conditions amount to them being useless for me.) There are some examples of viewers that are cheap or free for people like me; these are often scaled-down versions of commercial software. And if you look for structure analysis other than viewing and rotating PDBs on screen, matters look even bleaker.
Let us get back to my original surface exposure problem. After abandoning the idea of using naccess, I found one (probably) quick & dirty solution that worked for my one-off application: I turned to spdbv (DeepView), one of the PDB viewers I am allowed to use. This might not be the nicest program to look at (strange user interface, poor fonts, some Linux problems), but it is great for structure superpositions. It also turned out to have a feature to select residues on the basis of their % surface exposure. This was exactly what I needed, although I did not find a way of getting a text output of this information. Anyway, a little manual work – case closed. I should add that – after finishing my work – I did find a fine web server that does exactly what I wanted. It is called GetArea, and I have no idea why I didn’t see it in my previous searches. Maybe I should exercise my Google skills.
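For readers who want the missing text output, the underlying calculation is small enough to sketch in plain Python. The block below is a rough version of the classic rolling-probe idea (often called the Shrake-Rupley method): place test points on a sphere around each atom and count those not buried inside a neighbouring atom. It is not what naccess, DeepView or GetArea do internally. The atomic radii, the 1.4 Å probe, the number of test points and all function names are assumptions for illustration, and the all-against-all loop is slow on large structures. It reports absolute areas in Å²; converting to a percentage would need per-residue-type reference values, which are not included here.

```python
import math
from collections import defaultdict

# Approximate van der Waals radii in angstroms (assumed values;
# real tools use finer atom typing).
RADII = {"C": 1.7, "N": 1.55, "O": 1.52, "S": 1.8}
PROBE = 1.4  # assumed radius of a water-sized probe


def sphere_points(n):
    """Roughly evenly spaced points on a unit sphere (golden spiral)."""
    golden = math.pi * (3.0 - math.sqrt(5.0))
    points = []
    for i in range(n):
        z = 1.0 - (2.0 * i + 1.0) / n
        r = math.sqrt(1.0 - z * z)
        points.append((r * math.cos(golden * i), r * math.sin(golden * i), z))
    return points


def parse_atoms(pdb_text):
    """Read heavy atoms from ATOM records using the fixed PDB columns."""
    atoms = []
    for line in pdb_text.splitlines():
        if not line.startswith("ATOM"):
            continue
        # Element column if present, else guess from the atom name.
        element = line[76:78].strip() or line[12:16].strip()[0]
        if element == "H":
            continue
        key = (line[21], int(line[22:26]), line[17:20].strip())
        xyz = (float(line[30:38]), float(line[38:46]), float(line[46:54]))
        atoms.append((key, xyz, RADII.get(element, 1.8) + PROBE))
    return atoms


def residue_sasa(pdb_text, n_points=100):
    """Return {(chain, resnum, resname): accessible area in square angstroms}."""
    atoms = parse_atoms(pdb_text)
    unit = sphere_points(n_points)
    totals = defaultdict(float)
    for i, (key, (x, y, z), ri) in enumerate(atoms):
        exposed = 0
        for ux, uy, uz in unit:
            px, py, pz = x + ri * ux, y + ri * uy, z + ri * uz
            # A test point is buried if it lies inside any other expanded atom.
            buried = any(
                (px - ox) ** 2 + (py - oy) ** 2 + (pz - oz) ** 2 < rj * rj
                for j, (_, (ox, oy, oz), rj) in enumerate(atoms)
                if j != i
            )
            if not buried:
                exposed += 1
        totals[key] += 4.0 * math.pi * ri * ri * exposed / n_points
    return dict(totals)
```

Calling `residue_sasa(open("structure.pdb").read())` (the file name is a placeholder) gives one number per residue, which can then be printed or fed into whatever conservation filter you use.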
Here are a few more links on bioinformatics software availability:
- This is the position of the ISCB (via Deepak).
- Some interesting statistics on software availability, found on Flags and Lollipops.
- And finally, a missing link. A few months ago there was an interesting posting on this topic somewhere, followed by a lively discussion (with contributions by Sean Eddy and a few other notable figures). If I am not mistaken, I also posted a comment, but I cannot find the posting anymore. Is it possible to search for blog comments?