One of the things that can drive me up the wall is reading a paper that describes the discovery and characterization of a new protein without giving a clue to the protein’s identity. My bad mood is caused by a paper in the current issue of Cell, entitled “SCRAPPER-Dependent Ubiquitination of Active Zone Protein RIM1 Regulates Synaptic Vesicle Release”. I am interested in this protein because i) I am interested in almost anything connected to ubiquitination, ii) ubiquitination as a regulator of synaptic vesicle release is a new and rather unorthodox concept, and iii) the ubiquitination target RIM1 is no stranger to me.
The new Cell paper gives lots of information, a good part of which I still have to digest, but it lacks one important piece of information: what the heck is SCRAPPER? This situation is not uncommon; I cannot count the occasions on which I had to use BLAST on published PCR primer sequences to find out what protein a paper was talking about. The case of SCRAPPER has turned out to be particularly recalcitrant. Here is a brief description of what can be done in situations like this.
First, let us look at how the authors identified SCRAPPER. In the results section, they say:
We hypothesized that an E3 capable of regulating synaptic function would be membrane bound and would be expressed in neurons. To test this hypothesis we screened the human genome for genes whose coding sequence contained an F box domain (characteristic of E3 ligases), a membrane-targeting sequence, and whose promoter region contained both a neuron-restrictive silencing element and a cAMP-response element (CRE) within 3 kb upstream of exon 1. Only one gene was found with all of these properties. We cloned a full-length cDNA for the mouse ortholog and named the encoded protein “SCRAPPER.”
Reading this (and straying from the topic of this post), I must ask myself: is this how I would search for an E3 that regulates synaptic function? The answer, as you might expect, is a resounding No. For one thing, I would not restrict my search to the 69 F-box proteins in the human genome (which do not have a track record of being membrane-associated) but would also take into account the 291 RING finger proteins, the 27 HECT proteins, the 6 U-box proteins, the 41 SOCS box proteins, maybe also the 126 BTB proteins, and a few more. I am also not so sure about the NRSE and CRE elements. So far, so good. This is what the authors offer by way of SCRAPPER description:
SCRAPPER is a 438 amino acid protein that contains an F box, leucine-rich repeats (LRR), and a CAAX domain. The CAAX domain is a carboxyl-terminal membrane-sorting signal induced by prenylation.
Ok, so SCRAPPER is an FBXL protein (FBXL means F-box + LRR; there are also FBXW and FBXO subclasses). Maybe the length of 438 residues and the CAAX motif are sufficient for identification, but in a (moderately) ideal world, we should not be forced to do this. Instead, most journals require the authors to submit new sequences to a database and supply an accession number. Apparently this is what the authors did; they say:
The NCBI accession number of Scrapper is 918964
One could be pedantic and complain that the NCBI is not a database but rather an institution that hosts different databases, each one with a different accession number system. However, there is a search page where you can search all NCBI accession numbers in one go. I tried this on 918964 and retrieved one gene entry (from Chlamydophila pneumoniae), one EST entry (saying “has been retired”), and not much else.
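With the accession number leading nowhere, the fallback mentioned earlier (a length of 438 residues plus a C-terminal CAAX motif) could be sketched roughly as follows. This is only an illustration, not anything the authors did: the function names and the toy sequences are made up, and the regular expression is a deliberately strict simplification of CAAX (cysteine, two aliphatic residues, any final residue). Real prenylation motifs are more permissive, and database entries often differ in length.

```python
import re

# Simplified CAAX pattern (an assumption for illustration):
# Cys, then two aliphatic residues, then any residue, at the C-terminus.
CAAX = re.compile(r"C[AVILM]{2}[A-Z]$")


def matches_description(seq, length=438, tolerance=0):
    """True if seq has roughly the given length and ends in a CAAX-like motif."""
    seq = seq.strip().upper()
    length_ok = abs(len(seq) - length) <= tolerance
    return length_ok and CAAX.search(seq) is not None


def filter_candidates(candidates, length=438, tolerance=0):
    """Return the names of candidate sequences that fit the description."""
    return [
        name
        for name, seq in candidates.items()
        if matches_description(seq, length, tolerance)
    ]


# Hypothetical toy sequences, not real proteins.
toy_candidates = {
    "toy_a": "M" + "A" * 433 + "CVIL",  # 438 residues, CAAX-like end
    "toy_b": "M" + "A" * 437,           # 438 residues, no CAAX
    "toy_c": "M" + "A" * 100 + "CVIL",  # CAAX-like end, wrong length
}

print(filter_candidates(toy_candidates))
```

In practice one would feed in the F-box/LRR proteins of the organism in question and allow some tolerance on the length, since alternative start codons and splice variants easily shift the reported size by a few residues.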
Other tricks that have been applied successfully in similar situations include:
- Checking the figures for snippets of sequence alignments – no luck in the current manuscript
- Checking the methods section to see if the authors made an anti-peptide antibody. When they do, they typically show the sequence of the peptide used for immunization. Often, this information is sufficient to identify the protein. In this case, they did raise an antibody, but say only that it is “directed to amino acid residues 321–380 of mouse SCRAPPER”, which is not very helpful if you don’t know the sequence.
- Checking the methods section for PCR primers, RNAi constructs, or the like. If specified, they can often be used for identifying the correct sequence. Nothing useful was found in this manuscript.
- Checking EMBL/GenBank databases for recent entries submitted by the authors of the paper. Often, the authors submit the sequence with a different name and decide to change the name again for the publication. Unfortunately, this didn’t work either.
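For the primer trick in the list above, the proper tool is BLAST against a nucleotide database. As a minimal local sketch of the same idea, assuming you already have a handful of candidate cDNA sequences at hand, one can look for an exact match of the published primer on either strand. The function names and the toy sequences below are hypothetical, and exact matching is a simplification (BLAST tolerates mismatches, this does not).

```python
def reverse_complement(seq):
    """Reverse complement of a DNA sequence (A, C, G, T only)."""
    table = str.maketrans("ACGTacgt", "TGCAtgca")
    return seq.translate(table)[::-1]


def find_primer(primer, sequences):
    """Report the first exact match of a primer on each strand of each sequence.

    Returns a list of (name, strand, position) tuples; positions are
    0-based offsets on the forward strand of the candidate sequence.
    """
    primer = primer.upper()
    queries = (("+", primer), ("-", reverse_complement(primer)))
    hits = []
    for name, seq in sequences.items():
        seq = seq.upper()
        for strand, query in queries:
            pos = seq.find(query)
            if pos != -1:
                hits.append((name, strand, pos))
    return hits


# Hypothetical toy sequences, not real cDNAs.
toy_cdnas = {
    "cand1": "AAAATGGCCTTTGGGAAA",
    "cand2": "CCCCCCCCCC",
}

print(find_primer("ATGGCC", toy_cdnas))  # forward primer
print(find_primer("TTTCCC", toy_cdnas))  # reverse primer
```

A forward and a reverse primer that both hit the same candidate, on opposite strands and at a plausible distance, make a fairly convincing identification.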
Is this the end of the line? Normally, I would say yes. In this particular case, I got lucky. I searched the patent database by the authors’ names (no success) and by the term ‘SCRAPPER’. Surprisingly, the latter gave a promising hit to 11 sequences, all attributed to a Japanese patent, JP 2007044041-A. The reason why I did not find this while searching by author name: in the paper, the corresponding author’s name is Mitsutoshi Setou, while in the patent the name is Mitsutoshi Sedo. Looks like this entry has been “lost in transliteration”. Some of the patent sequences correspond to RIM1, but others correspond to an F-box/LRR protein called FBXL2.
Was it due to space limitations that the authors could not squeeze in a sentence saying that their SCRAPPER protein is identical to FBXL2, described in 2005 as a “geranylgeranylated cellular protein required for hepatitis C virus RNA replication”? And why do we need a new name for a protein that already has a rather nice and systematic name? I am afraid we will never know.
Scott McGinnis (supposedly the one from the NCBI, author of the famous ‘BLAST-announce’ mails) has provided us with a link to the proper GenBank entry, ABU95014. Since he found it by the same kind of search I had done before writing this post, I can only guess that it was not yet available at that time (the reported creation date is September 7). Very importantly, SCRAPPER is not FBXL2 (as I had suggested in the original post on the basis of my patent search) but rather FBXL20 (aka FBX2-like). I should have been more suspicious, as FBXL2 has a slightly different size from the protein mentioned in the paper. On the other hand, size differences are not uncommon; most proteins can be found in the databases with at least two differing sizes, typically due to ‘alternative’ splicing or the assumption of a different start codon.
Nevertheless, the main point of my post remains unchanged. It was obviously not directed against the NCBI for not providing the database entry, but rather against three common bad habits:
- Not mentioning (or even concealing) the identity of a “novel gene”
- Inventing a new gene name for an already named gene
- Or, if there is good reason for introducing a new name, failing to mention that the “novel gene” is identical to the gene previously named XXX.