Posted by: Kay at Suicyte | July 23, 2007

Text mining by Microsoft

Yesterday, I mentioned the problems that text miners have when they want to parse gene names out of scientific texts. Maybe they should just do it the Microsoft way. Few people know that Microsoft has incorporated a high-end text parser in their MS-Excel program, which automatically recognizes and corrects gene names. The recognition rate is so high that the user doesn’t even have to be bothered with a confirmation question. Here is how it works:

Text before MS-Excel:

Uniprot-ID	Gene	Description
SEPT7_HUMAN	SEPT7	Septin-7, Cdc10 homolog

Text after MS-Excel:

Uniprot-ID	Gene	Description
SEPT7_HUMAN	2007-09-07	Septin-7, Cdc10 homolog
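
For anyone who would rather catch this kind of "correction" after the fact, here is a minimal Python sketch that flags values in the Gene column that look like dates. The function name `find_date_like_genes` and the two date patterns are assumptions made for illustration; they certainly do not cover every format Excel may produce.

```python
import csv
import io
import re

# Values that look like dates rather than gene symbols.
# Illustrative assumption: only ISO dates (2007-09-07) and
# day-month forms (7-Sep) are checked here.
DATE_LIKE = re.compile(
    r"^(\d{4}-\d{2}-\d{2}"
    r"|\d{1,2}-(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec))$"
)


def find_date_like_genes(tsv_text, column="Gene"):
    """Return (row_number, value) pairs whose `column` value looks like a date."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    hits = []
    for row_number, row in enumerate(reader, start=2):  # the header is row 1
        value = (row.get(column) or "").strip()
        if DATE_LIKE.match(value):
            hits.append((row_number, value))
    return hits


# The two tables from this post, written as tab-separated text.
before = "Uniprot-ID\tGene\tDescription\nSEPT7_HUMAN\tSEPT7\tSeptin-7, Cdc10 homolog\n"
after = "Uniprot-ID\tGene\tDescription\nSEPT7_HUMAN\t2007-09-07\tSeptin-7, Cdc10 homolog\n"

print(find_date_like_genes(before))  # []
print(find_date_like_genes(after))   # [(2, '2007-09-07')]
```

A check like this only detects the damage. It cannot tell you what the original gene name was, so keeping the untouched source file around is still the safer habit.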

Responses

  1. 🙂

    As I always tell people, just because it’s in rows and columns doesn’t mean you need a spreadsheet.

  2. See also the BMC Bioinformatics paper by Zeeberg et al (2004): “Mistaken identifiers: gene name errors can be introduced inadvertently when using Excel in bioinformatics” (http://www.biomedcentral.com/1471-2105/5/80)

  3. LEaD by excel,
    They thought they excelLED

  4. eheh I was reading that article last week.
    These kinds of artificial intelligence techniques are really impressive.

