Text mining by Microsoft

Posted by: Kay at Suicyte | July 23, 2007

Yesterday, I mentioned the problems that text miners have when they want to parse gene names out of scientific texts. Maybe they should just do it the Microsoft way. Few people know that Microsoft has incorporated a high-end text parser into MS-Excel, which automatically recognizes and corrects gene names. The recognition rate is so high that the user doesn’t even have to be bothered with a confirmation question. Here is how it works:

Text before MS-Excel:

Uniprot-ID	Gene	Description
SEPT7_HUMAN	SEPT7	Septin-7, Cdc10 homolog

Text after MS-Excel:

Uniprot-ID	Gene	Description
SEPT7_HUMAN	2007-09-07	Septin-7, Cdc10 homolog
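
For anyone who wants to check a gene list before letting a spreadsheet near it, here is a minimal sketch of the conversion shown above. It is only a rough approximation and not Excel's actual parsing rules: the prefix table, the function name `as_spreadsheet_date` and the default year are assumptions made for illustration, and MARCH1, DEC1 and TP53 are just extra example symbols.

```python
import re
from datetime import date

# Month-like prefixes a spreadsheet may read as a date. Illustrative and
# incomplete; this is an assumption, not Excel's real rule set.
MONTH_PREFIXES = {
    "JAN": 1, "FEB": 2, "MAR": 3, "MARCH": 3, "APR": 4, "MAY": 5, "JUN": 6,
    "JUL": 7, "AUG": 8, "SEP": 9, "SEPT": 9, "OCT": 10, "NOV": 11, "DEC": 12,
}

# Letters, an optional hyphen, then a one- or two-digit number.
PATTERN = re.compile(r"^([A-Za-z]+)-?(\d{1,2})$")


def as_spreadsheet_date(symbol, year=2007):
    """Return the date a spreadsheet might substitute for `symbol`, or None."""
    match = PATTERN.match(symbol.strip())
    if not match:
        return None
    month = MONTH_PREFIXES.get(match.group(1).upper())
    if month is None:
        return None
    try:
        return date(year, month, int(match.group(2)))
    except ValueError:
        # The number is not a valid day of that month (e.g. 0 or 31 in June).
        return None


for gene in ["SEPT7", "MARCH1", "DEC1", "TP53"]:
    print(gene, "->", as_spreadsheet_date(gene))
```

Any symbol for which this returns a date is worth importing as explicit text rather than letting the spreadsheet guess the cell type.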

Posted in bioinformatics, science, silliness | Tags: bioinformatics, Microsoft

Responses

🙂

As I always tell people, just because it’s in rows and columns doesn’t mean you need a spreadsheet.
By: nsaunders on July 24, 2007
at 8:43 am

See also the BMC Bioinformatics paper by Zeeberg et al (2004): “Mistaken identifiers: gene name errors can be introduced inadvertently when using Excel in bioinformatics” (http://www.biomedcentral.com/1471-2105/5/80)
By: Jan Aerts on July 25, 2007
at 3:47 pm

LEaD by excel,
They thought they excelLED
By: Animesh on July 26, 2007
at 4:29 am

Heh, I was reading that article last week.
These kinds of artificial intelligence techniques are really impressive.
By: dalloliogm on August 6, 2007
at 10:53 am

