I normally do not blog about topics related to my daytime job, which involves a lot of microarray data analysis. However, a series of recent blog posts [here, here and here] talk about microarray-related problems that differ so much from my own experiences that I cannot let them go uncommented.
I am the last person to claim that microarrays are a perfect tool for tackling all questions conceivable. They are not. DNA microarrays can be seen as some kind of hammer that is being (rightfully) applied to a few nails, but unfortunately also to lots of objects with no nail-like properties whatsoever. Microarray data are problematic in many different ways. However, we should be careful not to throw out the baby with the bath water.
Here are the main points of criticism that have been raised in the recent posts, along with my comments. I might exaggerate to some extent, but this just serves to make my point clearer.
1) Microarrays are useless, because it has been shown that protein levels correlate poorly with mRNA levels. You hear this argument a lot, especially from the mass-spec people, who want to convince you that only they have a handle on the truth. I admit freely: microarrays are mostly useless if you want to learn about protein levels. This is not a nail, go use another tool. You should use microarrays mainly if you are interested in mRNA levels. There are lots of interesting applications for that, e.g. learning which transcription factors are activated. Several stress responses, including those to toxic substances, lead to a dramatic and very specific induction of certain mRNAs. You don’t have to know if the corresponding proteins are really being made; the transcriptional response is the earliest and most specific indicator of many stress conditions. This knowledge can be very useful in its own right. Just don’t try to predict whether there is more protein A in the cell than protein B, just by looking at their microarray signals.
By the way, microarrays are somewhat better at judging changes in protein levels than the protein levels themselves. But still, if protein levels are what you are after, you should turn to another tool.
2) Microarray experiments cannot be trusted because the statistical significance values are wrong. This argument is reiterated here, and the author certainly has a point. Somewhat surprisingly, the examples used in the blog post talk about genetic association studies rather than the common gene-expression microarrays. There also seems to be some confusion about the number of SNPs vs the number of genes. Nevertheless, the main problem is shared between GWAS and transcriptomics studies: a microarray gives you tons of data, and the chance that at least one of the genes appears as strongly regulated by chance alone is substantial. On the other hand, this ‘multiple testing’ problem is well known in the microarray field and is routinely taken into account. There are methods to correct the p-values for this bias (best known is the ‘Bonferroni correction’). Thus, a situation similar to the one described in the blog post would certainly not reach a p-value of 0.05, at least not in a responsible microarray analysis.
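To make the multiple-testing point concrete, here is a minimal sketch of the Bonferroni correction in Python. The number of genes (20,000) and the raw p-values are hypothetical values chosen purely for illustration, and the function names are my own; this is not taken from any particular analysis package.

```python
def expected_false_positives(n_tests, alpha=0.05):
    """Expected number of tests passing the threshold by chance alone,
    assuming no gene is truly regulated."""
    return n_tests * alpha


def familywise_error_rate(n_tests, alpha=0.05):
    """Probability of at least one false positive among n_tests
    independent tests at an uncorrected threshold alpha."""
    return 1.0 - (1.0 - alpha) ** n_tests


def bonferroni_adjust(p_values, n_tests):
    """Bonferroni correction: multiply each raw p-value by the number
    of tests performed and cap the result at 1."""
    return [min(1.0, p * n_tests) for p in p_values]


# Hypothetical array with 20,000 genes (an assumption for illustration).
n_genes = 20000

# Without correction, many genes look 'significant' by chance alone.
print("expected false positives:", expected_false_positives(n_genes))
print("chance of at least one:", familywise_error_rate(n_genes))

# Hypothetical raw p-values for three genes on that array.
raw = [1e-7, 1e-4, 0.03]
adjusted = bonferroni_adjust(raw, n_genes)
for p, q in zip(raw, adjusted):
    verdict = "significant" if q < 0.05 else "not significant"
    print(f"raw p = {p:g}, adjusted p = {q:g}, {verdict}")
```

In this sketch only the gene with a raw p-value of 1e-7 survives the correction; a raw p-value of 0.03, which looks convincing in isolation, is nowhere near significant once 20,000 tests are accounted for. Bonferroni is the simplest and most conservative choice; other correction methods exist but are not shown here.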
3) Batch effects play a major role and often conceal the real regulation. Admittedly, there are batch effects. However, with modern microarray platforms and hybridization methods they can be safely neglected – at least in comparison to other common noise sources. Obviously, batch effects depend on the technology used. I have experience with three different microarray platforms (two major vendors and one type developed by the company I work for), and for each of them the batch effects were typically much smaller than the noise from sample preparation or inter-individual differences.
While we are talking about noise sources, here are what I consider the main offenders:
1) Sampling. Particularly problematic when dealing with surgical or biopsy samples. Are you sure that each of your biopsies samples exactly the same tissue structure? With the same relative proportions of cells? Same amount of blood in the tissue samples? Least problematic when comparing things like treated and untreated cell lines.
2) Inter-individual differences. This problem is often under-appreciated but is slowly gaining publicity. Most problematic when dealing with human samples or other outbred (animal) populations. The differences between ‘healthy’ tissue of two donors are often much more pronounced than between ‘healthy’ and ‘diseased’ tissue of the same donor. Less problematic when dealing with inbred strains or cell culture. Even then, there still might be inter-individual differences related to e.g. nutrition status, circadian effects, etc.
3) Extreme amplification protocols. For many microarray studies, the available material is severely limited. Compliance of tissue donors is often inversely correlated with the size of the biopsy needle. There are several protocols for getting sufficient cDNA for microarray analysis out of very small samples, and some of them are clearly better than others. However, all of them share one common problem: less starting material means more dramatic amplification, which in turn means more noise.
Needless to say, most of these problems can be overcome by using really large sample numbers. Unfortunately, this is often impossible due to limited availability of samples or money. As a consequence, we have to live with the shortcomings mentioned above. I usually recommend that microarray results should not be considered the final outcome of an experiment, but rather a method for identifying candidate genes that can be used for a more detailed follow-up study.
For the sake of full disclosure: if you haven’t noticed already, I am working for a company that sells microarrays, microarray services and microarray data analysis. Obviously, this affiliation might bias my view of things. Nevertheless, I speak only for myself and not for my employer. I have tried my best to keep this brief discussion as unbiased as possible; it is just meant to reflect my personal experiences from about 10 years of microarray data analysis.