coli which consists of all of the GEO samples for E. coli), we believe they capture very well what is done in practice to generate large sets of samples. Namely, the samples from many series of experiments are compiled together with little regard for experimental diversity. The typical goal is simply to get "enough samples" to yield robust estimates of the correlation. Thus, a minimum of 50 samples is typically considered sufficient to compute pairwise gene correlations, with the generally accepted belief that, when it comes to repositories, "the bigger the better." While artificial, partial compendia therefore illustrate precisely how repositories are constructed in practice. Furthermore, while we used partial compendia to illustrate the extent of the problem, we also conducted analyses on the full compendia, which could be viewed as all available samples for these organisms. The problems we've identified are not limited to the partial compendia, as we illustrated in our analysis of the full compendia in conjunction with the operon and pathway data.
We also note that the current "bigger is better" philosophy does have some merit. With additional samples, it is certainly more likely that on or off states will be included in the sample. However, the important caveat is that if a researcher adds substantially more samples and, for a particular pair of genes, the additional samples have little variation in expression levels, then the observed variability for the gene pair will decrease, potentially reducing the statistical evidence of underlying association (correlation). In other words, it is possible for the sampling bias to increase with the addition of more samples. This would happen, for example, when additional technical or biological replicates of experiments are added.
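The caveat above can be illustrated with a small simulation. The following is a minimal sketch, not the analysis performed in this paper: it assumes two co-regulated genes that share an on/off state and have independent measurement noise, and every number in it (expression levels, noise, sample counts) and every function name is a hypothetical choice made for illustration only.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def simulate_pair(n_off, n_on, rng, off_level=4.0, on_level=8.0, noise_sd=1.0):
    """Simulate two genes that share an on/off state but have independent noise.

    All levels and the noise scale are illustrative assumptions.
    """
    levels = [off_level] * n_off + [on_level] * n_on
    gene_a = [lvl + rng.gauss(0.0, noise_sd) for lvl in levels]
    gene_b = [lvl + rng.gauss(0.0, noise_sd) for lvl in levels]
    return gene_a, gene_b


rng = random.Random(0)  # fixed seed so the run is reproducible

# A small, diverse compendium: 25 "off" samples and 25 "on" samples.
a_small, b_small = simulate_pair(25, 25, rng)

# 1000 additional samples that are all replicates of the "on" condition.
a_extra, b_extra = simulate_pair(0, 1000, rng)

r_small = pearson(a_small, b_small)
r_big = pearson(a_small + a_extra, b_small + b_extra)

print(f"diverse compendium (n=50):       r = {r_small:.2f}")
print(f"after adding replicates (n=1050): r = {r_big:.2f}")
```

Under these assumptions, the enlarged compendium yields a weaker observed correlation than the small, diverse one, even though the underlying co-regulation of the two genes has not changed. The added samples contribute noise within a single state but almost no additional between-state variation.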
Importantly, in many procedures and papers utilizing pairwise gene correlations from large sets of gene expression data (Westover et al., 2005; Margolin et al., 2006; Faith et al., 2007; Okuda et al., 2007; Chandrasekaran and Price, 2010), there is an implicit assumption that bigger will always be better. Our proposed gene flagging approach limits the errors in downstream inference with these methods that result from analyzing genes showing little change in activity across the set of experiments being analyzed. As the field transitions to RNAseq technologies, these problems will not go away. Since RNAseq is merely an alternative way to quantify genome-wide expression levels, in principle the same sampling bias issues, leading to biased estimates of pairwise gene correlation, will still exist. Further research is needed to evaluate whether the approaches proposed in this paper are appropriate for RNAseq samples.
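To make the idea of flagging genes with little change in activity concrete, here is a minimal sketch. It is not the flagging procedure proposed in this paper, whose details are not given in this section. Instead it uses a simple stand-in: a gene is flagged when the standard deviation of its expression across samples falls below a cutoff. The function name, the `min_sd` threshold, and the toy expression values are all hypothetical.

```python
import statistics


def flag_low_variation_genes(expression, min_sd=0.5):
    """Return the set of genes whose expression barely changes across samples.

    expression: dict mapping gene name -> list of expression values.
    min_sd: hypothetical cutoff on the sample standard deviation.
    """
    flagged = set()
    for gene, values in expression.items():
        # Genes with fewer than two samples cannot be assessed, so flag them too.
        if len(values) < 2 or statistics.stdev(values) < min_sd:
            flagged.add(gene)
    return flagged


# Toy data: geneA and geneC switch between states, geneB stays nearly flat.
expression = {
    "geneA": [2.1, 2.0, 7.9, 8.2, 2.2, 8.0],
    "geneB": [5.0, 5.1, 4.9, 5.0, 5.1, 5.0],
    "geneC": [1.0, 6.0, 1.2, 6.1, 0.9, 5.8],
}

flagged = flag_low_variation_genes(expression)
print(sorted(flagged))
```

Correlations involving a flagged gene could then be excluded or treated with caution in downstream inference. The same filter could in principle be applied to RNAseq-derived expression values, though whether a threshold of this kind is appropriate there is exactly the open question raised above.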