voom method has now been published in Genome Biology
Gordon Smyth is well known for his development of the
limma Bioconductor package for the analysis of differential gene expression using microarrays. More recently his group has led the way in the development of software for the statistical analysis of gene expression using RNA-seq with the
edgeR Bioconductor package and the
voom() method in
limma. Today, Gordon spoke about modeling the variance in RNA-seq data for studying gene expression, in particular using the
voom() method, and contrasted this approach with that taken by
edgeR (and other Poisson/Negative binomial-like methods).
Mean-variance relationship is basically quadratic for RNA-seq counts
Gordon started by showing that the observed variance in RNA-seq experiments is comprised of two components - biological variance (i.e. the interesting stuff) and technical variance (e.g. measurement error). By comparing two biologically identical, but separately sequenced samples, Gordon showed that the technical variation is very well modelled by Poisson variation. However, the variation observed between two different biological samples is greater than Poisson variation; in statistical terminology the biological variation displays overdispersion.
There have been repeated (and misguided) claims in the literature that RNA-seq allows experiments to be performed without biological replicates. I think the reasons why this misguided notion is appealing is because RNA-seq promises to provide more precise estimates of gene expression than was possible with microarrays and that is interpreted as “more precise measurements means less replicates required”. This simply isn’t true - the technical variation may be less in RNA-seq compared to microarrays (debatable) but the large biological variation between samples still remains. This is why it is essential to have biological replicates in order to derive robust inferences from the data.
The overdispersion of the variance is observed as a quadratic mean-variance relationship, as opposed to the linear mean-variance relationship implicit in a Poisson model. All the data analysed by Gordon’s group clearly displays a quadratic mean-variance structure, reinforcing that methods based on a Poisson model will not suffice. This led to the development of the negative binomial model used by
edgeR (similar models of the variance are used by
DESeq and others).
The overdispersion parameter for gene , , must be estimated from the data and there are several ways to do this. Once is estimated the data can be modelled using a generalised linear model (more specifically, a log linear model). One estimator of is the so-called common dispersion model where each gene has the same dispersion value . A second estimator uses empirical Bayesian methods to “squeeze” the gene wise towards the common value . This sort of “squeezing” is necessary for the small sample sizes common to RNA-seq experiments. A third estimator is the trend estimator employed in
DESeq. The estimation of the dispersion parameter, and which estimator to use, is a field of ongoing research.
Modelling the variation is more important than getting the distribution right
edgeR work emphasised the development of exact conditional tests based on the negative binomial distribution. Gordon noted that he now considers this less important than doing a good job of the mean-variance relationship. The
voom() method is being developed to use this alternate angle to tackle differential gene expression analysis from RNA-seq data.
The read counts of gene in sample , , do not follow a Normal distribution, however ,the transformed response variable converges quickly to normality.
voom() models the log counts per million and fits a loess trend line to the scatterplot of variance vs. mean to create weights that are then fed into a standard
The variance model is at the observation level and the loess trend is robust against highly variable genes. Gordon presented the results of a simulation study where
DESeq and several other popular RNA-seq analysis methods. Of course, authors always display simulation results that favour their method, but it seemed like the simulation parameters were designed to favour the negative binomial-based models, such as
edgeR, so I found the simulation results quite interesting.
Gene-specific variation must be accounted for in the model
The idea of gene-specific variances has long been accepted in the microarray world, but is still not widely accepted for the analysis of RNA-seq data. Gordon presented several plots showing that tagwise estimates of dispersion had better goodness of fit statistics than either the common or trend estimates of dispersion.
In short, gene-specific variance exists and must be accounted for in the analysis.
To conclude, Gordon summarised the pros and cons of
- Fits an intuitive model
- The biological coefficient of variation (the biological variance divided by the mean expression) is interpretable
- Excellent statistical power
- It treats the dispersion as known (once estimated) and so can be a little liberal
- Can’t estimate the optimal prior weight (the prior weight is used in the empirical Bayes shrinking of the dispersion estimates)
- Computationally challenging to program (e.g. fitting ~30,000 GLMs, one per gene)
- More agnostic to the mean-variance relationship
- A seemingly natural although somewhat ad-hoc fold change shrinkage
- Easily estimates the prior weight
- Holds its size since it tracks the uncertainty of the empirical Bayes estimates throughout the model
- Feeds into many existing
- Wins all comparisons with other methods so far…
- Still determining…