LimROTS: A Hybrid Method Integrating Empirical Bayes and Reproducibility-Optimized Statistics for Robust Differential Expression Analysis

LimROTS: A Hybrid Method Integrating Empirical Bayes and Reproducibility-Optimized Statistics for Robust Differential Expression Analysis

LimROTS(
  x,
  niter = 1000,
  K = NULL,
  a1 = NULL,
  a2 = NULL,
  log = TRUE,
  verbose = TRUE,
  meta.info,
  BPPARAM = NULL,
  group.name,
  formula.str,
  robust = TRUE,
  trend = TRUE,
  permutating.group = FALSE
)

Arguments

x: A SummarizedExperiment object, where rows represent features (e.g., proteins, metabolites) and columns represent samples. The values should be log-transformed.
niter: An integer representing the amount of bootstrap iterations. Default is 1000.
K: An optional integer representing the top list size for ranking. If not specified, it is set to one-fourth of the number of features.
a1: Optional numeric value used in the optimization process. If defined by the user, no optimization occurs.
a2: Optional numeric value used in the optimization process. If defined by the user, no optimization occurs.
log: Logical, indicating whether the data is already log-transformed. Default is TRUE.
verbose: Logical, indicating whether to display messages during the function's execution. Default is TRUE.
meta.info: a character vector of the metadata needed for the model to run and can be retrieved using colData().
BPPARAM: A BiocParallelParam object specifying the parallelization backend (e.g., MulticoreParam, SnowParam). The default depends on the operating system: if the user is on Windows, SnowParam(workers = 2) is used; otherwise, MulticoreParam(workers = 2).
group.name: A string specifying the column in meta.info that represents the groups or conditions for comparison. group.name should be retrieved using colData() as factor.
formula.str: A formula string for modeling. It should include "~ 0 + ..." to exclude the intercept from the model. All the model parameters must be present in meta.info.
robust: indicating whether robust fitting should be used. Default is TRUE, see eBayes.
trend: indicating whether to include trend fitting in the differential expression analysis. Default is TRUE. see eBayes.
permutating.group: Logical, If TRUE, the permutation for calculating the null distribution is performed by permuting the target group only specified in group.name Preserving all the other sample information. If FALSE, the entire sample information retrieved from meta.info will be permuted (recommended to be set to FALSE).

Value

An object of class "SummarizedExperiment" with the following elements:

data: The original data matrix.
niter: The number of bootstrap samples used.
statistics: The optimized statistics for each feature.
logfc: Log-fold change values between groups.
pvalue: P-values computed based on the permutation samples.
FDR: False discovery rate estimates.
a1: Optimized parameter used in differential expression ranking.
a2: Optimized parameter used in differential expression ranking.
k: Top list size used for ranking.
corrected.logfc: estimate of the log2-fold-change corresponding to the effect corrected by the s model see topTable.
q_values: Estimated q-values using the qvalue package.
BH.pvalue: Benjamini-Hochberg adjusted p-values.
null.statistics: The optimized null statistics for each feature.

Details

The LimROTS approach initially uses limma package functionality to simulate the intensity data of proteins and metabolites. A linear model is subsequently fitted using the design matrix. Empirical Bayes variance shrinking is then implemented. To obtain the moderated t-statistics, the adjusted standard error $SE_{post} = \sqrt{s^2_{\text{post}} } \times unscaled SD$ for each feature is computed, along with the regression coefficient for each feature (indicating the impact of variations in the experimental settings). Then, by adapting a reproducibility-optimized technique known as ROTS to establish an optimality based on the largest overlap of top-ranked features within group-preserving bootstrap datasets, Finally based on the optimized parameters $\alpha1$ and $\alpha2$ this equation used to calculates the final statistics:

$$t_{\alpha_{(p)}} = \frac{\beta_{(p)}} {\alpha1 + \alpha2 \times SEpost_{(p)}}$$where $t_{\alpha_{(p)}}$ is the final statistics for each feature, $\beta_{(p)}$ is the coefficient, and $SEpost_{(p)}$ is the the adjusted standard error. LimROTS generates p-values from permutation samples using the implementation available in qvalue package, along with internal implementation of FDR adapted from ROTS package. Additionally, the qvalue package is used to calculate q-values, were the proportion of true null p-values is set to the bootstrap method pi0est. We recommend using permutation-derived p-values and qvalues.

This function processes a dataset using parallel computation. It leverages the BiocParallel framework to distribute tasks across multiple workers, which can significantly reduce runtime for large datasets.

References

Ritchie, M.E., Phipson, B., Wu, D., Hu, Y., Law, C.W., Shi, W., and Smyth, G.K. (2015). limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Research 43(7), e47

Suomi T, Seyednasrollah F, Jaakkola M, Faux T, Elo L (2017). “ROTS: An R package for reproducibility-optimized statistical testing. ” PLoS computational biology, 13(5), e1005562. doi:10.1371/journal.pcbi.1005562 https://doi.org/10.1371/journal.pcbi.1005562, http://www.ncbi.nlm.nih.gov/pubmed/28542205

Elo LL, Filen S, Lahesmaa R, Aittokallio T. Reproducibility-optimized test statistic for ranking genes in microarray studies. IEEE/ACM Trans Comput Biol Bioinform. 2008;5(3):423-431. doi:10.1109/tcbb.2007.1078

Examples

# Example usage:

data <- data.frame(matrix(rnorm(500), nrow = 100, ncol = 10))
# Simulated data
meta.info <- data.frame(
    group = factor(rep(1:2, each = 5)),
    row.names = colnames(data)
)
formula.str <- "~ 0 + group"
result <- LimROTS(data,
    meta.info = meta.info, group.name = "group",
    formula.str = formula.str, niter = 10
)
#> No top list size K given, using 25
#> Initiating limma on bootstrapped samples
#> Using MulticoreParam (Unix-like OS) with two workers.
#> Optimizing a1 and a2
#> Computing p-values and FDR
#> qvalue() failed (return NULL): missing values and NaN's not allowed if 'na.rm' is FALSE