Title: | RNA-Seq Generation/Modification for Simulation |
---|---|
Description: | Generates/modifies RNA-seq data for use in simulations. We provide a suite of functions that will add a known amount of signal to a real RNA-seq dataset. The advantage of using this approach over simulating under a theoretical distribution is that common/annoying aspects of the data are more preserved, giving a more realistic evaluation of your method. The main functions are select_counts(), thin_diff(), thin_lib(), thin_gene(), thin_2group(), thin_all(), and effective_cor(). See Gerard (2020) <doi:10.1186/s12859-020-3450-9> for details on the implemented methods. |
Authors: | David Gerard [aut, cre] |
Maintainer: | David Gerard <[email protected]> |
License: | GPL-3 |
Version: | 1.2.4 |
Built: | 2024-12-14 04:37:51 UTC |
Source: | https://github.com/dcgerard/seqgendiff |
This package is designed to take real RNA-seq data and alter it by adding a known amount of signal. You can then use this modified dataset in simulation studies for differential expression analysis, factor analysis, confounder adjustment, or library size adjustment. The advantage of this way of simulating data is that you can see how your method behaves when the simulated data exhibit common (and annoying) features of real data. For example, in the real world data are not normally (or negative binomially) distributed and unobserved confounding is a major issue. This package will simulate data that exhibit these characteristics. The methods used in this package are described in detail in Gerard (2020).
select_counts()
Subsample the columns and rows of a real RNA-seq count matrix. You would then feed this sub-matrix into one of the thinning functions below.
thin_diff()
The function most users should be using for general-purpose binomial thinning. For the special applications of the two-group model or library/gene thinning, see the functions listed below.
thin_2group()
The specific application of thinning in the two-group model.
thin_lib()
The specific application of library size thinning.
thin_gene()
The specific application of total gene expression thinning.
thin_all()
The specific application of thinning all counts.
effective_cor()
Returns an estimate of the actual correlation between surrogate variables and a user-specified design matrix.
ThinDataToSummarizedExperiment()
Converts a ThinData object to a SummarizedExperiment object.
ThinDataToDESeqDataSet()
Converts a ThinData object to a DESeqDataSet object.
David Gerard
Gerard, D (2020). "Data-based RNA-seq simulations by binomial thinning." BMC Bioinformatics. 21(1), 206. doi:10.1186/s12859-020-3450-9.
Useful links:
We extract latent factors from the log of mat
using an SVD, then
generate an underlying group-assignment variable from a conditional
normal distribution (conditional on the latent factors). This underlying
group-assignment variable is used to assign groups.
corassign(mat, nfac = NULL, corvec = NULL, return = c("group", "full"))
corassign(mat, nfac = NULL, corvec = NULL, return = c("group", "full"))
mat |
A matrix of count data. The rows index the individuals and the columns index the genes. |
nfac |
The number of latent factors. If |
corvec |
The vector of correlations. |
return |
What should we return? Just the group assignment
( |
If nfac
is provided, then corvec
must be the same length as nfac
.
If nfac
is not provided, then it is assumed that the first nfac
elements of corvec
are the underlying correlations, if nfac
turns out to be
smaller than the length of corvec
. If nfac
turns
out to be larger than the length of corvec
, then the factors without
defined correlations are assumed to have correlation 0.
A list with some or all of the following elements:
x
The vector of group assignments. 0L
indicates
membership to one group and 1L
indicates membership to
the other group.
nfac
The number of assumed latent factors.
facmat
A matrix, whose columns contain the latent factors.
groupfac
The underlying group-assignment factor.
corvec
The correlation vector. Note that this is the
correlation between random variables observed in groupfac
and facmat
,
If return = "group"
, then the list only contains x
.
David Gerard
A. Onatski (2010), Determining the number of factors from empirical distribution of eigenvalues. The Review of Economics and Statistics 92(4).
## Simulate data from given matrix of counts ## In practice, you would obtain Y from a real dataset, not simulate it. set.seed(1) nsamp <- 1000 ngene <- 10 Y <- matrix(stats::rpois(nsamp * ngene, lambda = 50), nrow = ngene) ## Set target correlation to be 0.9 and nfac to be 1 corvec <- 0.9 nfac <- 1 ## Group assignment cout <- corassign(mat = t(Y), nfac = nfac, corvec = corvec, return = "full") ## Correlation between facmat and groupfac should be about 0.9 cor(cout$facmat, cout$groupfac) ## Correlation between facmat and x should be about 0.9 * sqrt(2 / pi) cor(cout$facmat, cout$x) corvec * sqrt(2 / pi)
## Simulate data from given matrix of counts ## In practice, you would obtain Y from a real dataset, not simulate it. set.seed(1) nsamp <- 1000 ngene <- 10 Y <- matrix(stats::rpois(nsamp * ngene, lambda = 50), nrow = ngene) ## Set target correlation to be 0.9 and nfac to be 1 corvec <- 0.9 nfac <- 1 ## Group assignment cout <- corassign(mat = t(Y), nfac = nfac, corvec = corvec, return = "full") ## Correlation between facmat and groupfac should be about 0.9 cor(cout$facmat, cout$groupfac) ## Correlation between facmat and x should be about 0.9 * sqrt(2 / pi) cor(cout$facmat, cout$x) corvec * sqrt(2 / pi)
Will return the estimated correlation between the design matrix and the surrogate variables when you assign a target correlation. The method is described in detail in Gerard (2020).
effective_cor( design_perm, sv, target_cor, calc_first = c("cor", "mean"), method = c("hungarian", "marriage"), iternum = 1000 )
effective_cor( design_perm, sv, target_cor, calc_first = c("cor", "mean"), method = c("hungarian", "marriage"), iternum = 1000 )
design_perm |
A numeric design matrix whose rows are to be permuted (thus controlling the amount by which they are correlated with the surrogate variables). The rows index the samples and the columns index the variables. The intercept should not be included (though see Section "Unestimable Components"). |
sv |
A matrix of surrogate variables |
target_cor |
A numeric matrix of target correlations between the
variables in |
calc_first |
Should we calculate the correlation of the mean
|
method |
Should we use the Gale-Shapley algorithm
for stable marriages ( |
iternum |
The total number of simulated correlations to consider. |
This function permutes the rows of design_perm
many times, each
time calculating the Pearson correlation between the columns of
design_perm
and the columns of sv
. It then returns the
averages of these Pearson correlations. The permutation is done
using permute_design
.
A matrix of correlations. The rows index the observed covariates
and the columns index the surrogate variables. Element (i, j) is
the estimated correlation between the ith variable in
design_perm
and the jth variable in sv
.
David Gerard
Gale, David, and Lloyd S. Shapley. "College admissions and the stability of marriage." The American Mathematical Monthly 69, no. 1 (1962): 9-15. doi:10.1080/00029890.1962.11989827.
Gerard, D (2020). "Data-based RNA-seq simulations by binomial thinning." BMC Bioinformatics. 21(1), 206. doi:10.1186/s12859-020-3450-9.
Hornik K (2005). "A CLUE for CLUster Ensembles." Journal of Statistical Software, 14(12). doi:10.18637/jss.v014.i12. doi:10.18637/jss.v014.i12.
C. Papadimitriou and K. Steiglitz (1982), Combinatorial Optimization: Algorithms and Complexity. Englewood Cliffs: Prentice Hall.
## Generate the design matrices and set target correlation ----------------- n <- 10 design_perm <- cbind(rep(c(0, 1), each = n / 2), rep(c(0, 1), length.out = n)) sv <- matrix(rnorm(n)) target_cor <- matrix(c(0.9, 0.1), ncol = 1) ## Get estimated true correlation ------------------------------------------ ## You should use a much larger iternum in practice effective_cor(design_perm = design_perm, sv = sv, target_cor = target_cor, iternum = 10)
## Generate the design matrices and set target correlation ----------------- n <- 10 design_perm <- cbind(rep(c(0, 1), each = n / 2), rep(c(0, 1), length.out = n)) sv <- matrix(rnorm(n)) target_cor <- matrix(c(0.9, 0.1), ncol = 1) ## Get estimated true correlation ------------------------------------------ ## You should use a much larger iternum in practice effective_cor(design_perm = design_perm, sv = sv, target_cor = target_cor, iternum = 10)
This will use either sva
or an SVD on the residuals
of a regression of mat
on design_obs
to estimate the
surrogate variables.
est_sv(mat, n_sv, design_obs, use_sva = FALSE)
est_sv(mat, n_sv, design_obs, use_sva = FALSE)
mat |
A numeric matrix of RNA-seq counts. The rows index the genes and the columns index the samples. |
n_sv |
The number of surrogate variables. |
design_obs |
A numeric matrix of observed covariates that are NOT to
be a part of the signal generating process. Only used in estimating the
surrogate variables (if |
use_sva |
A logical. Should we use surrogate variable analysis
(Leek and Storey, 2008) using |
A matrix of estimated surrogate variables. The columns index the surrogate variables and the rows index the individuals. The surrogate variables are centered and scaled to have mean 0 and variance 1.
David Gerard
Shrinks the target correlation using a uniform scaling factor so that the overall correlation matrix is positive semi-definite. The method is described in detail in Gerard (2020).
fix_cor(design_perm, target_cor, num_steps = 51)
fix_cor(design_perm, target_cor, num_steps = 51)
design_perm |
A numeric design matrix whose rows are to be permuted (thus controlling the amount by which they are correlated with the surrogate variables). The rows index the samples and the columns index the variables. The intercept should not be included (though see Section "Unestimable Components"). |
target_cor |
A numeric matrix of target correlations between the
variables in |
num_steps |
The number of steps between 0 and 1 to take in the
grid search for the shrinkage factor. The step-size would be
|
Let =
cor(design_perm)
. Let =
target_cor
.
Then the overall correlation matrix is:
This function applies a multiplicative scaling factor to until
the above matrix is positive semi-definite. That is, it finds
between 0 and 1 such that
is positive semi-definite.
A matrix of correlations the same dimension as target_cor
.
Actually, the returned matrix is a * target_cor
, where a
was determined to make the overall correlation matrix positive
semi-definite.
David Gerard
Gerard, D (2020). "Data-based RNA-seq simulations by binomial thinning." BMC Bioinformatics. 21(1), 206. doi:10.1186/s12859-020-3450-9.
n <- 10 design_perm <- matrix(rep(c(0, 1), length.out = n)) target_cor <- matrix(seq(1, 0, length.out = 10), nrow = 1) new_cor <- seqgendiff:::fix_cor(design_perm = design_perm, target_cor = target_cor) new_cor / target_cor ## In the case of one observed covariate, the requirement is just that ## the sum of squared correlations is less than or equal to one. sum(target_cor ^ 2) sum(new_cor ^ 2)
n <- 10 design_perm <- matrix(rep(c(0, 1), length.out = n)) target_cor <- matrix(seq(1, 0, length.out = 10), nrow = 1) new_cor <- seqgendiff:::fix_cor(design_perm = design_perm, target_cor = target_cor) new_cor / target_cor ## In the case of one observed covariate, the requirement is just that ## the sum of squared correlations is less than or equal to one. sum(target_cor ^ 2) sum(new_cor ^ 2)
Permute the design matrix so that it is approximately correlated with the surrogate variables.
permute_design( design_perm, sv, target_cor, method = c("hungarian", "marriage") )
permute_design( design_perm, sv, target_cor, method = c("hungarian", "marriage") )
design_perm |
A numeric design matrix whose rows are to be permuted (thus controlling the amount by which they are correlated with the surrogate variables). The rows index the samples and the columns index the variables. The intercept should not be included (though see Section "Unestimable Components"). |
sv |
A matrix of surrogate variables |
target_cor |
A numeric matrix of target correlations between the
variables in |
method |
Should we use the Gale-Shapley algorithm
for stable marriages ( |
A list with two elements:
design_perm
A row-permuted version of the user-provided
design_perm
.
latent_var
A matrix of the latent variables on which
design_perm
was matched.
David Gerard
Gale, David, and Lloyd S. Shapley. "College admissions and the stability of marriage." The American Mathematical Monthly 69, no. 1 (1962): 9-15. doi:10.1080/00029890.1962.11989827.
C. Papadimitriou and K. Steiglitz (1982), Combinatorial Optimization: Algorithms and Complexity. Englewood Cliffs: Prentice Hall.
Hornik K (2005). "A CLUE for CLUster Ensembles." Journal of Statistical Software, 14(12). doi:10.18637/jss.v014.i12. doi:10.18637/jss.v014.i12.
This is now defunct. Please try out select_counts
and
thin_2group
.
poisthin( mat, nsamp = nrow(mat), ngene = ncol(mat), gselect = c("max", "random", "rand_max", "custom", "mean_max"), gvec = NULL, skip_gene = 0L, signal_fun = stats::rnorm, signal_params = list(mean = 0, sd = 1), prop_null = 1, alpha = 0, group_assign = c("frac", "random", "cor"), group_prop = 0.5, corvec = NULL )
poisthin( mat, nsamp = nrow(mat), ngene = ncol(mat), gselect = c("max", "random", "rand_max", "custom", "mean_max"), gvec = NULL, skip_gene = 0L, signal_fun = stats::rnorm, signal_params = list(mean = 0, sd = 1), prop_null = 1, alpha = 0, group_assign = c("frac", "random", "cor"), group_prop = 0.5, corvec = NULL )
mat |
A matrix of count data. The rows index the individuals and the columns index the genes. |
nsamp |
The number of samples to select from |
ngene |
The number of genes to select from |
gselect |
How should we select the subset of genes? Should we choose
the |
gvec |
A logical of length |
skip_gene |
The number of maximally expressed genes to skip.
Not used if |
signal_fun |
A function that returns the signal. This should take as
input |
signal_params |
A list of additional arguments to pass to |
prop_null |
The proportion of genes that are null. |
alpha |
If |
group_assign |
How should we assign groups? Exactly specifying the
proportion of individuals in each group ( |
group_prop |
The proportion of individuals that are in group 1.
This proportion is deterministic if |
corvec |
A vector of correlations. |
Given a matrix of RNA-seq counts, this function will randomly select two groups of samples and add signal to a known proportion of the genes. This signal is the log (base 2) effect size of the group indicator in a linear model. The user may specify the distribution of the effects.
The Poisson thinning approach first randomly assigns samples to be in one of two groups. Then, given this assignment, will Binomially sample counts with a sample size of the gene expression counts and a probability that is a function of the effect size. For details, see Gerard and Stephens (2021).
A list with the following elements:
Y
A matrix of altered counts with nsamp
rows
and ngene
columns.
X
A design matrix. The first column contains a vector ones (for an intercept term) and the second column contains an indicator for group membership.
beta
The approximately true effect sizes of .
corassign
The output from the call to corassign
.
Only returned if group_assign = "cor"
.
David Gerard
Gerard, D., and Stephens, M. (2021). "Unifying and Generalizing Methods for Removing Unwanted Variation Based on Negative Controls." Statistica Sinica, 31(3), 1145-1166 doi:10.5705/ss.202018.0345.
## Simulate data from given matrix of counts ## In practice, you would obtain Y from a real dataset, not simulate it. set.seed(1) nsamp <- 10 ngene <- 1000 Y <- matrix(stats::rpois(nsamp * ngene, lambda = 50), nrow = ngene) ## Apply thinning poisout <- poisthin(mat = t(Y), nsamp = 9, ngene = 999, signal_fun = stats::rnorm, signal_params = list(mean = 0, sd = 1), prop_null = 0.9) ## Dimension of count matrix is smaller. dim(poisout$Y) ## Can verify signal was added by estimating it with lm(). betahat <- coef(lm(log2(poisout$Y + 1) ~ poisout$X[, 2]))[2, ] plot(poisout$beta, betahat, xlab = "Coefficients", ylab = "Estimates") abline(0, 1, col = 2, lty = 2)
## Simulate data from given matrix of counts ## In practice, you would obtain Y from a real dataset, not simulate it. set.seed(1) nsamp <- 10 ngene <- 1000 Y <- matrix(stats::rpois(nsamp * ngene, lambda = 50), nrow = ngene) ## Apply thinning poisout <- poisthin(mat = t(Y), nsamp = 9, ngene = 999, signal_fun = stats::rnorm, signal_params = list(mean = 0, sd = 1), prop_null = 0.9) ## Dimension of count matrix is smaller. dim(poisout$Y) ## Can verify signal was added by estimating it with lm(). betahat <- coef(lm(log2(poisout$Y + 1) ~ poisout$X[, 2]))[2, ] plot(poisout$beta, betahat, xlab = "Coefficients", ylab = "Estimates") abline(0, 1, col = 2, lty = 2)
It is a good idea to subsample (each iteration) the genes and samples from
a real RNA-seq dataset prior to applying thin_diff
(and related functions) so that your conclusions are not dependent on the
specific structure of your dataset. This function is designed to efficiently
do this for you.
select_counts( mat, nsamp = ncol(mat), ngene = nrow(mat), gselect = c("random", "max", "mean_max", "custom"), gvec = NULL, filter_first = FALSE, nskip = 0L )
select_counts( mat, nsamp = ncol(mat), ngene = nrow(mat), gselect = c("random", "max", "mean_max", "custom"), gvec = NULL, filter_first = FALSE, nskip = 0L )
mat |
A numeric matrix of RNA-seq counts. The rows index the genes and the columns index the samples. |
nsamp |
The number of samples (columns) to select from |
ngene |
The number of genes (rows) to select from |
gselect |
How should we select the subset of genes? Options include:
|
gvec |
A logical vector of length |
filter_first |
Should we first filter genes by the method of
Chen et al. (2016) ( |
nskip |
The number of median-maximally expressed genes to skip.
Not used if |
The samples (columns) are chosen randomly, with each sample having
an equal probability of being in the sub-matrix. The genes are selected
according to one of four schemes (see the description of the gselect
argument).
If you have edgeR installed, then some functionality is provided for
filtering out the lowest expressed genes prior to applying subsampling
(see the filter_first
argument).
This filtering scheme is described in Chen et al. (2016).
If you want more control over this filtering, you should use
the filterByExpr
function from edgeR directly. You
can install edgeR by following instructions at
doi:10.18129/B9.bioc.edgeR.
A numeric matrix, which is a ngene
by nsamp
sub-matrix
of mat
. If rownames(mat)
is NULL
, then the
row names of the returned matrix are the indices in mat
of the
selected genes. If colnames(mat)
is NULL
, then the
column names of the returned matrix are the indices in mat
of
the selected samples.
David Gerard
Chen, Yunshun, Aaron TL Lun, and Gordon K. Smyth. "From reads to genes to pathways: differential expression analysis of RNA-Seq experiments using Rsubread and the edgeR quasi-likelihood pipeline." F1000Research 5 (2016). doi:10.12688/f1000research.8987.2.
## Simulate data from given matrix of counts ## In practice, you would obtain mat from a real dataset, not simulate it. set.seed(1) n <- 100 p <- 1000 mat <- matrix(stats::rpois(n * p, lambda = 50), nrow = p) ## Subsample the matrix, then feed it into a thinning function submat <- select_counts(mat = mat, nsamp = 10, ngene = 100) thout <- thin_2group(mat = submat, prop_null = 0.5) ## The rownames and colnames (if NULL in mat) tell you which genes/samples ## were selected. rownames(submat) colnames(submat)
## Simulate data from given matrix of counts ## In practice, you would obtain mat from a real dataset, not simulate it. set.seed(1) n <- 100 p <- 1000 mat <- matrix(stats::rpois(n * p, lambda = 50), nrow = p) ## Subsample the matrix, then feed it into a thinning function submat <- select_counts(mat = mat, nsamp = 10, ngene = 100) thout <- thin_2group(mat = submat, prop_null = 0.5) ## The rownames and colnames (if NULL in mat) tell you which genes/samples ## were selected. rownames(submat) colnames(submat)
Provide summary output of a ThinData S3 object.
## S3 method for class 'ThinData' summary(object, ...)
## S3 method for class 'ThinData' summary(object, ...)
object |
A ThinData S3 object. This is generally output by either
|
... |
Not used. |
Returns nothing. Prints out some summary information on
object
.
David Gerard
Given a matrix of real RNA-seq counts, this function will
randomly assign samples to one of two groups, draw
the log2-fold change in expression between two groups for each gene,
and add this signal to the RNA-seq counts matrix. The user may specify
the proportion of samples in each group, the proportion of null genes
(where the log2-fold change is 0),
and the signal function. This is a specific application of the
general binomial thinning approach implemented in thin_diff
.
thin_2group( mat, prop_null = 1, signal_fun = stats::rnorm, signal_params = list(mean = 0, sd = 1), group_prop = 0.5, corvec = NULL, alpha = 0, permute_method = c("hungarian", "marriage"), type = c("thin", "mult") )
thin_2group( mat, prop_null = 1, signal_fun = stats::rnorm, signal_params = list(mean = 0, sd = 1), group_prop = 0.5, corvec = NULL, alpha = 0, permute_method = c("hungarian", "marriage"), type = c("thin", "mult") )
mat |
A numeric matrix of RNA-seq counts. The rows index the genes and the columns index the samples. |
prop_null |
The proportion of genes that are null. |
signal_fun |
A function that returns the signal. This should take as
input |
signal_params |
A list of additional arguments to pass to
|
group_prop |
The proportion of individuals that are in group 1. |
corvec |
A vector of target correlations. |
alpha |
The scaling factor for the signal distribution from
Stephens (2016). If |
permute_method |
Should we use the Gale-Shapley algorithm
for stable marriages ( |
type |
Should we apply binomial thinning ( |
The specific application of binomial thinning to the two-group model was used in Gerard and Stephens (2018) and Gerard and Stephens (2021). This is a specific case of the general method described in Gerard (2020).
A list-like S3 object of class ThinData
.
Components include some or all of the following:
mat
The modified matrix of counts.
designmat
The design matrix of variables used to simulate
signal. This is made by column-binding design_fixed
and the
permuted version of design_perm
.
coefmat
A matrix of coefficients corresponding to
designmat
.
design_obs
Additional variables that should be included in
your design matrix in downstream fittings. This is made by
column-binding the vector of 1's with design_obs
.
sv
A matrix of estimated surrogate variables. In simulation studies you would probably leave this out and estimate your own surrogate variables.
cormat
A matrix of target correlations between the
surrogate variables and the permuted variables in the design matrix.
This might be different from the target_cor
you input because
we pass it through fix_cor
to ensure
positive semi-definiteness of the resulting covariance matrix.
matching_var
A matrix of simulated variables used to
permute design_perm
if the target_cor
is not
NULL
.
David Gerard
Gale, David, and Lloyd S. Shapley. "College admissions and the stability of marriage." The American Mathematical Monthly 69, no. 1 (1962): 9-15. doi:10.1080/00029890.1962.11989827.
Gerard, D., and Stephens, M. (2021). "Unifying and Generalizing Methods for Removing Unwanted Variation Based on Negative Controls." Statistica Sinica, 31(3), 1145-1166 doi:10.5705/ss.202018.0345.
David Gerard and Matthew Stephens (2018). "Empirical Bayes shrinkage and false discovery rate estimation, allowing for unwanted variation." Biostatistics, doi:10.1093/biostatistics/kxy029.
Gerard, D (2020). "Data-based RNA-seq simulations by binomial thinning." BMC Bioinformatics. 21(1), 206. doi:10.1186/s12859-020-3450-9.
Hornik K (2005). "A CLUE for CLUster Ensembles." Journal of Statistical Software, 14(12). doi:10.18637/jss.v014.i12. doi:10.18637/jss.v014.i12.
C. Papadimitriou and K. Steiglitz (1982), Combinatorial Optimization: Algorithms and Complexity. Englewood Cliffs: Prentice Hall.
Stephens, Matthew. "False discovery rates: a new deal." Biostatistics 18, no. 2 (2016): 275-294. doi:10.1093/biostatistics/kxw041.
Wakefield, Jon. "Bayes factors for genome-wide association studies: comparison with P-values." Genetic epidemiology 33, no. 1 (2009): 79-86. doi:10.1002/gepi.20359.
select_counts
For subsampling the rows and columns of your real RNA-seq count matrix prior to applying binomial thinning.
thin_diff
For the more general thinning approach.
ThinDataToSummarizedExperiment
For converting a ThinData object to a SummarizedExperiment object.
ThinDataToDESeqDataSet
For converting a ThinData object to a DESeqDataSet object.
## Simulate data from given matrix of counts ## In practice, you would obtain Y from a real dataset, not simulate it. set.seed(1) nsamp <- 10 ngene <- 1000 Y <- matrix(stats::rpois(nsamp * ngene, lambda = 50), nrow = ngene) thinout <- thin_2group(mat = Y, prop_null = 0.9, signal_fun = stats::rexp, signal_params = list(rate = 0.5)) ## 90 percent of genes are null mean(abs(thinout$coef) < 10^-6) ## Check the estimates of the log2-fold change Ynew <- log2(t(thinout$mat + 0.5)) X <- thinout$designmat Bhat <- coef(lm(Ynew ~ X))["X", ] plot(thinout$coefmat, Bhat, xlab = "log2-fold change", ylab = "Estimated log2-fold change") abline(0, 1, col = 2, lwd = 2)
## Simulate data from given matrix of counts ## In practice, you would obtain Y from a real dataset, not simulate it. set.seed(1) nsamp <- 10 ngene <- 1000 Y <- matrix(stats::rpois(nsamp * ngene, lambda = 50), nrow = ngene) thinout <- thin_2group(mat = Y, prop_null = 0.9, signal_fun = stats::rexp, signal_params = list(rate = 0.5)) ## 90 percent of genes are null mean(abs(thinout$coef) < 10^-6) ## Check the estimates of the log2-fold change Ynew <- log2(t(thinout$mat + 0.5)) X <- thinout$designmat Bhat <- coef(lm(Ynew ~ X))["X", ] plot(thinout$coefmat, Bhat, xlab = "log2-fold change", ylab = "Estimated log2-fold change") abline(0, 1, col = 2, lwd = 2)
Given a matrix of real RNA-seq counts, this function will apply a
thinning factor uniformly to every count in this matrix. This uniformly
lowers the read-depth for the entire dataset. The thinning factor should
be provided on the log2-scale. This is a specific application of the
binomial thinning approach in thin_diff
. Though this particular
form of thinning was used by Robinson and Storey (2014) in the context
of deriving read-depth suggestions. It is also
described in detail in Gerard (2020).
thin_all(mat, thinlog2, type = c("thin", "mult"))
thin_all(mat, thinlog2, type = c("thin", "mult"))
mat |
A numeric matrix of RNA-seq counts. The rows index the genes and the columns index the samples. |
thinlog2 |
A numeric scalar. This is the amount to shrink each count
in |
type |
Should we apply binomial thinning ( |
A list-like S3 object of class ThinData
.
Components include some or all of the following:
mat
The modified matrix of counts.
designmat
The design matrix of variables used to simulate
signal. This is made by column-binding design_fixed
and the
permuted version of design_perm
.
coefmat
A matrix of coefficients corresponding to
designmat
.
design_obs
Additional variables that should be included in
your design matrix in downstream fittings. This is made by
column-binding the vector of 1's with design_obs
.
sv
A matrix of estimated surrogate variables. In simulation studies you would probably leave this out and estimate your own surrogate variables.
cormat
A matrix of target correlations between the
surrogate variables and the permuted variables in the design matrix.
This might be different from the target_cor
you input because
we pass it through fix_cor
to ensure
positive semi-definiteness of the resulting covariance matrix.
matching_var
A matrix of simulated variables used to
permute design_perm
if the target_cor
is not
NULL
.
David Gerard
Gerard, D (2020). "Data-based RNA-seq simulations by binomial thinning." BMC Bioinformatics. 21(1), 206. doi:10.1186/s12859-020-3450-9.
Robinson, David G., and John D. Storey. "subSeq: determining appropriate sequencing depth through efficient read subsampling." Bioinformatics 30, no. 23 (2014): 3424-3426. doi:10.1093/bioinformatics/btu552.
select_counts
For subsampling the rows and columns of your real RNA-seq count matrix prior to applying binomial thinning.
thin_diff
For the more general thinning approach.
thin_lib
For thinning sample-wise.
thin_gene
For thinning gene-wise.
ThinDataToSummarizedExperiment
For converting a ThinData object to a SummarizedExperiment object.
ThinDataToDESeqDataSet
For converting a ThinData object to a DESeqDataSet object.
## Generate count data and set thinning factor ## In practice, you would obtain mat from a real dataset, not simulate it. set.seed(1) n <- 10 p <- 1000 lambda <- 1000 mat <- matrix(lambda, ncol = n, nrow = p) thinlog2 <- 1 ## Thin read-depths thout <- thin_all(mat = mat, thinlog2 = thinlog2) ## Compare empirical and theoretical proportions mean(thout$mat) / lambda 2 ^ -thinlog2
## Generate count data and set thinning factor ## In practice, you would obtain mat from a real dataset, not simulate it. set.seed(1) n <- 10 p <- 1000 lambda <- 1000 mat <- matrix(lambda, ncol = n, nrow = p) thinlog2 <- 1 ## Thin read-depths thout <- thin_all(mat = mat, thinlog2 = thinlog2) ## Compare empirical and theoretical proportions mean(thout$mat) / lambda 2 ^ -thinlog2
Given a matrix of counts () where
,
a design matrix (
), and a matrix of coefficients (
),
thin_diff
will generate a new matrix of counts such that
, where
is some vector
of intercept coefficients. This function is used by all other
thinning functions. The method is
described in detail in Gerard (2020).
thin_base(mat, designmat, coefmat, relative = TRUE, type = c("thin", "mult"))
thin_base(mat, designmat, coefmat, relative = TRUE, type = c("thin", "mult"))
mat |
A numeric matrix of RNA-seq counts. The rows index the genes and the columns index the samples. |
designmat |
A design matrix. The rows index the samples and the columns index the variables. The intercept should not be included. |
coefmat |
A matrix of coefficients. The rows index the genes and the columns index the samples. |
relative |
A logical. Should we apply relative thinning ( |
type |
Should we apply binomial thinning ( |
A matrix of new RNA-seq read-counts. This matrix has the signal
added from designmat
and coefmat
.
David Gerard
Gerard, D (2020). "Data-based RNA-seq simulations by binomial thinning." BMC Bioinformatics. 21(1), 206. doi:10.1186/s12859-020-3450-9.
select_counts
For subsampling the rows and columns of your real RNA-seq count matrix prior to applying binomial thinning.
thin_diff
For the function most users should be using for general-purpose binomial thinning.
thin_2group
For the specific application of thinning in the two-group model.
thin_lib
For the specific application of library size thinning.
thin_gene
For the specific application of total gene expression thinning.
thin_all
For the specific application of thinning all counts uniformly.
## Simulate data from given matrix of counts ## In practice, you would obtain Y from a real dataset, not simulate it. set.seed(1) nsamp <- 10 ngene <- 1000 Y <- matrix(stats::rpois(nsamp * ngene, lambda = 100), nrow = ngene) X <- matrix(rep(c(0, 1), length.out = nsamp)) B <- matrix(seq(3, 0, length.out = ngene)) Ynew <- thin_base(mat = Y, designmat = X, coefmat = B) ## Demonstrate how the log2 effect size is B Bhat <- coefficients(lm(t(log2(Ynew)) ~ X))["X", ] plot(B, Bhat, xlab = "Coefficients", ylab = "Coefficient Estimates") abline(0, 1, col = 2, lwd = 2)
## Simulate data from given matrix of counts ## In practice, you would obtain Y from a real dataset, not simulate it. set.seed(1) nsamp <- 10 ngene <- 1000 Y <- matrix(stats::rpois(nsamp * ngene, lambda = 100), nrow = ngene) X <- matrix(rep(c(0, 1), length.out = nsamp)) B <- matrix(seq(3, 0, length.out = ngene)) Ynew <- thin_base(mat = Y, designmat = X, coefmat = B) ## Demonstrate how the log2 effect size is B Bhat <- coefficients(lm(t(log2(Ynew)) ~ X))["X", ] plot(B, Bhat, xlab = "Coefficients", ylab = "Coefficient Estimates") abline(0, 1, col = 2, lwd = 2)
Given a matrix of real RNA-seq counts, this function will add a known amount of signal to the count matrix. This signal is given in the form of a Poisson / negative binomial / mixture of negative binomials generalized linear model with a log (base 2) link. The user may specify any arbitrary design matrix and coefficient matrix. The user may also control for the amount of correlation between the observed covariates and any unobserved surrogate variables. The method is described in detail in Gerard (2020).
thin_diff( mat, design_fixed = NULL, coef_fixed = NULL, design_perm = NULL, coef_perm = NULL, target_cor = NULL, use_sva = FALSE, design_obs = NULL, relative = TRUE, change_colnames = TRUE, permute_method = c("hungarian", "marriage"), type = c("thin", "mult") )
thin_diff( mat, design_fixed = NULL, coef_fixed = NULL, design_perm = NULL, coef_perm = NULL, target_cor = NULL, use_sva = FALSE, design_obs = NULL, relative = TRUE, change_colnames = TRUE, permute_method = c("hungarian", "marriage"), type = c("thin", "mult") )
mat |
A numeric matrix of RNA-seq counts. The rows index the genes and the columns index the samples. |
design_fixed |
A numeric design matrix whose rows are fixed and not to be permuted. The rows index the samples and the columns index the variables. The intercept should not be included (though see Section "Unestimable Components"). |
coef_fixed |
A numeric matrix. The coefficients corresponding to
|
design_perm |
A numeric design matrix whose rows are to be permuted (thus controlling the amount by which they are correlated with the surrogate variables). The rows index the samples and the columns index the variables. The intercept should not be included (though see Section "Unestimable Components"). |
coef_perm |
A numeric matrix. The coefficients corresponding to
|
target_cor |
A numeric matrix of target correlations between the
variables in |
use_sva |
A logical. Should we use surrogate variable analysis
(Leek and Storey, 2008) using |
design_obs |
A numeric matrix of observed covariates that are NOT to
be a part of the signal generating process. Only used in estimating the
surrogate variables (if |
relative |
A logical. Should we apply relative thinning ( |
change_colnames |
A logical. Should we change the column-names
of the design matrices ( |
permute_method |
Should we use the Gale-Shapley algorithm
for stable marriages ( |
type |
Should we apply binomial thinning ( |
A list-like S3 object of class ThinData
.
Components include some or all of the following:
mat
The modified matrix of counts.
designmat
The design matrix of variables used to simulate
signal. This is made by column-binding design_fixed
and the
permuted version of design_perm
.
coefmat
A matrix of coefficients corresponding to
designmat
.
design_obs
Additional variables that should be included in
your design matrix in downstream fittings. This is made by
column-binding the vector of 1's with design_obs
.
sv
A matrix of estimated surrogate variables. In simulation studies you would probably leave this out and estimate your own surrogate variables.
cormat
A matrix of target correlations between the
surrogate variables and the permuted variables in the design matrix.
This might be different from the target_cor
you input because
we pass it through fix_cor
to ensure
positive semi-definiteness of the resulting covariance matrix.
matching_var
A matrix of simulated variables used to
permute design_perm
if the target_cor
is not
NULL
.
Let
Be the number of samples.
Be the number of genes.
Be an by
matrix of real RNA-seq counts.
This is
mat
.
Be an by
user-provided design matrix.
This is
design_fixed
.
Be an by
user-provided design matrix.
This is
design_perm
.
Be an by
matrix of known covariates.
This is
design_obs
.
Be an by
matrix of unobserved surrogate
variables. This is estimated when
target_cor
is not
NULL
.
Be a by
of additional (unknown)
unwanted variation.
We assume that is Poisson distributed given
and
such that
thin_diff()
will take as input ,
,
,
, and will output a
and
such that
is Poisson distributed given
,
,
,
,
, and
such that
where is an
by
permutation matrix.
is randomly
drawn so that
and
are correlated approximately according
to the target correlation matrix.
The Poisson assumption may be generalized to a mixture of negative binomials.
It is possible to include an intercept term or a column from
design_obs
into either design_fixed
or design_perm
.
This will not produce an error and the specified thinning will be applied.
However, If any column of design_fixed
or
design_perm
is a vector of ones or contains a column from
design_obs
, then the corresponding columns in coef_fixed
or coef_perm
cannot be estimated by any method. This is
represented in the output by having duplicate columns in
designmat
and design_obs
.
Including duplicate columns in design_fixed
and design_perm
is also allowed but, again, will produce unestimable coefficients.
Including an intercept term in design_obs
will produce an error if
you are specifying correlated surrogate variables.
David Gerard
Gale, David, and Lloyd S. Shapley. "College admissions and the stability of marriage." The American Mathematical Monthly 69, no. 1 (1962): 9-15. doi:10.1080/00029890.1962.11989827.
Gerard, D (2020). "Data-based RNA-seq simulations by binomial thinning." BMC Bioinformatics. 21(1), 206. doi:10.1186/s12859-020-3450-9.
Hornik K (2005). "A CLUE for CLUster Ensembles." Journal of Statistical Software, 14(12). doi:10.18637/jss.v014.i12.
Leek, Jeffrey T., and John D. Storey. "A general framework for multiple testing dependence." Proceedings of the National Academy of Sciences 105, no. 48 (2008): 18718-18723. doi:10.1073/pnas.0808709105.
C. Papadimitriou and K. Steiglitz (1982), Combinatorial Optimization: Algorithms and Complexity. Englewood Cliffs: Prentice Hall.
select_counts
For subsampling the rows and columns of your real RNA-seq count matrix prior to applying binomial thinning.
thin_2group
For the specific application of
thin_diff
to the two-group model.
thin_lib
For the specific application of
thin_diff
to library size thinning.
thin_gene
For the specific application of
thin_diff
to total gene expression thinning.
thin_all
For the specific application of
thin_diff
to thinning all counts uniformly.
thin_base
For the underlying thinning function
used in thin_diff
.
sva
For the implementation of surrogate variable analysis.
ThinDataToSummarizedExperiment
For converting a ThinData object to a SummarizedExperiment object.
ThinDataToDESeqDataSet
For converting a ThinData object to a DESeqDataSet object.
## Generate simulated data with surrogate variables ## In practice, you would obtain mat from a real dataset, not simulate it. set.seed(1) n <- 10 p <- 1000 Z <- matrix(abs(rnorm(n, sd = 4))) alpha <- matrix(abs(rnorm(p, sd = 1))) mat <- round(2^(alpha %*% t(Z) + abs(matrix(rnorm(n * p, sd = 5), nrow = p, ncol = n)))) ## Choose simulation parameters design_perm <- cbind(rep(c(0, 1), length.out = n), runif(n)) coef_perm <- matrix(rnorm(p * ncol(design_perm), sd = 6), nrow = p) ## Specify one surrogate variable (number of columns in taget_cor), ## highly correlated with first observed covariate and uncorrelated ## with second observed covariate target_cor <- matrix(c(0.9, 0)) ## Thin thout <- thin_diff(mat = mat, design_perm = design_perm, coef_perm = coef_perm, target_cor = target_cor) ## target_cor approximates correlation between estimated surrogate variable ## and matching variable. cor(thout$matching_var, thout$sv) ## Estimated surrogate variable is associated with true surrogate variable ## (because the signal is strong in this case) plot(Z, thout$sv, xlab = "True SV", ylab = "Estimated SV") ## So target_cor approximates correlation between surrogate variable and ## matching variables cor(thout$matching_var, Z) ## Correlation between permuted covariates and surrogate variables are less ## close to target_cor cor(thout$designmat, Z) ## Estimated signal is correlated to true single. First variable is slightly ## biased because the surrogate variable is not included. Ynew <- log2(t(thout$mat) + 0.5) X <- thout$designmat coef_est <- t(coef(lm(Ynew ~ X))[2:3, ]) plot(thout$coefmat[, 1], coef_est[, 1], main = "First Variable", xlab = "Coefficient", ylab = "Estimated Coefficient") abline(0, 1, col = 2, lwd = 2) plot(thout$coefmat[, 2], coef_est[, 2], main = "Second Variable", xlab = "Coefficient", ylab = "Estimated Coefficient") abline(0, 1, col = 2, lwd = 2) ## But estimated coefficient of the first variable is slightly closer when ## the surrogate variable is included. Ynew <- log2(t(thout$mat) + 0.5) X <- cbind(thout$designmat, thout$sv) coef_est <- t(coef(lm(Ynew ~ X))[2:3, ]) plot(thout$coefmat[, 1], coef_est[, 1], main = "First Variable", xlab = "Coefficient", ylab = "Estimated Coefficient") abline(0, 1, col = 2, lwd = 2) plot(thout$coefmat[, 2], coef_est[, 2], main = "Second Variable", xlab = "Coefficient", ylab = "Estimated Coefficient") abline(0, 1, col = 2, lwd = 2)
## Generate simulated data with surrogate variables ## In practice, you would obtain mat from a real dataset, not simulate it. set.seed(1) n <- 10 p <- 1000 Z <- matrix(abs(rnorm(n, sd = 4))) alpha <- matrix(abs(rnorm(p, sd = 1))) mat <- round(2^(alpha %*% t(Z) + abs(matrix(rnorm(n * p, sd = 5), nrow = p, ncol = n)))) ## Choose simulation parameters design_perm <- cbind(rep(c(0, 1), length.out = n), runif(n)) coef_perm <- matrix(rnorm(p * ncol(design_perm), sd = 6), nrow = p) ## Specify one surrogate variable (number of columns in taget_cor), ## highly correlated with first observed covariate and uncorrelated ## with second observed covariate target_cor <- matrix(c(0.9, 0)) ## Thin thout <- thin_diff(mat = mat, design_perm = design_perm, coef_perm = coef_perm, target_cor = target_cor) ## target_cor approximates correlation between estimated surrogate variable ## and matching variable. cor(thout$matching_var, thout$sv) ## Estimated surrogate variable is associated with true surrogate variable ## (because the signal is strong in this case) plot(Z, thout$sv, xlab = "True SV", ylab = "Estimated SV") ## So target_cor approximates correlation between surrogate variable and ## matching variables cor(thout$matching_var, Z) ## Correlation between permuted covariates and surrogate variables are less ## close to target_cor cor(thout$designmat, Z) ## Estimated signal is correlated to true single. First variable is slightly ## biased because the surrogate variable is not included. Ynew <- log2(t(thout$mat) + 0.5) X <- thout$designmat coef_est <- t(coef(lm(Ynew ~ X))[2:3, ]) plot(thout$coefmat[, 1], coef_est[, 1], main = "First Variable", xlab = "Coefficient", ylab = "Estimated Coefficient") abline(0, 1, col = 2, lwd = 2) plot(thout$coefmat[, 2], coef_est[, 2], main = "Second Variable", xlab = "Coefficient", ylab = "Estimated Coefficient") abline(0, 1, col = 2, lwd = 2) ## But estimated coefficient of the first variable is slightly closer when ## the surrogate variable is included. Ynew <- log2(t(thout$mat) + 0.5) X <- cbind(thout$designmat, thout$sv) coef_est <- t(coef(lm(Ynew ~ X))[2:3, ]) plot(thout$coefmat[, 1], coef_est[, 1], main = "First Variable", xlab = "Coefficient", ylab = "Estimated Coefficient") abline(0, 1, col = 2, lwd = 2) plot(thout$coefmat[, 2], coef_est[, 2], main = "Second Variable", xlab = "Coefficient", ylab = "Estimated Coefficient") abline(0, 1, col = 2, lwd = 2)
Given a matrix of real RNA-seq counts, this function will apply a
separate, user-provided thinning factor to each gene. This uniformly
lowers the counts for all samples in a gene. The thinning factor
should be provided on the log2-scale. This is a specific application
of the binomial thinning approach in thin_diff
. The method is
described in detail in Gerard (2020).
thin_gene(mat, thinlog2, relative = FALSE, type = c("thin", "mult"))
thin_gene(mat, thinlog2, relative = FALSE, type = c("thin", "mult"))
mat |
A numeric matrix of RNA-seq counts. The rows index the genes and the columns index the samples. |
thinlog2 |
A vector of numerics. Element i is the amount to thin (on the log2 scale) for gene i. For example, a value of 0 means that we do not thin, a value of 1 means that we thin by a factor of 2, a value of 2 means we thin by a factor of 4, etc. |
relative |
A logical. Should we apply relative thinning ( |
type |
Should we apply binomial thinning ( |
A list-like S3 object of class ThinData
.
Components include some or all of the following:
mat
The modified matrix of counts.
designmat
The design matrix of variables used to simulate
signal. This is made by column-binding design_fixed
and the
permuted version of design_perm
.
coefmat
A matrix of coefficients corresponding to
designmat
.
design_obs
Additional variables that should be included in
your design matrix in downstream fittings. This is made by
column-binding the vector of 1's with design_obs
.
sv
A matrix of estimated surrogate variables. In simulation studies you would probably leave this out and estimate your own surrogate variables.
cormat
A matrix of target correlations between the
surrogate variables and the permuted variables in the design matrix.
This might be different from the target_cor
you input because
we pass it through fix_cor
to ensure
positive semi-definiteness of the resulting covariance matrix.
matching_var
A matrix of simulated variables used to
permute design_perm
if the target_cor
is not
NULL
.
David Gerard
Gerard, D (2020). "Data-based RNA-seq simulations by binomial thinning." BMC Bioinformatics. 21(1), 206. doi:10.1186/s12859-020-3450-9.
select_counts
For subsampling the rows and columns of your real RNA-seq count matrix prior to applying binomial thinning.
thin_diff
For the more general thinning approach.
thin_lib
For thinning sample-wise instead of gene-wise.
thin_all
For thinning all counts uniformly.
ThinDataToSummarizedExperiment
For converting a ThinData object to a SummarizedExperiment object.
ThinDataToDESeqDataSet
For converting a ThinData object to a DESeqDataSet object.
## Generate count data and thinning factors ## In practice, you would obtain mat from a real dataset, not simulate it. set.seed(1) n <- 10 p <- 1000 lambda <- 1000 mat <- matrix(lambda, ncol = n, nrow = p) thinlog2 <- rexp(n = p, rate = 1) ## Thin total gene expressions thout <- thin_gene(mat = mat, thinlog2 = thinlog2) ## Compare empirical thinning proportions to specified thinning proportions empirical_propvec <- rowMeans(thout$mat) / lambda specified_propvec <- 2 ^ (-thinlog2) plot(empirical_propvec, specified_propvec, xlab = "Empirical Thinning Proportion", ylab = "Specified Thinning Proportion") abline(0, 1, col = 2, lwd = 2)
## Generate count data and thinning factors ## In practice, you would obtain mat from a real dataset, not simulate it. set.seed(1) n <- 10 p <- 1000 lambda <- 1000 mat <- matrix(lambda, ncol = n, nrow = p) thinlog2 <- rexp(n = p, rate = 1) ## Thin total gene expressions thout <- thin_gene(mat = mat, thinlog2 = thinlog2) ## Compare empirical thinning proportions to specified thinning proportions empirical_propvec <- rowMeans(thout$mat) / lambda specified_propvec <- 2 ^ (-thinlog2) plot(empirical_propvec, specified_propvec, xlab = "Empirical Thinning Proportion", ylab = "Specified Thinning Proportion") abline(0, 1, col = 2, lwd = 2)
Given a matrix of real RNA-seq counts, this function will apply a
separate, user-provided thinning factor to each sample. This uniformly
lowers the counts for all genes in a sample. The thinning factor
should be provided on the log2-scale. This is a specific application
of the binomial thinning approach in thin_diff
. The method is
described in detail in Gerard (2020).
thin_lib(mat, thinlog2, relative = FALSE, type = c("thin", "mult"))
thin_lib(mat, thinlog2, relative = FALSE, type = c("thin", "mult"))
mat |
A numeric matrix of RNA-seq counts. The rows index the genes and the columns index the samples. |
thinlog2 |
A vector of numerics. Element i is the amount to thin (on the log2-scale) for sample i. For example, a value of 0 means that we do not thin, a value of 1 means that we thin by a factor of 2, a value of 2 means we thin by a factor of 4, etc. |
relative |
A logical. Should we apply relative thinning ( |
type |
Should we apply binomial thinning ( |
A list-like S3 object of class ThinData
.
Components include some or all of the following:
mat
The modified matrix of counts.
designmat
The design matrix of variables used to simulate
signal. This is made by column-binding design_fixed
and the
permuted version of design_perm
.
coefmat
A matrix of coefficients corresponding to
designmat
.
design_obs
Additional variables that should be included in
your design matrix in downstream fittings. This is made by
column-binding the vector of 1's with design_obs
.
sv
A matrix of estimated surrogate variables. In simulation studies you would probably leave this out and estimate your own surrogate variables.
cormat
A matrix of target correlations between the
surrogate variables and the permuted variables in the design matrix.
This might be different from the target_cor
you input because
we pass it through fix_cor
to ensure
positive semi-definiteness of the resulting covariance matrix.
matching_var
A matrix of simulated variables used to
permute design_perm
if the target_cor
is not
NULL
.
David Gerard
Gerard, D (2020). "Data-based RNA-seq simulations by binomial thinning." BMC Bioinformatics. 21(1), 206. doi:10.1186/s12859-020-3450-9.
select_counts
For subsampling the rows and columns of your real RNA-seq count matrix prior to applying binomial thinning.
thin_diff
For the more general thinning approach.
thin_gene
For thinning gene-wise instead of sample-wise.
thin_all
For thinning all counts uniformly.
ThinDataToSummarizedExperiment
For converting a ThinData object to a SummarizedExperiment object.
ThinDataToDESeqDataSet
For converting a ThinData object to a DESeqDataSet object.
## Generate count data and thinning factors ## In practice, you would obtain mat from a real dataset, not simulate it. set.seed(1) n <- 10 p <- 1000 lambda <- 1000 mat <- matrix(lambda, ncol = n, nrow = p) thinlog2 <- rexp(n = n, rate = 1) ## Thin library sizes thout <- thin_lib(mat = mat, thinlog2 = thinlog2) ## Compare empirical thinning proportions to specified thinning proportions empirical_propvec <- colMeans(thout$mat) / lambda specified_propvec <- 2 ^ (-thinlog2) empirical_propvec specified_propvec
## Generate count data and thinning factors ## In practice, you would obtain mat from a real dataset, not simulate it. set.seed(1) n <- 10 p <- 1000 lambda <- 1000 mat <- matrix(lambda, ncol = n, nrow = p) thinlog2 <- rexp(n = n, rate = 1) ## Thin library sizes thout <- thin_lib(mat = mat, thinlog2 = thinlog2) ## Compare empirical thinning proportions to specified thinning proportions empirical_propvec <- colMeans(thout$mat) / lambda specified_propvec <- 2 ^ (-thinlog2) empirical_propvec specified_propvec
The design formula in the resulting DESeqDataSet is just the sum of all
variables in designmat
from the ThinData object (except the
intercept term). You should change this design formula if you want to
study other models.
ThinDataToDESeqDataSet(obj)
ThinDataToDESeqDataSet(obj)
obj |
A ThinData S3 object. This is generally output by either
|
A DESeqDataSet
S4
object. This will allow you to insert the simulated
data directly into DESeq2.
David Gerard
## Generate simulated data and modify using thin_diff(). ## In practice, you would use real data, not simulated. set.seed(1) n <- 10 p <- 1000 Z <- matrix(abs(rnorm(n, sd = 4))) alpha <- matrix(abs(rnorm(p, sd = 1))) mat <- round(2^(alpha %*% t(Z) + abs(matrix(rnorm(n * p, sd = 5), nrow = p, ncol = n)))) design_perm <- cbind(rep(c(0, 1), length.out = n), runif(n)) coef_perm <- matrix(rnorm(p * ncol(design_perm), sd = 6), nrow = p) design_obs <- matrix(rnorm(n), ncol = 1) target_cor <- matrix(c(0.9, 0)) thout <- thin_diff(mat = mat, design_perm = design_perm, coef_perm = coef_perm, target_cor = target_cor, design_obs = design_obs, permute_method = "hungarian") ## Convert ThinData object to DESeqDataSet object. seobj <- ThinDataToDESeqDataSet(thout) class(seobj) ## The "O1" variable in the colData corresponds to design_obs. ## The "P1" and "P2" variables in colData correspond to design_perm. seobj
## Generate simulated data and modify using thin_diff(). ## In practice, you would use real data, not simulated. set.seed(1) n <- 10 p <- 1000 Z <- matrix(abs(rnorm(n, sd = 4))) alpha <- matrix(abs(rnorm(p, sd = 1))) mat <- round(2^(alpha %*% t(Z) + abs(matrix(rnorm(n * p, sd = 5), nrow = p, ncol = n)))) design_perm <- cbind(rep(c(0, 1), length.out = n), runif(n)) coef_perm <- matrix(rnorm(p * ncol(design_perm), sd = 6), nrow = p) design_obs <- matrix(rnorm(n), ncol = 1) target_cor <- matrix(c(0.9, 0)) thout <- thin_diff(mat = mat, design_perm = design_perm, coef_perm = coef_perm, target_cor = target_cor, design_obs = design_obs, permute_method = "hungarian") ## Convert ThinData object to DESeqDataSet object. seobj <- ThinDataToDESeqDataSet(thout) class(seobj) ## The "O1" variable in the colData corresponds to design_obs. ## The "P1" and "P2" variables in colData correspond to design_perm. seobj
This only keeps the mat
, design_obs
, designmat
,
and coefmat
elements of the ThinData object.
ThinDataToSummarizedExperiment(obj)
ThinDataToSummarizedExperiment(obj)
obj |
A ThinData S3 object. This is generally output by either
|
A SummarizedExperiment
S4
object. This is often used in Bioconductor when performing
differential expression analysis.
David Gerard
## Generate simulated data and modify using thin_diff(). ## In practice, you would use real data, not simulated. set.seed(1) n <- 10 p <- 1000 Z <- matrix(abs(rnorm(n, sd = 4))) alpha <- matrix(abs(rnorm(p, sd = 1))) mat <- round(2^(alpha %*% t(Z) + abs(matrix(rnorm(n * p, sd = 5), nrow = p, ncol = n)))) design_perm <- cbind(rep(c(0, 1), length.out = n), runif(n)) coef_perm <- matrix(rnorm(p * ncol(design_perm), sd = 6), nrow = p) design_obs <- matrix(rnorm(n), ncol = 1) target_cor <- matrix(c(0.9, 0)) thout <- thin_diff(mat = mat, design_perm = design_perm, coef_perm = coef_perm, target_cor = target_cor, design_obs = design_obs, permute_method = "hungarian") ## Convert ThinData object to SummarizedExperiment object. seobj <- ThinDataToSummarizedExperiment(thout) class(seobj) ## The "O1" variable in the colData corresponds to design_obs. ## The "P1" and "P2" variables in colData correspond to design_perm. seobj
## Generate simulated data and modify using thin_diff(). ## In practice, you would use real data, not simulated. set.seed(1) n <- 10 p <- 1000 Z <- matrix(abs(rnorm(n, sd = 4))) alpha <- matrix(abs(rnorm(p, sd = 1))) mat <- round(2^(alpha %*% t(Z) + abs(matrix(rnorm(n * p, sd = 5), nrow = p, ncol = n)))) design_perm <- cbind(rep(c(0, 1), length.out = n), runif(n)) coef_perm <- matrix(rnorm(p * ncol(design_perm), sd = 6), nrow = p) design_obs <- matrix(rnorm(n), ncol = 1) target_cor <- matrix(c(0.9, 0)) thout <- thin_diff(mat = mat, design_perm = design_perm, coef_perm = coef_perm, target_cor = target_cor, design_obs = design_obs, permute_method = "hungarian") ## Convert ThinData object to SummarizedExperiment object. seobj <- ThinDataToSummarizedExperiment(thout) class(seobj) ## The "O1" variable in the colData corresponds to design_obs. ## The "P1" and "P2" variables in colData correspond to design_perm. seobj
Group assignment independent of anything.
uncorassign(n, return = c("group", "full"))
uncorassign(n, return = c("group", "full"))
n |
The sample size. |
return |
Should we just return a list with just the
vector of assignment ( |
A list with some or all of the following elements.
x
The group assignment. 1L
for one group and
0L
for the other group.
w
The latent assignment vector (only returned if
return = "full"
). Negative corresponds to one group
and positive corresponds to the other group.
David Gerard