scdrs.preprocess
- scdrs.preprocess(data, cov=None, adj_prop=None, n_mean_bin=20, n_var_bin=20, n_chunk=None, copy=False)
Preprocess single-cell data for scDRS analysis.
1. Correct covariates by regressing out the covariates (including a constant term) and adding back the original mean for each gene.
2. Compute gene-level and cell-level statistics for the covariate-corrected data.
Information is stored in data.uns["SCDRS_PARAM"]. To improve memory efficiency, the function operates in implicit-covariate-correction mode when data.X is sparse and cov is not None; otherwise it operates in normal mode.
In normal mode, data.X is replaced by the covariate-corrected data.
In implicit-covariate-correction mode, the covariate correction information is stored in data.uns["SCDRS_PARAM"] but is not explicitly applied to data.X, so that data.X remains sparse. Subsequent computations on the covariate-corrected data are based on the original data.X and the covariate correction information. Specifically,
CORRECTED_X = data.X + COV_MAT * COV_BETA + COV_GENE_MEAN
The adj_prop option adjusts for cell group proportions: when computing the expression mean and variance of each gene, each cell is weighted inversely proportional to the size of its cell group. For stability, the smallest group size is set to be at least 1% of the largest group size.
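The weighting described above can be illustrated with a minimal sketch. This is not the scDRS implementation; the helper names `group_balanced_weights` and `weighted_mean_var` are hypothetical, and normalizing the weights to sum to 1 is an assumption made here for convenience.

```python
import numpy as np


def group_balanced_weights(groups, min_frac=0.01):
    """Hypothetical helper: per-cell weights inversely proportional to group size."""
    groups = np.asarray(groups)
    # For every cell, find the index of its group and the size of each group.
    _, inverse, counts = np.unique(groups, return_inverse=True, return_counts=True)
    # Floor small groups at min_frac (1%) of the largest group size for stability.
    sizes = np.maximum(counts.astype(float), min_frac * counts.max())
    weights = 1.0 / sizes[inverse]
    # Assumption: normalize so the weights sum to 1.
    return weights / weights.sum()


def weighted_mean_var(x, weights):
    """Weighted mean and variance of one gene's expression across cells."""
    x = np.asarray(x, dtype=float)
    mean = np.sum(weights * x)
    var = np.sum(weights * (x - mean) ** 2)
    return mean, var


# Three cells of one type and one cell of another: each group ends up
# contributing half of the total weight.
w = group_balanced_weights(["T", "T", "T", "B"])

# A very small group is floored at 1% of the largest group size.
w_floor = group_balanced_weights(["A"] * 1000 + ["B"])
```

With these weights, a rare cell group influences the gene-level mean and variance as much as an abundant one, except where the 1% floor caps the weight of very small groups.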
- Parameters:
- data: anndata.AnnData
Single-cell data of shape (n_cell, n_gene). Assumed to be size-factor-normalized and log1p-transformed.
- cov: pandas.DataFrame, default=None
Covariates of shape (n_cell, n_cov). Should contain a constant term and have values for at least 75% of cells.
- adj_prop: str, default=None
Cell group annotation (e.g., cell type) used for adjusting for cell group proportions. adj_prop should be present in data.obs.columns.
- n_mean_bin: int, default=20
Number of mean-expression bins for matching control genes.
- n_var_bin: int, default=20
Number of expression-variance bins for matching control genes.
- n_chunk: int, default=None
Number of chunks to split the data into when computing mean and variance using _get_mean_var_implicit_cov_corr. If n_chunk is None, it is set to 5/sparsity.
- copy: bool, default=False
Return a copy instead of writing to data.
- Returns:
- Overview:
data.X is updated to the covariate-corrected data in normal mode and is left untouched in implicit-covariate-correction mode. Preprocessing information is stored in data.uns["SCDRS_PARAM"].
- FLAG_SPARSE: bool
Whether data.X is sparse.
- FLAG_COV: bool
Whether covariate correction is performed.
- COV_MAT: pandas.DataFrame
Covariate matrix of shape (n_cell, n_cov).
- COV_BETA: pandas.DataFrame
Covariate effect sizes of shape (n_gene, n_cov).
- COV_GENE_MEAN: pandas.Series
Gene-level mean expression.
- GENE_STATS: pandas.DataFrame
Gene-level statistics of shape (n_gene, 7):
“mean” : mean expression in log scale.
“var” : expression variance in log scale.
“var_tech” : technical variance in log scale.
“ct_mean” : mean expression in original non-log scale.
“ct_var” : expression variance in original non-log scale.
“ct_var_tech” : technical variance in original non-log scale.
“mean_var” : n_mean_bin * n_var_bin mean-variance bins
- CELL_STATS: pandas.DataFrame
Cell-level statistics of shape (n_cell, 2):
“mean” : mean expression in log scale.
“var” : expression variance in log scale.
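To make the “mean_var” column more concrete, here is a minimal sketch of one way to form n_mean_bin * n_var_bin mean-variance bins: genes are first split into quantile bins by mean, then, within each mean bin, into quantile bins by variance. This is an assumption about the general technique rather than a description of the scDRS code; the helpers `quantile_bin` and `mean_var_bins` and the "mean_bin.var_bin" label format are hypothetical.

```python
import numpy as np


def quantile_bin(values, n_bin):
    """Hypothetical helper: rank-based bins of near-equal size, labeled 0..n_bin-1."""
    values = np.asarray(values, dtype=float)
    # Rank of each value (0 = smallest); ties are broken by input order.
    ranks = np.argsort(np.argsort(values, kind="stable"), kind="stable")
    return (ranks * n_bin) // len(values)


def mean_var_bins(gene_mean, gene_var, n_mean_bin=20, n_var_bin=20):
    """Assign each gene a label combining its mean bin and its variance bin."""
    gene_var = np.asarray(gene_var, dtype=float)
    mean_bin = quantile_bin(gene_mean, n_mean_bin)
    labels = np.empty(len(gene_var), dtype=object)
    for b in np.unique(mean_bin):
        # Bin by variance only among genes that share the same mean bin.
        idx = np.flatnonzero(mean_bin == b)
        var_bin = quantile_bin(gene_var[idx], n_var_bin)
        for i, vb in zip(idx, var_bin):
            labels[i] = f"{b}.{vb}"
    return labels


# Toy data: 400 genes, 4 mean bins x 5 variance bins = 20 bins of 20 genes each.
rng = np.random.default_rng(0)
toy_mean = rng.random(400)
toy_var = rng.random(400)
toy_labels = mean_var_bins(toy_mean, toy_var, n_mean_bin=4, n_var_bin=5)
```

Control genes for a given gene could then be drawn from genes carrying the same label, so that they have similar mean expression and similar expression variance.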
Notes
- Covariate regression:
adata.X = cov * beta + resid_X.
- scDRS saves:
COV_MAT = cov, COV_BETA = (-beta), COV_GENE_MEAN = adata.X.mean(axis=0)
- The scDRS covariate-corrected data:
CORRECTED_X = resid_X + COV_GENE_MEAN = adata.X + COV_MAT * COV_BETA + COV_GENE_MEAN.
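The identity in these notes can be checked numerically with a small NumPy sketch. It uses ordinary least squares on dense toy data and is not the scDRS implementation. Since COV_BETA is documented with shape (n_gene, n_cov), the sketch assumes that the product COV_MAT * COV_BETA denotes the matrix product of COV_MAT with the transpose of COV_BETA.

```python
import numpy as np

rng = np.random.default_rng(0)
n_cell, n_gene = 50, 4

# Toy expression matrix and covariates (a constant term plus one covariate).
X = rng.random((n_cell, n_gene))
cov = np.column_stack([np.ones(n_cell), rng.normal(size=n_cell)])

# Covariate regression: X = cov @ beta + resid_X, with beta of shape (n_cov, n_gene).
beta, *_ = np.linalg.lstsq(cov, X, rcond=None)
resid_X = X - cov @ beta

# Quantities named as in the notes; COV_BETA is stored as (n_gene, n_cov).
COV_MAT = cov
COV_BETA = -beta.T
COV_GENE_MEAN = X.mean(axis=0)

# Explicit correction: residuals plus the original gene means.
corrected_explicit = resid_X + COV_GENE_MEAN

# Implicit correction: built from the original X and the stored information.
corrected_implicit = X + COV_MAT @ COV_BETA.T + COV_GENE_MEAN
```

Both forms give the same matrix. Because the covariates include a constant term, the residuals have zero mean for each gene, so the corrected data keeps the original gene means.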