scdrs.preprocess

scdrs.preprocess(data, cov=None, adj_prop=None, n_mean_bin=20, n_var_bin=20, n_chunk=None, copy=False)

Preprocess single-cell data for scDRS analysis.

1. Correct covariates by regressing out the covariates (including a constant term) and adding back the original mean for each gene.

2. Compute gene-level and cell-level statistics for the covariate-corrected data.

Information is stored in data.uns["SCDRS_PARAM"]. To improve memory efficiency, the function operates in implicit-covariate-correction mode when data.X is sparse and cov is not None; otherwise it operates in normal mode.

In normal mode, data.X is replaced by the covariate-corrected data.

In implicit-covariate-correction mode, the covariate correction information is stored in data.uns["SCDRS_PARAM"] but is not explicitly applied to data.X, so data.X remains sparse. Subsequent computations on the covariate-corrected data are based on the original data.X and the covariate correction information. Specifically,

CORRECTED_X = data.X + COV_MAT * COV_BETA + COV_GENE_MEAN
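The point of the implicit mode can be illustrated with a small NumPy sketch. It is not scDRS's internal code: the toy arrays, the `gene_idx` gene set, and the choice of a per-cell gene-set sum as the downstream quantity are all assumptions made for illustration. Because COV_MAT has shape (n_cell, n_cov) and COV_BETA has shape (n_gene, n_cov), the product in the formula above is written here as `COV_MAT @ COV_BETA.T`.

```python
import numpy as np

rng = np.random.default_rng(1)
n_cell, n_gene, n_cov = 50, 8, 2

# Mostly-zero matrix standing in for a sparse data.X (toy data).
X = rng.poisson(0.3, size=(n_cell, n_gene)).astype(float)
COV_MAT = np.column_stack([np.ones(n_cell), rng.normal(size=n_cell)])
COV_BETA = rng.normal(size=(n_gene, n_cov))
COV_GENE_MEAN = X.mean(axis=0)

gene_idx = np.array([0, 3, 5])  # hypothetical gene set

# Implicit route: the dense (n_cell, n_gene) corrected matrix is never built.
# Only X restricted to the gene set and small correction terms are used.
implicit_sum = (
    X[:, gene_idx].sum(axis=1)
    + COV_MAT @ COV_BETA[gene_idx].sum(axis=0)
    + COV_GENE_MEAN[gene_idx].sum()
)

# Explicit route, shown only to check the implicit result.
corrected = X + COV_MAT @ COV_BETA.T + COV_GENE_MEAN
explicit_sum = corrected[:, gene_idx].sum(axis=1)
```

Both routes give the same per-cell values; the implicit one touches only the original matrix plus terms of size n_cell × n_cov and n_gene × n_cov.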

The adj_prop option adjusts for cell group proportions: each cell is weighted inversely proportional to the size of its cell group when computing the expression mean and variance of each gene. For stability, the smallest group size is set to be at least 1% of the largest group size.
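A minimal sketch of this weighting scheme follows. The helper `cell_weights`, its `min_frac` argument, and the normalization of the weights to sum to 1 are assumptions of this example, not names or behavior taken from scDRS; only the inverse-group-size weighting and the 1% floor come from the description above.

```python
import numpy as np
from collections import Counter


def cell_weights(groups, min_frac=0.01):
    """Hypothetical helper: per-cell weights inversely proportional to group size."""
    counts = Counter(groups)
    # Floor small groups at min_frac of the largest group size.
    floor = min_frac * max(counts.values())
    sizes = np.array([max(counts[g], floor) for g in groups], dtype=float)
    w = 1.0 / sizes
    return w / w.sum()  # normalizing to sum 1 is an assumption of this sketch


# Toy annotation: two common groups and one very rare group.
groups = ["T"] * 1000 + ["B"] * 100 + ["rare"] * 2
w = cell_weights(groups)

rng = np.random.default_rng(0)
expr = rng.normal(size=len(groups))  # toy expression values for one gene

# Weighted mean and variance of the gene across cells.
w_mean = np.average(expr, weights=w)
w_var = np.average((expr - w_mean) ** 2, weights=w)
```

In this toy setup the two large groups contribute equally in total, while the two-cell group is treated as if it had 10 cells (1% of 1000), so its cells are not weighted 500 times more than the cells of the largest group.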

Parameters:
data: anndata.AnnData

Single-cell data of shape (n_cell, n_gene). Assumed to be size-factor-normalized and log1p-transformed.

cov: pandas.DataFrame, default=None

Covariates of shape (n_cell, n_cov). Should contain a constant term and have values for at least 75% of cells.

adj_prop: str, default=None

Cell group annotation (e.g., cell type) used for adjusting for cell group proportions. adj_prop should be present in data.obs.columns.

n_mean_bin: int, default=20

Number of mean-expression bins for matching control genes.

n_var_bin: int, default=20

Number of expression-variance bins for matching control genes.

n_chunk: int, default=None

Number of chunks to split the data into when computing mean and variance using _get_mean_var_implicit_cov_corr. If n_chunk is None, it is set to 5/sparsity.

copy: bool, default=False

Return a copy instead of writing to data.

Returns:
Overview:

data.X is replaced by the covariate-corrected data in normal mode and is left untouched in implicit-covariate-correction mode. Preprocessing information is stored in data.uns["SCDRS_PARAM"].

FLAG_SPARSE: bool

Whether data.X is sparse.

FLAG_COV: bool

Whether covariate correction is performed.

COV_MAT: pandas.DataFrame

Covariate matrix of shape (n_cell, n_cov).

COV_BETA: pandas.DataFrame

Covariate effect sizes of shape (n_gene, n_cov).

COV_GENE_MEAN: pandas.Series

Gene-level mean expression.

GENE_STATS: pandas.DataFrame

Gene-level statistics of shape (n_gene, 7):

  • “mean” : mean expression in log scale.

  • “var” : expression variance in log scale.

  • “var_tech” : technical variance in log scale.

  • “ct_mean” : mean expression in original non-log scale.

  • “ct_var” : expression variance in original non-log scale.

  • “ct_var_tech” : technical variance in original non-log scale.

  • “mean_var” : n_mean_bin * n_var_bin mean-variance bins

CELL_STATS: pandas.DataFrame

Cell-level statistics of shape (n_cell, 2):

  • “mean” : mean expression in log scale.

  • “var” : expression variance in log scale.
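The “mean_var” column of GENE_STATS assigns each gene to one of n_mean_bin * n_var_bin bins, which are then used for matching control genes. One plausible way to build such bins is sketched below. This is an assumption for illustration, not a description of scDRS's implementation: the helpers `quantile_bin` and `mean_var_bins`, the equal-count binning rule, the nesting of variance bins inside mean bins, and the "m.v" label format are all hypothetical.

```python
import numpy as np


def quantile_bin(values, n_bin):
    """Hypothetical helper: assign equal-count bin indices 0..n_bin-1 by rank."""
    ranks = np.argsort(np.argsort(values))
    return ranks * n_bin // len(values)


def mean_var_bins(gene_mean, gene_var, n_mean_bin=20, n_var_bin=20):
    """Bin genes by mean, then by variance within each mean bin (assumed scheme)."""
    mean_bin = quantile_bin(gene_mean, n_mean_bin)
    var_bin = np.zeros_like(mean_bin)
    for b in np.unique(mean_bin):
        sel = mean_bin == b
        var_bin[sel] = quantile_bin(gene_var[sel], n_var_bin)
    # Label format "m.v" is illustrative only.
    return np.array([f"{m}.{v}" for m, v in zip(mean_bin, var_bin)])


rng = np.random.default_rng(0)
gene_mean = rng.gamma(2.0, size=400)  # toy gene-level means
gene_var = rng.gamma(2.0, size=400)   # toy gene-level variances
labels = mean_var_bins(gene_mean, gene_var, n_mean_bin=4, n_var_bin=5)
bin_names, bin_sizes = np.unique(labels, return_counts=True)
```

With 400 toy genes and 4 × 5 bins, each bin holds 20 genes, so any gene has 19 other genes with a similar mean and variance to draw controls from.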

Notes

Covariate regression:

adata.X = cov * beta + resid_X.

scDRS saves:

COV_MAT = cov, COV_BETA = (-beta), COV_GENE_MEAN = adata.X.mean(axis=0)

The scDRS covariate-corrected data:

CORRECTED_X = resid_X + COV_GENE_MEAN = adata.X + COV_MAT * COV_BETA + COV_GENE_MEAN.
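The identity above can be checked numerically with a small NumPy sketch. The toy data and the use of `numpy.linalg.lstsq` for the regression are assumptions of this example; scDRS may fit the regression differently. Given the documented shapes, the product `COV_MAT * COV_BETA` is written as `COV_MAT @ COV_BETA.T`.

```python
import numpy as np

rng = np.random.default_rng(0)
n_cell, n_gene = 200, 5

# Toy covariates: a constant term plus one continuous covariate.
cov = np.column_stack([np.ones(n_cell), rng.normal(size=n_cell)])  # (n_cell, n_cov)

# Toy expression matrix with a covariate effect baked in.
true_effect = rng.normal(size=(n_gene, cov.shape[1]))
X = cov @ true_effect.T + rng.normal(scale=0.1, size=(n_cell, n_gene))

# Least-squares fit of X = cov @ beta.T + resid_X, one row of beta per gene.
beta = np.linalg.lstsq(cov, X, rcond=None)[0].T  # (n_gene, n_cov)
resid_X = X - cov @ beta.T

# Quantities corresponding to what the Notes say scDRS saves.
COV_MAT = cov
COV_BETA = -beta
COV_GENE_MEAN = X.mean(axis=0)

# Two routes to the covariate-corrected data.
corrected_from_resid = resid_X + COV_GENE_MEAN
corrected_from_params = X + COV_MAT @ COV_BETA.T + COV_GENE_MEAN
```

Because the covariates include a constant term, the residuals have zero mean for every gene, so adding back COV_GENE_MEAN restores each gene's original mean in the corrected data.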