Disentanglement of single-cell data with biolord

Main

A cell’s gene expression profile simultaneously encodes information about multiple attributes, such as cell type, tissue of origin and differentiation stage (Fig. 1a). Single-cell technologies can provide information about such expression profiles for cellular populations at single-cell resolution. Yet, it is still a major challenge to decode the measured gene expression, disentangling the processes from one another. A disentangled representation can uncover the existence and characteristics of diverse biological processes, allowing the reconstruction of multiple attributes of cellular identity such as response to perturbations and infection progression. Earlier studies suggested using factor analysis1,2 or non-negative matrix factorization3 to identify programs associated with different attributes. Recently, computational methods that specialize in disentanglement for a specific task were suggested; among the addressed tasks are decoupling perturbation response4,5,6,7,8, disentangling group-specific attributes9 or out-of-distribution sampling of single-cell data10,11. However, these are either task-specific and do not address the general disentanglement problem, rely on linearity and independence assumptions, cannot integrate multiple types of information beyond the single-cell measurements or do not provide a generic reconstruction procedure.

Fig. 1: The biolord framework for disentanglement of known and unknown attributes.

a, Single-cell data encode multiple attributes of cellular identity. b, Schematic overview of the biolord model; given single-cell measurements and labels for observed attributes, biolord encodes each attribute separately along with a single encoding of the unknown attributes. These define a decomposed latent space that is the input for the generative module providing measurement predictions. c, Biolord can be used for multiple downstream tasks. From left to right—latent space representation: the decomposed latent space can be used to obtain insights into the underlying structure of individual attributes. Counterfactual predictions: given a control cell and unseen (target) labels as input, biolord can predict the gene expression of the unseen cellular states and study the changes in gene expression that correspond to a manipulation of a cell’s attribute. Association of features to state: by manipulating the known attributes, biolord can identify measured features associated with the different possible states, for example, by manipulating control cells to an infected state and identifying genes associated with infection. Attribute classification: using the semi-supervised biolord architecture, cells can be labeled with missing attributes. d, Schematic overview for obtaining counterfactual predictions. We take as input measurements of a set of reference cells with varying assignment(s) to the attribute over which predictions are made. For example, we take as input control cells along with multiple drugs that can be applied to generate counterfactual predictions as to how the gene expression profiles of these cells would have been shifted given each drug (Methods). e, An evaluation of biolord’s performance on predictions of unseen drugs over the sci-Plex 3 dataset that includes ~650,000 single-cell transcriptomes from three cancer cell lines exposed to 188 compounds16. 
Results are reported for the 10 μM dosage, considered to be the strictest setting since measurements show the largest deviation from control state, which makes them hardest to predict. Mean and variance are reported over ten different random seed initializations of each model. Figure panels a–d are created with BioRender.com.

In machine learning, disentanglement methods view the world as generated by an unknown forward process that maps the generative factors (attributes) into the observable data. For example, an image of a car is generated by several attributes such as model and pose. The objective of disentanglement is to invert this process, for example, mapping the car image into variables representing its model and pose. The disentangled representation can then be used for data manipulation, generating unseen combinations of model and pose. Analogously, in the biological setting, given labeled single-cell data, for example, cell type and age annotations (known attributes), a disentangled representation will decouple known attributes, cell type and age, from the unknown attributes. The unknown attributes correspond to a cell-specific signature, for example, related to batch effects, biological noise or unclassified biological processes. The disentangled representation can be used for data generation, manipulation and deriving biological insight (for example, predicting the measured features of unobserved combinations of cell type and age or identifying driver genes of certain cell type or state).

Using recent advances in disentanglement from the computer vision field12,13, we present biolord (biological representation disentanglement), a deep generative framework for learning disentangled representations in single-cell data (Methods). To disentangle single-cell data into its underlying attributes, we assume a training set consisting of single-cell measurements, each with partial supervision over a limited set of known attributes. For example, the known attributes may be cell-type labels, measurement time or perturbation values; attributes may be categorical (discrete; for example, cell type) or ordered (continuous; for example, age). Given the partial supervision, biolord finds a disentangled latent space, consisting of embeddings for each known attribute and an embedding for the remaining unknown attributes in the data (Fig. 1b). On top of these, biolord learns a generator, which maps the representations of the known and unknown attributes into observable single-cell data. It can, in turn, use the disentangled latent space to predict single-cell measurements for different cell states across variations in internal or external conditions. Successful disentanglement is obtained by inducing information constraints; the model’s loss function attempts to maximize the accuracy of the reconstruction (enforcing completeness) while minimizing the information encoded in the unknown attributes (limiting its capacity). We modify the original framework, dedicated to image analysis12,13, to account for the features of single-cell data through architecture and design choices (Fig. 1b and Supplementary Note 1; Methods). Furthermore, we present an extension to the framework, biolord-classify, which can be applied to datasets with partially labeled attributes and provides a classification for missing labels (Methods; Extended Data Fig. 1).
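The decomposition described above can be illustrated with a toy sketch. It is not the biolord implementation: the linear generator, the embedding values, the `unknown_penalty` weight and the use of a squared-norm penalty as the capacity constraint are all assumptions made for illustration. It shows one embedding per known attribute, a per-cell embedding for the unknown attributes, a generator that maps their concatenation to expression, and a loss that trades reconstruction accuracy against the size of the unknown embedding.

```python
import random

random.seed(0)

N_GENES = 4
DIM = 2  # embedding size per attribute (hypothetical)

# Categorical attribute: one embedding per category (toy values).
cell_type_embedding = {"hepatocyte": [1.0, 0.0], "kupffer": [0.0, 1.0]}


def encode_age(age):
    """Ordered attribute: toy linear map from a continuous value to an embedding."""
    return [0.1 * age, -0.05 * age]


# Unknown attributes: one free embedding per training cell (toy values).
unknown_embedding = {"cell_0": [0.2, -0.1], "cell_1": [-0.3, 0.4]}

# Toy linear generator: a genes x (3 * DIM) weight matrix.
W = [[random.uniform(-1, 1) for _ in range(3 * DIM)] for _ in range(N_GENES)]


def generate(cell_type, age, unknown):
    """Map the decomposed latent vector to predicted expression."""
    z = cell_type_embedding[cell_type] + encode_age(age) + unknown
    return [sum(w * zi for w, zi in zip(row, z)) for row in W]


def loss(x, cell_id, cell_type, age, unknown_penalty=0.1):
    """Reconstruction error plus a penalty limiting the unknown embedding.

    The squared-norm penalty is an assumed stand-in for the capacity
    constraint; the actual regularization is specified in the Methods.
    """
    u = unknown_embedding[cell_id]
    x_hat = generate(cell_type, age, u)
    reconstruction = sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)
    capacity = sum(ui ** 2 for ui in u)
    return reconstruction + unknown_penalty * capacity


# Counterfactual: keep the cell's unknown embedding, swap a known attribute.
observed = generate("hepatocyte", 10.0, unknown_embedding["cell_0"])
counterfactual = generate("kupffer", 10.0, unknown_embedding["cell_0"])
print(loss(observed, "cell_0", "hepatocyte", 10.0))
```

In this sketch a perfectly reconstructed cell still pays the penalty on its unknown embedding, which pushes information that the known attributes can explain into their embeddings.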

The generality of the framework allows its application to diverse biological settings that can be studied with a rich set of downstream analysis tasks (Fig. 1c; Methods). Using the generative aspect of the model, we can make counterfactual predictions, predicting unseen cellular states and performing data manipulation. Applied to the prediction of responses to unseen drugs or gene perturbations, biolord outperforms state-of-the-art methods dedicated to this task. The decomposed latent space representation allows studying the different attributes and their inner structure independently. For example, this representation of the human fetal chromatin atlas revealed the relationships between tissue, sample estimated post-conceptual age and cell-type attributes (Supplementary Notes 1 and 2 and Supplementary Figs. 1–3). Moreover, we can associate measured features with a cell state. Finally, biolord can be applied to a partially labeled dataset and used to obtain labeling over the entire dataset (attribute classification). We apply this to a spatiotemporal Plasmodium infection atlas to complete the missing classification of a distinct state (initially provided only for the latest time point), thereby allowing us to study the transient trajectory toward the infected state. We implemented biolord using the scvi-tools library14 and made it available at https://github.com/nitzanlab/biolord.

Biolord accurately predicts cellular perturbation response

Accurate prediction of molecular responses to drug or genetic perturbations is central to our understanding of cellular behavior and translational medicine. Hence, many computational tools are dedicated to this task5,6,7,8,15 (Supplementary Note 3). Among these are chemCPA5, for drug response prediction, GEARS6 for genetic perturbations and PerturbNet8, which addresses both (Supplementary Note 3). Cellular response prediction can be framed as a disentanglement task, aimed at decoupling perturbation response from the underlying cell state, and therefore can be approached by biolord. For the drug response prediction task, we use the sci-Plex 3 dataset that includes ~650,000 single-cell transcriptomes from three cancer cell lines exposed to 188 compounds at four different dosages and control samples16 (Supplementary Fig. 4).

To allow generalization to unseen drugs, we take advantage of existing prior knowledge and obtain chemically informed embedding of the drugs using RDKit features5,17. For each cell, the features of each drug, alongside its dosage, cell line and corresponding scRNA-seq measurements, are given as input to biolord (Methods; Supplementary Note 3). Biolord’s learned latent representation is biologically informative; it reveals drug organization according to known corresponding pathways, and better captures underlying drug organization, relative to the chemically informed RDKit features used as input, both qualitatively and quantitatively (adjusted Rand index RDKit: 0.03, biolord: 0.16; Supplementary Fig. 4). To further evaluate the quality of biolord’s drugs representation, we employ the uncertainty measure suggested by ref. 5, assessing the ability to predict the drug’s pathway from the k-nearest neighbor (k-NN) graph of the embedding space (Methods). Compared to RDKit, biolord’s uncertainty measure is found to be lower on average and more concentrated (distribution evaluated over all drugs; RDKit: 0.32 ± 0.008, biolord: 0.19 ± 0.005; Supplementary Fig. 4).
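For readers unfamiliar with the adjusted Rand index used in the quantitative comparison above, the following is a minimal sketch of the standard formula. The clustering procedure applied to the drug embeddings is not specified in this section, so the cluster assignments and pathway names below are hypothetical placeholders.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Chance-corrected agreement between two labelings of the same items."""
    if len(labels_a) != len(labels_b):
        raise ValueError("labelings must have the same length")
    n = len(labels_a)
    if n < 2:
        return 1.0
    # Pairs of items that are grouped together in both labelings.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    # Pairs grouped together within each labeling separately.
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = 0.5 * (sum_a + sum_b)
    if max_index == expected:
        return 1.0  # degenerate case: both labelings are trivial
    return (sum_cells - expected) / (max_index - expected)


# Hypothetical pathway annotations and embedding-derived clusters for four drugs.
pathways = ["HDAC", "HDAC", "Aurora kinase", "Aurora kinase"]
clusters = [0, 0, 1, 2]
print(adjusted_rand_index(pathways, clusters))
```

A value near 0 indicates agreement no better than chance and 1 indicates identical groupings, which is the scale on which the RDKit and biolord scores above are compared.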

We use the trained biolord model to obtain counterfactual predictions for nine unseen drugs (reported among the most effective drugs in sci-Plex 3 data16, following the choice suggested in ref. 5). Specifically, we generate the expression prediction for control cells with labels of unseen compounds. Performance is evaluated using the r² score between the real measurements of cells exposed to the unseen compounds and the counterfactual predictions (Fig. 1d; Methods). Biolord outperforms a naive baseline (comparing real measurements of unseen compounds to the control measurements), as well as state-of-the-art models, chemCPA and chemCPA-pre (Fig. 1e, Supplementary Fig. 4 and Supplementary Note 3). Although not provided with the additional information used by chemCPA-pre, biolord provides more accurate predictions (mean r²; chemCPA-pre: 0.51 ± 0.0062, biolord: 0.76 ± 0.0005). Biolord also outperforms PerturbNet8 (Supplementary Note 3) and is robust to data subsampling, retaining high prediction accuracy (mean r²: 0.63 ± 0.0003) over 10% of the data (Supplementary Fig. 5).
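The evaluation described above can be sketched as follows. This is an illustration under assumptions: it compares per-gene mean expression of the measured treated cells with the per-gene mean of the counterfactual predictions, which is one common way to compute such a score, and all expression values are invented. The exact procedure is given in the Methods.

```python
def gene_means(cells):
    """Per-gene mean over a list of cells (each cell is a list of gene values)."""
    n_cells = len(cells)
    return [sum(column) / n_cells for column in zip(*cells)]


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy data: 2 cells x 3 genes per group (invented values).
treated_cells = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]]    # measured, unseen drug
predicted_cells = [[2.0, 3.0, 4.0], [2.2, 2.8, 4.0]]  # counterfactual predictions
control_cells = [[2.0, 2.5, 3.0], [2.0, 2.5, 3.0]]    # untreated controls

target = gene_means(treated_cells)
model_r2 = r2_score(target, gene_means(predicted_cells))
# Naive baseline: use the control measurements as the "prediction".
baseline_r2 = r2_score(target, gene_means(control_cells))
print(model_r2, baseline_r2)
```

The naive baseline matters because a drug with a weak effect leaves treated cells close to control, so a high score is only informative when it clearly exceeds the control comparison.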

To demonstrate biolord’s application to the genetic perturbation setting, we consider two genetic perturbation screens that use the Perturb-seq assay18. The first is a dataset consisting of 81 one-gene perturbations suggested by ref. 19, and the second is a dataset suggested in ref. 20 that includes 131 two-gene perturbations and 105 one-gene perturbations. In this setting, to allow for generalization, we use features that are based on edges in a GO term graph over genetic perturbations, as defined in ref. 6 (Methods). We show that biolord outperforms GEARS in the prediction of unseen one-gene perturbations (normalized mean squared error, one of one gene unseen; GEARS: 0.47; biolord: 0.37) and two-gene perturbations (normalized mean squared error, two of two genes unseen; GEARS: 0.53; biolord: 0.50, one of two genes unseen; GEARS: 0.39; biolord: 0.35, zero of two genes unseen; GEARS: 0.28; biolord: 0.20; Methods; Supplementary Note 3 and Extended Data Fig. 2).
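The evaluation categories reported above depend on how many genes of a held-out perturbation were seen as single-gene perturbations during training. A minimal sketch of that bookkeeping is shown below; the gene names, the `+` separator and the function name are hypothetical and are not taken from the GEARS or biolord code.

```python
from collections import defaultdict


def unseen_category(perturbation, seen_genes):
    """Label a perturbation by how many of its genes were unseen in training."""
    genes = perturbation.split("+")
    n_unseen = sum(1 for gene in genes if gene not in seen_genes)
    return f"{n_unseen} of {len(genes)} unseen"


# Hypothetical training genes and held-out test perturbations.
seen_in_training = {"GENE_A", "GENE_B"}
test_perturbations = ["GENE_C", "GENE_A+GENE_B", "GENE_A+GENE_C", "GENE_C+GENE_D"]

groups = defaultdict(list)
for perturbation in test_perturbations:
    groups[unseen_category(perturbation, seen_in_training)].append(perturbation)

for category, members in sorted(groups.items()):
    print(category, members)
```

Reporting the error separately per category distinguishes predicting a new combination of familiar genes from generalizing to genes the model has never seen perturbed.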

Counterfactual predictions expose infection gene programs

The collection of spatiotemporal single-cell atlases is continuously expanding, each capturing a complex biological setting. Among the computational challenges is disentangling the diverse attributes, thereby associating the measured features with distinct cell states. Focusing on a spatiotemporal single-cell atlas of Plasmodium infection progression in the mouse liver21, we show that biolord can obtain a disentangled representation that allows for uncovering infection-related attributes. Single-cell data, including host and parasite transcriptome, were collected from infected mice at five time points post-infection (2, 12, 24, 30 and 36 h post-infection (hpi)), as well as from control mice, not exposed to the parasite (control; Fig. 2a and Extended Data Fig. 3). To classify hepatocytes as infected or uninfected, the authors relied on GFP content in the parasite transcriptome21 (Fig. 2b). Using biolord, we aimed at decoupling the changes in gene expression in the host hepatocytes induced by the infection from the variability rooted in previously established spatiotemporal processes22,23—either in spatial zonation across liver lobules radial axis or in temporal variation along the time of day (Fig. 2a and Extended Data Fig. 3).

Fig. 2: Recovering transient states by classifying unknown cell states using biolord.

a,b, UMAPs of the single-cell atlas of the Plasmodium liver stage21. Cells are colored by time after infection (a) and reported classification to infected/uninfected and control cells (b). c, UMAP of the original control cells with their counterfactual predictions (c-pred.) for infected/uninfected state; cells are colored by the corresponding state. d, GSEA of genes found to be associated with the infected state based on biolord’s counterfactual predictions of the infection state in control cells. H denotes Hallmark gene sets; K denotes KEGG gene sets (Padj is calculated using a permutation test with Benjamini–Hochberg correction). e,f, UMAPs of the infected cells from intermediate to late time points in the single-cell atlas of the Plasmodium liver stage21. Cells are colored by the reported abortive/productive classification of cells at 36 hpi (e) and biolord’s classification of all infected cells as abortive/productive (f). The inset shows the fraction of abortive cells at each time point (24 hpi, 0.016; 30 hpi, 0.057 and 36 hpi, 0.215). g, Box plot comparing abortive and productive cells shows that abortive hepatocytes retain a smaller fraction of Plasmodium transcriptome across all time points. Middle line in box plot, median; box boundary, IQR; whiskers, 1.5× IQR; minimum and maximum, not indicated in the box plot; gray dots, points beyond the minimum or maximum whisker (Mann–Whitney–Wilcoxon test two-sided with Benjamini–Hochberg correction: 24 and 30 hpi (n = 1,823 cells across two states); biolord-classify
Copyright for syndicated content belongs to the linked Source : Nature.com – https://www.nature.com/articles/s41587-023-02079-x
