Deep generative modeling of mosaic single-cell multimodal data
For cell $n \in \mathcal{N} = \{1, \ldots, N\}$ with batch ID $s_n \in \mathcal{B} = \{1, \ldots, B\}$, let $\boldsymbol{x}_n^m \in \mathbb{N}^{D_n^m}$ be the count vector of size $D_n^m$ from modality $m$ and $\boldsymbol{x}_n = \{\boldsymbol{x}_n^m\}_{m \in \mathcal{M}_n}$ be the set of count vectors from the measured modalities $\mathcal{M}_n \subseteq \mathcal{M} = \{\mathrm{ATAC}, \mathrm{RNA}, \mathrm{ADT}\}$. We define two modality-agnostic low-dimensional latent variables $\boldsymbol{c} \in \mathbb{R}^{D^c}$ and $\boldsymbol{u} \in \mathbb{R}^{D^u}$ to represent each cell's biological state and technical noise, respectively. To model the generative process of the observed variables $\boldsymbol{x}$ and $s$ for each cell, we factorize the joint distribution of all variables as
$$\begin{aligned} p(\boldsymbol{x}, s, \boldsymbol{c}, \boldsymbol{u}) &= p(\boldsymbol{c})\, p(\boldsymbol{u})\, p(s \mid \boldsymbol{u})\, p(\boldsymbol{x} \mid \boldsymbol{c}, \boldsymbol{u}) \\ &= p(\boldsymbol{c})\, p(\boldsymbol{u})\, p(s \mid \boldsymbol{u}) \prod_{m \in \mathcal{M}_n} p(\boldsymbol{x}^m \mid \boldsymbol{c}, \boldsymbol{u}) \end{aligned}$$
(1)
where we assume that $\boldsymbol{c}$ and $\boldsymbol{u}$ are independent of each other, that the batch ID $s$ depends only on $\boldsymbol{u}$ to facilitate the disentanglement of the two latent variables, and that the count variables $\{\boldsymbol{x}^m\}_{m \in \mathcal{M}_n}$ from different modalities are conditionally independent given $\boldsymbol{c}$ and $\boldsymbol{u}$.
Based on the above factorization, we define a generative model for x and s as
$$p(\boldsymbol{c}) = \mathrm{Normal}(\boldsymbol{c} \mid \boldsymbol{0}, \boldsymbol{I})$$
(2)
$$p(\boldsymbol{u}) = \mathrm{Normal}(\boldsymbol{u} \mid \boldsymbol{0}, \boldsymbol{I})$$
(3)
$$\boldsymbol{\pi} = g^s(\boldsymbol{u}; \boldsymbol{\theta}^s)$$
(4)
$$p_{\boldsymbol{\theta}}(s \mid \boldsymbol{u}) = \mathrm{Categorical}(s \mid \boldsymbol{\pi})$$
(5)
$$\boldsymbol{\lambda}^m = g^m(\boldsymbol{c}, \boldsymbol{u}; \boldsymbol{\theta}^m), \quad \text{for } m \in \mathcal{M}_n$$
(6)
$$p_{\boldsymbol{\theta}}(\boldsymbol{x}^m \mid \boldsymbol{c}, \boldsymbol{u}) = \begin{cases} \mathrm{Bernoulli}(\boldsymbol{x}^m \mid \boldsymbol{\lambda}^m), & \text{if } m = \mathrm{ATAC} \\ \mathrm{Poisson}(\boldsymbol{x}^m \mid \boldsymbol{\lambda}^m), & \text{if } m \in \{\mathrm{RNA}, \mathrm{ADT}\} \end{cases} \quad \text{for } m \in \mathcal{M}_n$$
(7)
where the priors $p(\boldsymbol{c})$ and $p(\boldsymbol{u})$ are set as standard Gaussians. The likelihood $p_{\boldsymbol{\theta}}(s \mid \boldsymbol{u})$ is set as a categorical distribution with probability vector $\boldsymbol{\pi} \in \Delta^{B-1}$ generated through a batch ID decoder $g^s$, which is a neural network with learnable parameters $\boldsymbol{\theta}^s$. The likelihood $p_{\boldsymbol{\theta}}(\boldsymbol{x}^m \mid \boldsymbol{c}, \boldsymbol{u})$ is set as a Bernoulli distribution with mean $\boldsymbol{\lambda}^m \in [0, 1]^{D_n^m}$ when $m = \mathrm{ATAC}$ and as a Poisson distribution with mean $\boldsymbol{\lambda}^m \in \mathbb{R}_+^{D_n^m}$ when $m \in \{\mathrm{RNA}, \mathrm{ADT}\}$, where $\boldsymbol{\lambda}^m$ is generated through a modality decoder neural network $g^m$ parameterized by $\boldsymbol{\theta}^m$. To mitigate overfitting and improve generalization, we share the parameters of the first few layers of the different modality decoders $\{g^m\}_{m \in \mathcal{M}}$ (the gray parts of the decoders in Fig. 1b, middle). The corresponding graphical model is shown in Fig. 1b (left).
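As a concrete illustration, the sketch below evaluates the likelihoods of Eqs. (5) and (7) with torch.distributions; the BatchDecoder module, its layer sizes and the helper names are our own simplifications rather than the exact MIDAS implementation.

```python
import torch
import torch.nn as nn
import torch.distributions as D

# Batch ID decoder g^s producing pi (Eq. (4)) and the categorical likelihood
# of Eq. (5); layer sizes are illustrative, not the MIDAS architecture.
class BatchDecoder(nn.Module):
    def __init__(self, d_u: int, n_batches: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d_u, 64), nn.Mish(), nn.Linear(64, n_batches))

    def log_prob(self, u: torch.Tensor, s: torch.Tensor) -> torch.Tensor:
        pi_logits = self.net(u)                              # g^s(u; theta^s)
        return D.Categorical(logits=pi_logits).log_prob(s)   # log p(s | u)

# Modality likelihoods of Eq. (7): Bernoulli for binarized ATAC, Poisson for
# RNA and ADT counts, given the decoded mean lambda^m = g^m(c, u; theta^m).
def modality_log_prob(x: torch.Tensor, lam: torch.Tensor, modality: str) -> torch.Tensor:
    if modality == "ATAC":
        return D.Bernoulli(probs=lam).log_prob(x).sum(-1)
    return D.Poisson(rate=lam).log_prob(x).sum(-1)
```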
Given the observed data $\{\boldsymbol{x}_n, s_n\}_{n \in \mathcal{N}}$, we aim to fit the model parameters $\boldsymbol{\theta} = \{\boldsymbol{\theta}^s, \{\boldsymbol{\theta}^m\}_{m \in \mathcal{M}}\}$ and meanwhile infer the posteriors of the latent variables $\{\boldsymbol{c}, \boldsymbol{u}\}$ for each cell. This can be achieved with stochastic gradient variational Bayes41, which maximizes the expected evidence lower bound (ELBO) for individual data points. The ELBO for each individual data point $\{\boldsymbol{x}_n, s_n\}$ can be written as
$$\begin{aligned} \mathrm{ELBO}(\boldsymbol{\theta}, \boldsymbol{\phi}; \boldsymbol{x}_n, s_n) &\triangleq \mathbb{E}_{q_{\boldsymbol{\phi}}(\boldsymbol{c}, \boldsymbol{u} \mid \boldsymbol{x}_n, s_n)}\left[\log \frac{p_{\boldsymbol{\theta}}(\boldsymbol{x}_n, s_n, \boldsymbol{c}, \boldsymbol{u})}{q_{\boldsymbol{\phi}}(\boldsymbol{c}, \boldsymbol{u} \mid \boldsymbol{x}_n, s_n)}\right] \\ &= \mathbb{E}_{q_{\boldsymbol{\phi}}(\boldsymbol{c}, \boldsymbol{u} \mid \boldsymbol{x}_n, s_n)}\left[\log p_{\boldsymbol{\theta}}(\boldsymbol{x}_n, s_n \mid \boldsymbol{c}, \boldsymbol{u})\right] - \mathrm{KL}\left[q_{\boldsymbol{\phi}}(\boldsymbol{c}, \boldsymbol{u} \mid \boldsymbol{x}_n, s_n) \parallel p(\boldsymbol{c}, \boldsymbol{u})\right] \\ &= \mathbb{E}_{q_{\boldsymbol{\phi}}(\boldsymbol{c}, \boldsymbol{u} \mid \boldsymbol{x}_n, s_n)}\left[\log p_{\boldsymbol{\theta}}(s_n \mid \boldsymbol{u}) + \sum_{m \in \mathcal{M}_n} \log p_{\boldsymbol{\theta}}(\boldsymbol{x}_n^m \mid \boldsymbol{c}, \boldsymbol{u})\right] - \mathrm{KL}\left[q_{\boldsymbol{\phi}}(\boldsymbol{c}, \boldsymbol{u} \mid \boldsymbol{x}_n, s_n) \parallel p(\boldsymbol{c}, \boldsymbol{u})\right] \end{aligned}$$
(8)
where qϕ(c, u|xn, sn), with learnable parameters ϕ, is the variational approximation of the true posterior p(c, u|xn, sn) and is typically implemented by neural networks, and KL( ⋅ ∥ ⋅ ) is the Kullback–Leibler divergence between two distributions.
Scalable variational inference via the product of experts
Let $M = |\mathcal{M}|$ be the total number of modalities. Because there are $(2^M - 1)$ possible modality combinations for the count data $\boldsymbol{x}_n = \{\boldsymbol{x}_n^m\}_{m \in \mathcal{M}_n \subseteq \mathcal{M}}$, naively implementing $q_{\boldsymbol{\phi}}(\boldsymbol{c}, \boldsymbol{u} \mid \boldsymbol{x}_n, s_n)$ in Eq. (8) requires $(2^M - 1)$ different neural networks to handle the different cases of input $(\boldsymbol{x}_n, s_n)$, making inference unscalable. Let $\boldsymbol{z} = \{\boldsymbol{c}, \boldsymbol{u}\}$. Inspired by Wu and Goodman57, who use the product of experts to implement variational inference in a combinatorial way, we factorize the posterior $p(\boldsymbol{z} \mid \boldsymbol{x}_n, s_n)$ and define its variational approximation $q_{\boldsymbol{\phi}}(\boldsymbol{z} \mid \boldsymbol{x}_n, s_n)$ as follows:
$$\begin{aligned} p(\boldsymbol{z} \mid \boldsymbol{x}_n, s_n) &= \frac{p(s_n)}{p(\boldsymbol{x}_n, s_n)} \left( \prod_{m \in \mathcal{M}_n} p(\boldsymbol{x}_n^m) \right) p(\boldsymbol{z})\, \frac{p(\boldsymbol{z} \mid s_n)}{p(\boldsymbol{z})} \prod_{m \in \mathcal{M}_n} \frac{p(\boldsymbol{z} \mid \boldsymbol{x}_n^m)}{p(\boldsymbol{z})} \\ &\approx \frac{p(s_n)}{p(\boldsymbol{x}_n, s_n)} \left( \prod_{m \in \mathcal{M}_n} p(\boldsymbol{x}_n^m) \right) p(\boldsymbol{z})\, \frac{q_{\boldsymbol{\phi}}(\boldsymbol{z} \mid s_n)}{p(\boldsymbol{z})} \prod_{m \in \mathcal{M}_n} \frac{q_{\boldsymbol{\phi}}(\boldsymbol{z} \mid \boldsymbol{x}_n^m)}{p(\boldsymbol{z})} \\ &\triangleq q_{\boldsymbol{\phi}}(\boldsymbol{z} \mid \boldsymbol{x}_n, s_n) \end{aligned}$$
(9)
where $q_{\boldsymbol{\phi}}(\boldsymbol{z} \mid s_n)$ and $q_{\boldsymbol{\phi}}(\boldsymbol{z} \mid \boldsymbol{x}_n^m)$ are the variational approximations of the true posteriors $p(\boldsymbol{z} \mid s_n)$ and $p(\boldsymbol{z} \mid \boldsymbol{x}_n^m)$, respectively (see Supplementary Note 3 for the detailed derivation). Let $\widetilde{q}_{\boldsymbol{\phi}}(\boldsymbol{z} \mid s_n) = \frac{1}{C^s} \frac{q_{\boldsymbol{\phi}}(\boldsymbol{z} \mid s_n)}{p(\boldsymbol{z})}$ and $\widetilde{q}_{\boldsymbol{\phi}}(\boldsymbol{z} \mid \boldsymbol{x}_n^m) = \frac{1}{C^m} \frac{q_{\boldsymbol{\phi}}(\boldsymbol{z} \mid \boldsymbol{x}_n^m)}{p(\boldsymbol{z})}$ be the normalized quotients of distributions with normalizing constants $C^s$ and $C^m$, respectively. From Eq. (9), we further acquire
$$\begin{aligned} q_{\boldsymbol{\phi}}(\boldsymbol{z} \mid \boldsymbol{x}_n, s_n) &\propto p(\boldsymbol{z})\, \frac{q_{\boldsymbol{\phi}}(\boldsymbol{z} \mid s_n)}{p(\boldsymbol{z})} \prod_{m \in \mathcal{M}_n} \frac{q_{\boldsymbol{\phi}}(\boldsymbol{z} \mid \boldsymbol{x}_n^m)}{p(\boldsymbol{z})} \\ &= p(\boldsymbol{z})\, C^s\, \widetilde{q}_{\boldsymbol{\phi}}(\boldsymbol{z} \mid s_n) \prod_{m \in \mathcal{M}_n} C^m\, \widetilde{q}_{\boldsymbol{\phi}}(\boldsymbol{z} \mid \boldsymbol{x}_n^m) \\ &\propto p(\boldsymbol{z})\, \widetilde{q}_{\boldsymbol{\phi}}(\boldsymbol{z} \mid s_n) \prod_{m \in \mathcal{M}_n} \widetilde{q}_{\boldsymbol{\phi}}(\boldsymbol{z} \mid \boldsymbol{x}_n^m) \end{aligned}$$
(10)
where we set $q_{\boldsymbol{\phi}}(\boldsymbol{z} \mid s_n)$ and $q_{\boldsymbol{\phi}}(\boldsymbol{z} \mid \boldsymbol{x}_n^m)$ to be diagonal Gaussians, which results in $\widetilde{q}_{\boldsymbol{\phi}}(\boldsymbol{z} \mid s_n)$ and $\widetilde{q}_{\boldsymbol{\phi}}(\boldsymbol{z} \mid \boldsymbol{x}_n^m)$ also being diagonal Gaussians, defined as
$$\left( \boldsymbol{\mu}_n^s, \boldsymbol{\nu}_n^s \right) = f^s(s_n; \boldsymbol{\phi}^s)$$
(11)
$$\widetilde{q}_{\boldsymbol{\phi}}(\boldsymbol{z} \mid s_n) = \mathrm{Normal}\left[ \boldsymbol{z} \mid \boldsymbol{\mu}_n^s, \mathrm{diag}\left( \boldsymbol{\nu}_n^s \right) \right]$$
(12)
$$\left( \boldsymbol{\mu}_n^m, \boldsymbol{\nu}_n^m \right) = f^m(\boldsymbol{x}_n^m; \boldsymbol{\phi}^m), \quad \text{for } m \in \mathcal{M}_n$$
(13)
$$\widetilde{q}_{\boldsymbol{\phi}}(\boldsymbol{z} \mid \boldsymbol{x}_n^m) = \mathrm{Normal}\left[ \boldsymbol{z} \mid \boldsymbol{\mu}_n^m, \mathrm{diag}\left( \boldsymbol{\nu}_n^m \right) \right], \quad \text{for } m \in \mathcal{M}_n$$
(14)
where $f^s$, with parameters $\boldsymbol{\phi}^s$, is the batch ID encoder neural network generating the mean $\boldsymbol{\mu}_n^s$ and variance $\boldsymbol{\nu}_n^s$ of $\widetilde{q}_{\boldsymbol{\phi}}(\boldsymbol{z} \mid s_n)$, and $f^m$, with parameters $\boldsymbol{\phi}^m$, is the modality encoder neural network generating the mean $\boldsymbol{\mu}_n^m$ and variance $\boldsymbol{\nu}_n^m$ of $\widetilde{q}_{\boldsymbol{\phi}}(\boldsymbol{z} \mid \boldsymbol{x}_n^m)$. The operator $\mathrm{diag}(\cdot)$ converts a vector into a diagonal matrix.
In Eq. (10), because $q_{\boldsymbol{\phi}}(\boldsymbol{z} \mid \boldsymbol{x}_n, s_n)$ is proportional to a product of individual Gaussians (or 'experts'), it is itself a Gaussian, whose mean $\boldsymbol{\mu}_n$ and variance $\boldsymbol{\nu}_n$ can be calculated from those of the individual Gaussians:
$$\begin{aligned} \boldsymbol{\mu}_n &= \left( \frac{\boldsymbol{\mu}_n^s}{\boldsymbol{\nu}_n^s} + \sum_{m \in \mathcal{M}_n} \frac{\boldsymbol{\mu}_n^m}{\boldsymbol{\nu}_n^m} \right) \odot \boldsymbol{\nu}_n \\ \boldsymbol{\nu}_n &= \left( 1 + \frac{1}{\boldsymbol{\nu}_n^s} + \sum_{m \in \mathcal{M}_n} \frac{1}{\boldsymbol{\nu}_n^m} \right)^{-1} \end{aligned}$$
(15)
where ⊙ is the Hadamard product.
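The fusion of Eq. (15) reduces to a few tensor operations; below is a minimal PyTorch sketch, assuming the experts' means and variances have already been produced by the batch ID and modality encoders (the function name is ours).

```python
import torch

def product_of_experts(mu_list, var_list):
    """Fuse diagonal-Gaussian experts via Eq. (15). `mu_list`/`var_list` hold
    the means and variances of the batch ID expert and the modality experts,
    each of shape (batch, d_z); the standard-normal prior expert contributes
    the leading 1 to the precision and nothing to the mean numerator."""
    precision = 1.0 + sum(1.0 / v for v in var_list)
    var = 1.0 / precision                                      # nu_n
    mu = var * sum(m / v for m, v in zip(mu_list, var_list))   # mu_n
    return mu, var
```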
In this way, $q_{\boldsymbol{\phi}}(\boldsymbol{z} \mid \boldsymbol{x}_n, s_n)$ is modularized into $(M + 1)$ neural networks that together handle the $(2^M - 1)$ different modality combinations, increasing the model's scalability. Similar to the modality decoders, we also share the parameters of the last few layers of the different modality encoders $\{f^m\}_{m \in \mathcal{M}}$ (the gray parts of the encoders in Fig. 1b, middle) to improve generalization.
Handling missing features via padding and masking
For each modality, different cells can have different feature sets (for example, genes for the RNA modality), making it hard to use a fixed-size neural network to handle these cells. To remedy this, we first convert $\boldsymbol{x}_n^m$ of variable size into a fixed-size vector for inference. For modality $m$, let $\mathcal{F}_n^m$ be the features of cell $n$ and let $\mathcal{F}^m = \bigcup_{n \in \mathcal{N}} \mathcal{F}_n^m$ be the feature union over all cells. The missing features of cell $n$ can then be defined as $\overline{\mathcal{F}}_n^m = \mathcal{F}^m \setminus \mathcal{F}_n^m$. We pad $\boldsymbol{x}_n^m$ of size $D_n^m$ with zeros corresponding to its missing features $\overline{\mathcal{F}}_n^m$ through a zero-padding function $h$,
$$\widetilde{\boldsymbol{x}}_n^m = h(\boldsymbol{x}_n^m)$$
(16)
where $\widetilde{\boldsymbol{x}}_n^m$ is the zero-padded count vector of constant size $D^m = |\mathcal{F}^m|$. The modality encoding process is thus decomposed as
$$\begin{aligned} \left( \boldsymbol{\mu}_n^m, \boldsymbol{\nu}_n^m \right) &= f^m(\boldsymbol{x}_n^m; \boldsymbol{\phi}^m) \\ &= \widehat{f}^m\left( h(\boldsymbol{x}_n^m); \boldsymbol{\phi}^m \right) \\ &= \widehat{f}^m\left( \widetilde{\boldsymbol{x}}_n^m; \boldsymbol{\phi}^m \right) \end{aligned}$$
(17)
where $\widehat{f}^m$ is the latter part of the modality encoder, which handles a fixed-size input $\widetilde{\boldsymbol{x}}_n^m$. However, given the sampled latent variables $\{\boldsymbol{c}_n, \boldsymbol{u}_n\}$, to calculate the likelihood $p_{\boldsymbol{\theta}}(\boldsymbol{x}_n^m \mid \boldsymbol{c}_n, \boldsymbol{u}_n)$, we also need to generate a mean $\boldsymbol{\lambda}_n^m$ of variable size for $\boldsymbol{x}_n^m$. To achieve this, we decompose the modality decoding process as
$$\begin{aligned} \boldsymbol{\lambda}_n^m &= g^m(\boldsymbol{c}_n, \boldsymbol{u}_n; \boldsymbol{\theta}^m) \\ &= h^{-1}\left[ \widehat{g}^m(\boldsymbol{c}_n, \boldsymbol{u}_n; \boldsymbol{\theta}^m) \right] \\ &= h^{-1}\left( \widetilde{\boldsymbol{\lambda}}_n^m \right) \end{aligned}$$
(18)
where $\widehat{g}^m$ is the front part of the modality decoder, which generates the mean $\widetilde{\boldsymbol{\lambda}}_n^m$ of fixed size $D^m$, and $h^{-1}$ (the inverse function of $h$) is the masking function that removes the padded missing features $\overline{\mathcal{F}}_n^m$ from $\widetilde{\boldsymbol{\lambda}}_n^m$ to generate $\boldsymbol{\lambda}_n^m$. Note that $\widetilde{\boldsymbol{\lambda}}_n^m$ can also be taken as the imputed values for downstream analyses (see 'Imputation for missing modalities and features' and 'Batch correction via latent variable manipulation').
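A minimal NumPy sketch of $h$ and $h^{-1}$, assuming each cell's measured features are represented by their integer positions in the feature union (an interface of our choosing):

```python
import numpy as np

def pad(x_nm: np.ndarray, feat_idx: np.ndarray, d_m: int) -> np.ndarray:
    """h (Eq. (16)): scatter the observed counts into a zero vector of size D^m."""
    x_tilde = np.zeros(d_m, dtype=x_nm.dtype)
    x_tilde[feat_idx] = x_nm
    return x_tilde

def mask(lam_tilde: np.ndarray, feat_idx: np.ndarray) -> np.ndarray:
    """h^{-1} (Eq. (18)): keep only the cell's measured features of the decoded mean."""
    return lam_tilde[feat_idx]
```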
Self-supervised modality alignment
To achieve cross-modal inference in downstream tasks, we resort to aligning different modalities in the latent space. Leveraging self-supervised learning, we first use each cell's multimodal observation $\{\{\boldsymbol{x}_n^m\}_{m \in \mathcal{M}_n}, s_n\}$ to construct unimodal observations $\{\{\boldsymbol{x}_n^m, s_n\}\}_{m \in \mathcal{M}_n}$, each of which is associated with the latent variables $\boldsymbol{z}^m = \{\boldsymbol{c}^m, \boldsymbol{u}^m\}$. We then construct a pretext task, which enforces modality alignment by regularizing on the joint space of unimodal variational posteriors with the dispersion of latent variables as a penalty (Fig. 1b, top right), corresponding to a modality alignment loss
$$\begin{aligned} l^{\mathrm{mod}}(\boldsymbol{\phi}; \boldsymbol{x}_n, s_n) &\triangleq \alpha \int v(\widetilde{\boldsymbol{z}})\, q_{\boldsymbol{\phi}}(\widetilde{\boldsymbol{z}} \mid \boldsymbol{x}_n, s_n)\, d\widetilde{\boldsymbol{z}} \\ &= \alpha\, \mathbb{E}_{q_{\boldsymbol{\phi}}(\widetilde{\boldsymbol{z}} \mid \boldsymbol{x}_n, s_n)}\, v(\widetilde{\boldsymbol{z}}) \end{aligned}$$
(19)
where $\alpha > 0$ is the loss weight, $\widetilde{\boldsymbol{z}} = \{\boldsymbol{z}^m\}_{m \in \mathcal{M}_n}$ is the set of latent variables, and $q_{\boldsymbol{\phi}}(\widetilde{\boldsymbol{z}} \mid \boldsymbol{x}_n, s_n)$ represents the joint distribution of unimodal variational posteriors because
$$\begin{aligned} q_{\boldsymbol{\phi}}(\widetilde{\boldsymbol{z}} \mid \boldsymbol{x}_n, s_n) &= q_{\boldsymbol{\phi}}(\{\boldsymbol{z}^m\}_{m \in \mathcal{M}_n} \mid \boldsymbol{x}_n, s_n) \\ &= \prod_{m \in \mathcal{M}_n} q_{\boldsymbol{\phi}}(\boldsymbol{z}^m \mid \boldsymbol{x}_n, s_n) \\ &= \prod_{m \in \mathcal{M}_n} q_{\boldsymbol{\phi}}(\boldsymbol{z}^m \mid \boldsymbol{x}_n^m, s_n) \end{aligned}$$
(20)
In Eq. (19), $v(\widetilde{\boldsymbol{z}})$ is the sum of squared deviations, which measures the dispersion among the different elements in $\widetilde{\boldsymbol{z}}$ and is used to regularize $q_{\boldsymbol{\phi}}(\widetilde{\boldsymbol{z}} \mid \boldsymbol{x}_n, s_n)$; it is defined as
$$v(\widetilde{\boldsymbol{z}}) \triangleq \sum_{m \in \mathcal{M}_n} \| \boldsymbol{z}^m - \bar{\boldsymbol{z}} \|_2^2$$
(21)
where $\bar{\boldsymbol{z}} = \frac{1}{|\mathcal{M}_n|} \sum_{m \in \mathcal{M}_n} \boldsymbol{z}^m$ is the mean of the unimodal latent variables, and $\| \cdot \|_2$ is the Euclidean norm.
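For illustration, a one-sample Monte Carlo estimate of this loss can be written as the following PyTorch sketch; the function name and interface are ours, and the default α = 50 echoes the training setting listed under 'Implementation of MIDAS'.

```python
import torch

def alignment_loss(z_list, alpha: float = 50.0):
    """One-sample estimate of the modality alignment loss of Eq. (19) using
    the dispersion v(z~) of Eq. (21); z_list holds one sampled z^m per
    measured modality, each of shape (batch, d_z)."""
    z = torch.stack(z_list)                  # (n_modalities, batch, d_z)
    z_bar = z.mean(dim=0, keepdim=True)      # mean over modalities
    v = ((z - z_bar) ** 2).sum(dim=(0, 2))   # sum of squared deviations per cell
    return alpha * v.mean()
```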
Note that the computation of $q_{\boldsymbol{\phi}}(\boldsymbol{z}^m \mid \boldsymbol{x}_n^m, s_n)$ in Eq. (20) is efficient. Because $q_{\boldsymbol{\phi}}(\boldsymbol{z}^m \mid \boldsymbol{x}_n^m, s_n) = q_{\boldsymbol{\phi}}(\boldsymbol{z} \mid \boldsymbol{x}_n^m, s_n) \big|_{\boldsymbol{z} = \boldsymbol{z}^m}$, according to Eq. (10), we have
$$q_{\boldsymbol{\phi}}(\boldsymbol{z} \mid \boldsymbol{x}_n^m, s_n) \propto p(\boldsymbol{z})\, \widetilde{q}_{\boldsymbol{\phi}}(\boldsymbol{z} \mid s_n)\, \widetilde{q}_{\boldsymbol{\phi}}(\boldsymbol{z} \mid \boldsymbol{x}_n^m)$$
(22)
As the mean and covariance of each Gaussian term on the right-hand side of Eq. (22) were already obtained when inferring $q_{\boldsymbol{\phi}}(\boldsymbol{z} \mid \boldsymbol{x}_n, s_n)$ (Eq. (10)), the mean and covariance of $q_{\boldsymbol{\phi}}(\boldsymbol{z} \mid \boldsymbol{x}_n^m, s_n)$ can be directly calculated using Eq. (15), avoiding the need to pass each constructed unimodal observation through the encoders.
Information-theoretic disentanglement of latent variables
To better disentangle the biological state $\boldsymbol{c}$ and the technical noise $\boldsymbol{u}$, we adopt an information-theoretic approach, the information bottleneck (IB)34, to control the information flow during inference. We define two types of IB: the technical IB prevents batch-specific information from being encoded into $\boldsymbol{c}$ by minimizing the mutual information (MI) between $s$ and $\boldsymbol{c}$, and the biological IB prevents biological information from being encoded into $\boldsymbol{u}$ by minimizing the MI between $\boldsymbol{x}$ and $\boldsymbol{u}$ (Fig. 1b, bottom right). Let $\mathrm{I}(\cdot, \cdot)$ denote the MI between two variables. We implement both IBs by minimizing the weighted sum of $\mathrm{I}(s, \boldsymbol{c})$ and $\mathrm{I}(\boldsymbol{x}, \boldsymbol{u})$,
$$\begin{aligned} \beta^s \mathrm{I}(s, \boldsymbol{c}) + \beta^x \mathrm{I}(\boldsymbol{x}, \boldsymbol{u}) &= \beta^s \mathbb{E}_{p(s, \boldsymbol{c})}\left[ \log \frac{p(s, \boldsymbol{c})}{p(s)p(\boldsymbol{c})} \right] + \beta^x \mathbb{E}_{p(\boldsymbol{x}, \boldsymbol{u})}\left[ \log \frac{p(\boldsymbol{x}, \boldsymbol{u})}{p(\boldsymbol{x})p(\boldsymbol{u})} \right] \\ &\approx \frac{1}{N} \sum_n \left[ \beta^s \mathbb{E}_{q_{\boldsymbol{\phi}}(\boldsymbol{c} \mid \boldsymbol{x}_n, s_n)}\left[ \log p_{\widehat{\boldsymbol{\eta}}}(s_n \mid \boldsymbol{c}) \right] + \beta^x \mathrm{KL}\left[ q_{\boldsymbol{\phi}}(\boldsymbol{u} \mid \boldsymbol{x}_n, s_n) \parallel p(\boldsymbol{u}) \right] \right. \\ &\quad \left. - \beta^x \mathbb{E}_{q_{\boldsymbol{\phi}}(\boldsymbol{u} \mid \boldsymbol{x}_n, s_n)}\left[ \log p_{\boldsymbol{\theta}}(s_n \mid \boldsymbol{u}) \right] \right] + \mathrm{const.} \end{aligned}$$
(23)
where $\beta^s, \beta^x > 0$ are the weights and $p_{\widehat{\boldsymbol{\eta}}}(s \mid \boldsymbol{c})$ is a learned likelihood with parameters $\widehat{\boldsymbol{\eta}}$ (see Supplementary Note 4 for the detailed derivation). Minimizing $\beta^s \mathrm{I}(s, \boldsymbol{c}) + \beta^x \mathrm{I}(\boldsymbol{x}, \boldsymbol{u})$ is thus approximately equal to minimizing the IB loss $\widehat{l}^{\mathrm{IB}}$ with respect to $\boldsymbol{\phi}$ for all cells,
$$\begin{aligned} \widehat{l}^{\mathrm{IB}}(\boldsymbol{\phi}; \boldsymbol{x}_n, s_n, \widehat{\boldsymbol{\eta}}) &\triangleq \beta^s \mathbb{E}_{q_{\boldsymbol{\phi}}(\boldsymbol{c} \mid \boldsymbol{x}_n, s_n)}\left[ \log p_{\widehat{\boldsymbol{\eta}}}(s_n \mid \boldsymbol{c}) \right] + \beta^x \mathrm{KL}\left[ q_{\boldsymbol{\phi}}(\boldsymbol{u} \mid \boldsymbol{x}_n, s_n) \parallel p(\boldsymbol{u}) \right] \\ &\quad - \beta^x \mathbb{E}_{q_{\boldsymbol{\phi}}(\boldsymbol{u} \mid \boldsymbol{x}_n, s_n)}\left[ \log p_{\boldsymbol{\theta}}(s_n \mid \boldsymbol{u}) \right] \end{aligned}$$
(24)
We model $p_{\widehat{\boldsymbol{\eta}}}(s \mid \boldsymbol{c})$ as $p_{\widehat{\boldsymbol{\eta}}}(s \mid \boldsymbol{c}) = \mathrm{Categorical}(s \mid \boldsymbol{\kappa})$, where $\boldsymbol{\kappa} = r(\boldsymbol{c}; \widehat{\boldsymbol{\eta}})$ is the probability vector and $r$ is a classifier neural network parameterized by $\widehat{\boldsymbol{\eta}}$. To learn the classifier, we minimize the following expected negative log-likelihood with respect to $\widehat{\boldsymbol{\eta}}$,
$$\begin{aligned} \mathbb{E}_{p(\boldsymbol{c}, s)}\left[ -\log p_{\widehat{\boldsymbol{\eta}}}(s \mid \boldsymbol{c}) \right] &= \mathbb{E}_{p(\boldsymbol{x}, s)} \mathbb{E}_{p(\boldsymbol{c} \mid \boldsymbol{x}, s)}\left[ -\log p_{\widehat{\boldsymbol{\eta}}}(s \mid \boldsymbol{c}) \right] \\ &\approx \frac{1}{N} \sum_n \mathbb{E}_{p(\boldsymbol{c} \mid \boldsymbol{x}_n, s_n)}\left[ -\log p_{\widehat{\boldsymbol{\eta}}}(s_n \mid \boldsymbol{c}) \right] \\ &\approx \frac{1}{N} \sum_n \mathbb{E}_{q_{\boldsymbol{\phi}}(\boldsymbol{c} \mid \boldsymbol{x}_n, s_n)}\left[ -\log p_{\widehat{\boldsymbol{\eta}}}(s_n \mid \boldsymbol{c}) \right] \end{aligned}$$
(25)
from which we can define a classifier loss
$$\widehat{l}^r(\widehat{\boldsymbol{\eta}}; \boldsymbol{x}_n, s_n, \boldsymbol{\phi}) \triangleq \mathbb{E}_{q_{\boldsymbol{\phi}}(\boldsymbol{c} \mid \boldsymbol{x}_n, s_n)}\left[ -\log p_{\widehat{\boldsymbol{\eta}}}(s_n \mid \boldsymbol{c}) \right]$$
(26)
To further enhance latent disentanglement for cross-modal inference, we also apply IBs to the data generated from our self-supervised tasks; that is, we minimize $\beta^s \mathrm{I}(s, \boldsymbol{c}^m) + \beta^x \mathrm{I}(\boldsymbol{x}^m, \boldsymbol{u}^m)$ for each modality $m$. Similar to Eqs. (23) and (24), this can be achieved by minimizing $\widehat{l}^{\mathrm{IB}}(\boldsymbol{\phi}; \boldsymbol{x}_n^m, s_n, \boldsymbol{\eta}^m)$, where $\boldsymbol{\eta}^m$ denotes the parameters of the classifier neural network $r^m$ that generates the probability vector $\boldsymbol{\kappa}^m = r^m(\boldsymbol{c}^m; \boldsymbol{\eta}^m)$ for the likelihood $p_{\boldsymbol{\eta}^m}(s \mid \boldsymbol{c}^m) = \mathrm{Categorical}(s \mid \boldsymbol{\kappa}^m)$. Together with the IB loss of Eq. (24), the total IB loss is defined as
$$l^{\mathrm{IB}}(\boldsymbol{\phi}; \boldsymbol{x}_n, s_n, \boldsymbol{\eta}) \triangleq \widehat{l}^{\mathrm{IB}}(\boldsymbol{\phi}; \boldsymbol{x}_n, s_n, \widehat{\boldsymbol{\eta}}) + \sum_{m \in \mathcal{M}_n} \widehat{l}^{\mathrm{IB}}(\boldsymbol{\phi}; \boldsymbol{x}_n^m, s_n, \boldsymbol{\eta}^m)$$
(27)
where $\boldsymbol{\eta} = \{\widehat{\boldsymbol{\eta}}, \{\boldsymbol{\eta}^m\}_{m \in \mathcal{M}}\}$. To learn $p_{\boldsymbol{\eta}^m}(s \mid \boldsymbol{c}^m)$, we can also minimize $\mathbb{E}_{p(\boldsymbol{c}^m, s)}\left[ -\log p_{\boldsymbol{\eta}^m}(s \mid \boldsymbol{c}^m) \right]$, which corresponds to minimizing $\widehat{l}^r(\boldsymbol{\eta}^m; \boldsymbol{x}_n^m, s_n, \boldsymbol{\phi})$ according to Eqs. (25) and (26). With the classifier loss of Eq. (26), we define the total classifier loss as
$$l^r(\boldsymbol{\eta}; \boldsymbol{x}_n, s_n, \boldsymbol{\phi}) = \widehat{l}^r(\widehat{\boldsymbol{\eta}}; \boldsymbol{x}_n, s_n, \boldsymbol{\phi}) + \sum_{m \in \mathcal{M}_n} \widehat{l}^r(\boldsymbol{\eta}^m; \boldsymbol{x}_n^m, s_n, \boldsymbol{\phi})$$
(28)
Training MIDAS
To train the encoders and decoders of MIDAS, considering the training objectives defined in Eqs. (8), (19) and (27), we minimize the following objective with respect to $\{\boldsymbol{\theta}, \boldsymbol{\phi}\}$ for all observations $\{\boldsymbol{x}_n, s_n\}_{n \in \mathcal{N}}$:
$$l^{f,g}(\boldsymbol{\theta}, \boldsymbol{\phi}; \boldsymbol{x}_n, s_n, \boldsymbol{\eta}) = l^{\mathrm{ELBO}}(\boldsymbol{\theta}, \boldsymbol{\phi}; \boldsymbol{x}_n, s_n) + l^{\mathrm{mod}}(\boldsymbol{\phi}; \boldsymbol{x}_n, s_n) + l^{\mathrm{IB}}(\boldsymbol{\phi}; \boldsymbol{x}_n, s_n, \boldsymbol{\eta})$$
(29)
Here, the loss $l^{\mathrm{ELBO}}$ is defined based on the negative of the ELBO of Eq. (8), that is,
$$\begin{aligned} l^{\mathrm{ELBO}}(\boldsymbol{\theta}, \boldsymbol{\phi}; \boldsymbol{x}_n, s_n) &\triangleq -\mathbb{E}_{q_{\boldsymbol{\phi}}(\boldsymbol{c}, \boldsymbol{u} \mid \boldsymbol{x}_n, s_n)}\left[ \gamma \log p_{\boldsymbol{\theta}}(s_n \mid \boldsymbol{u}) + \sum_{m \in \mathcal{M}_n} \log p_{\boldsymbol{\theta}}(\boldsymbol{x}_n^m \mid \boldsymbol{c}, \boldsymbol{u}) \right] \\ &\quad + \mathrm{KL}\left[ q_{\boldsymbol{\phi}}(\boldsymbol{c}, \boldsymbol{u} \mid \boldsymbol{x}_n, s_n) \parallel p(\boldsymbol{c}, \boldsymbol{u}) \right] \end{aligned}$$
(30)
where $\gamma \geq 1$ is an additional weight that can be set to a higher value to encourage $\boldsymbol{u}$ to encode more batch-specific information. In Eq. (29), because the classifier parameters $\boldsymbol{\eta}$ are unknown and the learning of $\boldsymbol{\eta}$ depends on $\boldsymbol{\phi}$, as in Eq. (28), we iteratively minimize Eqs. (28) and (29) with stochastic gradient descent (SGD), forming the MIDAS training algorithm (Algorithm 1). To better guide the optimization of the IB loss in Eq. (29) for disentangling the latent variables, we increase the number of updates (that is, $K > 1$) of Eq. (28) for the classifier parameters $\boldsymbol{\eta}$ in each iteration, ensuring that these classifiers stay close to their optimal solutions.
Algorithm 1. The MIDAS training algorithm.
Input: A single-cell multimodal mosaic dataset $\{\boldsymbol{x}_n, s_n\}_{n \in \mathcal{N}}$
Output: Decoder parameters θ, encoder parameters ϕ and classifier parameters η
(1) Randomly initialize parameters {θ, ϕ, η}
(2) for iteration t = 1, 2, …, T do
(3) Sample a minibatch $\{\boldsymbol{x}_n, s_n\}_{n \in \mathcal{N}_t}$ from the dataset, where $\mathcal{N}_t \subset \mathcal{N}$
(4) for step k = 1, 2, …, K do
(5) Freeze ϕ and update η via SGD with loss $\frac{1}{|\mathcal{N}_t|} \sum_{n \in \mathcal{N}_t} l^r(\boldsymbol{\eta}; \boldsymbol{x}_n, s_n, \boldsymbol{\phi})$ ⊳ See Eq. (28)
(6) end for
(7) Freeze η and update {θ, ϕ} via SGD with loss $\frac{1}{|\mathcal{N}_t|} \sum_{n \in \mathcal{N}_t} l^{f,g}(\boldsymbol{\theta}, \boldsymbol{\phi}; \boldsymbol{x}_n, s_n, \boldsymbol{\eta})$ ⊳ See Eq. (29)
(8) end for
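The alternating scheme of Algorithm 1 can be rendered as the following toy, self-contained PyTorch loop; the two linear modules and both loss expressions are deliberate placeholders standing in for MIDAS's networks and Eqs. (28) and (29), not the actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
model = nn.Linear(8, 4)    # stands in for the encoders and decoders ({theta, phi})
clf = nn.Linear(4, 3)      # stands in for the batch classifiers (eta)
opt_model = torch.optim.AdamW(model.parameters(), lr=1e-4)
opt_clf = torch.optim.AdamW(clf.parameters(), lr=1e-4)
K = 3                      # classifier updates per iteration (steps (4)-(6))

for t in range(100):                      # iterations (step (2))
    x = torch.randn(256, 8)               # stands in for a sampled minibatch (step (3))
    s = torch.randint(0, 3, (256,))       # batch IDs
    for k in range(K):                    # update eta with phi frozen (step (5))
        loss_r = F.cross_entropy(clf(model(x).detach()), s)   # cf. Eq. (28)
        opt_clf.zero_grad(); loss_r.backward(); opt_clf.step()
    # Update {theta, phi} with eta frozen (step (7)); -cross_entropy mirrors
    # minimizing E[log p_eta(s | c)] in Eq. (24), and the quadratic term
    # stands in for the remaining losses of Eq. (29).
    loss_fg = model(x).pow(2).mean() - F.cross_entropy(clf(model(x)), s)
    opt_model.zero_grad(); loss_fg.backward(); opt_model.step()
```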
Mosaic integration on latent space
A key goal of single-cell mosaic integration is to extract biologically meaningful low-dimensional cell embeddings from the mosaic data for downstream analysis, with the technical variations removed. To achieve this, for each cell, we first use the trained MIDAS to infer the latent posterior $q_{\boldsymbol{\phi}}(\boldsymbol{c}, \boldsymbol{u} \mid \boldsymbol{x}_n, s_n)$ through Eq. (10), obtaining the mean $\boldsymbol{\mu}_n = \{\boldsymbol{\mu}_n^c, \boldsymbol{\mu}_n^u\}$ and variance $\boldsymbol{\nu}_n = \{\boldsymbol{\nu}_n^c, \boldsymbol{\nu}_n^u\}$. We then take the maximum a posteriori (MAP) estimate of $\{\boldsymbol{c}, \boldsymbol{u}\}$ as the integration result on the latent space, which is the mean $\boldsymbol{\mu}_n$ because $q_{\boldsymbol{\phi}}(\boldsymbol{c}, \boldsymbol{u} \mid \boldsymbol{x}_n, s_n)$ is Gaussian. Finally, we take $\boldsymbol{\mu}_n^c$, the MAP estimate of $\boldsymbol{c}$, as the cell embedding.
Imputation for missing modalities and features
Based on the MAP estimates $\{\boldsymbol{\mu}_n^c, \boldsymbol{\mu}_n^u\}$ inferred from the single-cell mosaic data (see 'Mosaic integration on latent space'), it is straightforward to impute missing modalities and features. We first pass $\{\boldsymbol{\mu}_n^c, \boldsymbol{\mu}_n^u\}$ to the decoders to generate the padded feature mean $\widetilde{\boldsymbol{\lambda}}_n^m$ for each modality $m \in \mathcal{M}$ via Eq. (18). We then sample from a Bernoulli distribution with mean $\widetilde{\boldsymbol{\lambda}}_n^{\mathrm{ATAC}}$ to generate the imputed ATAC counts and from two Poisson distributions with means $\widetilde{\boldsymbol{\lambda}}_n^{\mathrm{RNA}}$ and $\widetilde{\boldsymbol{\lambda}}_n^{\mathrm{ADT}}$ to generate the imputed RNA and ADT counts, respectively.
Batch correction via latent variable manipulation
Besides performing mosaic integration on the latent space (see 'Mosaic integration on latent space'), we can also perform it on the feature space; that is, we impute missing values and correct batch effects for the count data. Mosaic integration on the feature space is important because it is required by many downstream tasks, such as cell typing and trajectory inference.
With the latent variables' MAP estimates $\{\boldsymbol{\mu}_n^c, \boldsymbol{\mu}_n^u\}$, we can perform imputation and batch correction simultaneously by manipulating the technical noise. Concretely, let $\boldsymbol{c}_n = \boldsymbol{\mu}_n^c$ and $\boldsymbol{u}_n = \boldsymbol{\mu}_n^u$. We first calculate the mean of $\boldsymbol{u}_n$ within each batch $b \in \mathcal{B}$,
$$\bar{\boldsymbol{u}}_b = \frac{1}{|\mathcal{N}_b|} \sum_{n \in \mathcal{N}_b} \boldsymbol{u}_n$$
(31)
where $\mathcal{N}_b \subseteq \mathcal{N}$ is the set of cell IDs belonging to batch $b$. Next, we calculate the mean of $\bar{\boldsymbol{u}}_b$ over all batches,
$$\bar{\boldsymbol{u}} = \frac{1}{B} \sum_b \bar{\boldsymbol{u}}_b$$
(32)
We then look for the batch $b^*$ whose mean $\bar{\boldsymbol{u}}_{b^*}$ is closest to $\bar{\boldsymbol{u}}$ and treat $\bar{\boldsymbol{u}}_{b^*}$ as the 'standard' technical noise, where
$$b^* = \arg\min_b \| \bar{\boldsymbol{u}}_b - \bar{\boldsymbol{u}} \|_2$$
(33)
Finally, for each cell, we correct the batch effect by substituting $\bar{\boldsymbol{u}}_{b^*}$ for $\boldsymbol{u}_n$ and passing $\{\boldsymbol{c}_n, \bar{\boldsymbol{u}}_{b^*}\}$ to the decoders to generate the imputed and batch-corrected data (as in 'Imputation for missing modalities and features', but using $\{\boldsymbol{c}_n, \bar{\boldsymbol{u}}_{b^*}\}$ instead of $\{\boldsymbol{c}_n, \boldsymbol{u}_n\}$ to correct the batch effect).
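These three steps amount to a few lines of NumPy; the sketch below is a minimal rendering of Eqs. (31)-(33), with the function name and array interface being our own.

```python
import numpy as np

def standardize_noise(u: np.ndarray, batch_ids: np.ndarray) -> np.ndarray:
    """Pick the batch whose mean technical noise is closest to the global mean
    and substitute its mean for every cell's u_n; u is the (N, d_u) matrix of
    MAP noise estimates and batch_ids the (N,) batch labels."""
    batches = np.unique(batch_ids)
    u_bar_b = np.stack([u[batch_ids == b].mean(axis=0) for b in batches])  # Eq. (31)
    u_bar = u_bar_b.mean(axis=0)                                           # Eq. (32)
    b_star = int(np.argmin(np.linalg.norm(u_bar_b - u_bar, axis=1)))       # Eq. (33)
    return np.tile(u_bar_b[b_star], (u.shape[0], 1))  # 'standard' noise for all cells
```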
Model transfer via transfer learning
When MIDAS has been pretrained on a reference dataset, we can conduct model transfer to carry the model's learned knowledge over to a query dataset through transfer learning; that is, on the query dataset, we fine-tune the pretrained model instead of training a model from scratch. Because, compared to the reference dataset, the query dataset can contain different numbers of batches collected from different platforms, the batch ID-related modules need to be redefined. Thus, during transfer learning, we reparameterize and reinitialize the batch ID encoder and decoder $\{f^s, g^s\}$ and the batch classifiers $\{r, \{r^m\}_{m \in \mathcal{M}}\}$ and only fine-tune the modality encoders and decoders $\{f^m, g^m\}_{m \in \mathcal{M}}$.
A core advantage of our model transfer scheme is that it can flexibly transfer the knowledge of multimodal data to various types of query datasets, even to those with fewer modalities, improving the de novo integration of single-cell data.
Label transfer via reciprocal reference mapping and kNN-based cell annotation
While model transfer implicitly transfers knowledge through model parameters, label transfer explicitly transfers knowledge in the form of data labels. These labels can be different kinds of downstream analysis results, such as cell types, cell cycles or pseudotime. Through accurate label transfer, we can not only avoid the expensive de novo integration and downstream analysis but also improve label quality.
Reciprocal reference mapping
Typically, the first step of label transfer is reference mapping, which aligns the query cells with the reference cells so that labels can be transferred reliably. For MIDAS, we can naively achieve reference mapping in two ways: (1) mapping the query data onto the reference space (that is, applying the model pretrained on the reference data to infer the biological states for the query data50,51) and (2) mapping the reference data onto the query space (that is, applying the model fine-tuned on the query data (see 'Model transfer via transfer learning') to infer the biological states for the reference data14,52). However, the first way suffers from the 'generalization problem' because the pretrained model can hardly generalize to the query data, which usually contains unseen technical variations, whereas the second way suffers from the 'forgetting problem' because the fine-tuned model may lose information learned on the reference data, affecting the inferred biological states.
To tackle both problems, we propose a reciprocal reference mapping scheme, where we fine-tune the pretrained model on the query dataset to avoid the generalization problem and meanwhile feed the model with the historical data sampled from the reference dataset to prevent forgetting. In doing this, the model can find a mapping suitable for both reference and query datasets and can then align them on the latent space by inferring their biological states.
kNN-based cell annotation with novel cell-type identification
Based on the aligned latent representations (embeddings), the kNN classifier is used to transfer the reference labels to the query dataset. When the query and reference datasets belong to the same tissue (for example, PBMCs), we train the kNN classifier using the reference embeddings and labels and then use it to classify the query cells.
However, if the query and reference datasets are from distinct tissues (for example, BMMCs versus PBMCs), we might encounter new cell types in the query dataset that are not present in the reference dataset. To address this issue, we propose a strategy for novel cell-type detection. Specifically, we assign the label ‘query’ to all the query cells and use the cell embeddings and labels from both the query and reference datasets to train the kNN classifier. Subsequently, we use the classifier to predict the class probabilities for the query cells.
To detect new cell types, we use a thresholding approach on the predicted probabilities of the 'query' class; that is, we leverage a Gaussian mixture model with two components to group the probabilities into two distinct clusters. This clustering allows us to establish a suitable threshold for the probabilities. The cells of the cluster with the higher mean have higher probabilities of belonging to the 'query' class; we consider these cells unique to the query dataset and assign them the label 'unknown'. Conversely, the cells of the cluster with the lower mean have lower probabilities of belonging to the 'query' class; we consider these cells to belong to the types present in the reference dataset and assign each of them the label of the class with the highest predicted probability among all classes except the 'query' class.
The above kNN and Gaussian mixture model algorithms are implemented by the KNeighborsClassifier function (n_neighbors = 100 and weights = ‘uniform’) and the GaussianMixture function (n_components = 2 and tol = 10−4) from the scikit-learn58 (v1.2.2) Python package, respectively. Similar to model transfer (see ‘Model transfer via transfer learning’), in label transfer, knowledge can also be flexibly and accurately transferred to various types of query datasets.
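The procedure above maps onto the stated scikit-learn components as in the following sketch; the function name and array-based interface are our own.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.mixture import GaussianMixture

def annotate_query(ref_emb, ref_labels, query_emb):
    """kNN annotation with novel cell-type detection on aligned embeddings."""
    X = np.vstack([ref_emb, query_emb])
    y = np.concatenate([np.asarray(ref_labels, dtype=object),
                        np.full(len(query_emb), "query", dtype=object)])
    knn = KNeighborsClassifier(n_neighbors=100, weights="uniform").fit(X, y)
    proba = knn.predict_proba(query_emb)
    classes = knn.classes_
    p_query = proba[:, classes == "query"].ravel()
    # A two-component GMM splits the 'query' probabilities into two clusters;
    # the higher-mean cluster marks cells unique to the query dataset.
    gmm = GaussianMixture(n_components=2, tol=1e-4).fit(p_query[:, None])
    novel = gmm.predict(p_query[:, None]) == int(np.argmax(gmm.means_.ravel()))
    # Remaining cells get the best class other than 'query'.
    proba[:, classes == "query"] = -1.0
    labels = classes[np.argmax(proba, axis=1)].astype(object)
    labels[novel] = "unknown"
    return labels
```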
Modality contribution to the integrated clustering
We assess the contribution of different modalities to clustering by measuring the agreement between the single-modality clustering and the multimodal cell clustering. For each cell, the normalized consistency ratio between its nearest neighbors in the single-modality clustering and those in the multimodal clustering is used to represent the modality's contribution to the final integrated clustering.
Regulatory network inference from scRNA-seq datasets
The GRNBoost2 algorithm59 from the Arboreto (v0.1.5) Python package is used to infer the regulatory network from scRNA-seq datasets. GRNBoost2 provides weighted regulatory links between genes and transcription factors. The weights of links shared across different datasets are compared to indicate the regulatory network retention.
Correlation of expression fold change values between raw and batch-corrected data
For each cell type, expression fold change values of genes and proteins are calculated against all other cells using the FoldChange function in the Seurat (v4.3.0) R package. The Pearson correlation coefficient is used to measure linear correlations of fold change values between raw and batch-corrected data.
Generating Seurat cell-type labels
To generate cell-type labels for both qualitative and quantitative evaluation, we used the third-party tool Seurat to annotate cell types for different datasets through label transfer. We took the CITE-seq PBMC atlas from Hao et al.15 as the reference set and used the FindTransferAnchors and TransferData functions in Seurat to perform label transfer, where ‘cca’ was used as the reduction method for reference mapping. For cells without raw RNA expression, we first used ATAC data to create a gene activity matrix using the GeneActivity function in the Signac60 (v1.9.0) R package. The gene activity matrix was subsequently used for label transfer.
Evaluation metrics
To evaluate the performance of MIDAS and the state-of-the-art tools on multimodal integration, we use metrics from scIB for batch correction and biological conservation and also propose our own metrics for modality alignment to better evaluate mosaic integration, extending scIB to scMIB (Supplementary Table 4). Because mosaic integration should generate both low-dimensional representations and imputed, batch-corrected data, scMIB is performed on both the embedding space and the feature space. To evaluate the batch correction and biological conservation metrics on the feature space, we convert the imputed and batch-corrected features into a similarity graph via the PCA+WNN strategy (see 'Implementation of comparing methods') and then use this graph for evaluation. Our metrics for batch correction, modality alignment and biological conservation are defined as follows.
Batch correction metrics
The batch correction metrics comprise the graph integration local inverse Simpson's index (iLISI; $y_{\mathrm{embed}}^{\mathrm{iLISI}}$ and $y_{\mathrm{feat}}^{\mathrm{iLISI}}$), graph connectivity ($y_{\mathrm{embed}}^{\mathrm{gc}}$ and $y_{\mathrm{feat}}^{\mathrm{gc}}$) and the kNN batch effect test (kBET; $y_{\mathrm{embed}}^{\mathrm{kBET}}$ and $y_{\mathrm{feat}}^{\mathrm{kBET}}$), where $y_{\mathrm{embed}}^{\mathrm{iLISI}}$, $y_{\mathrm{embed}}^{\mathrm{gc}}$ and $y_{\mathrm{embed}}^{\mathrm{kBET}}$ are defined in the embedding space and $y_{\mathrm{feat}}^{\mathrm{iLISI}}$, $y_{\mathrm{feat}}^{\mathrm{gc}}$ and $y_{\mathrm{feat}}^{\mathrm{kBET}}$ are defined in the feature space.
Graph iLISI
The graph iLISI metric is extended from the iLISI61, which is used to measure the degree of batch mixing. The iLISI scores are computed on kNN graphs by computing the inverse Simpson's index for diversity. The scores estimate the effective number of batches present in the neighborhood of a cell. The iLISI ranges from 1 to B, where B equals the number of batches; scores close to the real batch number denote good mixing. However, the typical iLISI score is not applicable to graph-based outputs. scIB proposed the graph iLISI, which uses a graph-based distance metric to determine the nearest-neighbor list and avoids skews on graph-based integration outputs. The graph iLISI scores are scaled to [0, 1], where 0 indicates strong separation and 1 indicates perfect mixing.
Graph connectivity
Graph connectivity is proposed by scIB to inspect whether cells with the same label are connected in the kNN graph of all cells. For each label $c$, we take the largest connected component of the $c$-labeled cells and divide its size by the total number of $c$-labeled cells to represent the graph connectivity for cell label $c$. We then calculate the connectivity values for all labels and take their average as the total graph connectivity. The score ranges from 0 to 1. A score of 1 means that all cells with the same cell identity from different batches are connected in the integrated kNN graph, indicating perfect batch mixing, and vice versa.
kBET
The kBET62 is used to measure batch mixing at the local level of the kNN. Fractions of random cells are repeatedly selected to test whether the local label distributions are statistically similar to the global label distributions (null hypothesis). The kBET value is the rejection rate over all tested neighborhoods; values close to 0 indicate that the batches are well mixed. scIB adjusts the kBET with a diffusion-based correction to enable unbiased comparison of graph- and non-graph-based integration results. kBET values are first computed for each label, then averaged and subtracted from 1 to obtain a final kBET score.
Modality alignment metrics
The modality alignment metrics comprise the modality averaged silhouette width (ASW; $y^{\mathrm{ASW}}$), the fraction of samples closer than the true match (FOSCTTM; $y^{\mathrm{FOSCTTM}}$), the label transfer F1 ($y^{\mathrm{ltF1}}$), the ATAC area under the receiver operating characteristic curve (AUROC; $y^{\mathrm{AUROC}}$), the RNA Pearson's r ($y^{\mathrm{RNAr}}$) and the ADT Pearson's r ($y^{\mathrm{ADTr}}$), where $y^{\mathrm{ASW}}$, $y^{\mathrm{FOSCTTM}}$ and $y^{\mathrm{ltF1}}$ are defined in the embedding space, and $y^{\mathrm{AUROC}}$, $y^{\mathrm{RNAr}}$ and $y^{\mathrm{ADTr}}$ are defined in the feature space.
Modality ASW
The modality ASW is used to measure the alignment of distributions between different modality embeddings. The ASW63 was originally used to measure the separation of clusters. In scIB, the ASW was modified to measure batch effect removal, resulting in a batch ASW that ranges from 0 to 1, where 1 denotes perfect batch mixing and 0 denotes strong batch separation. By replacing batch embeddings with modality embeddings, we define a modality ASW in the same manner as the batch ASW, where 1 denotes perfect modality alignment and 0 denotes strong modality separation. For MIDAS, the modality embeddings are generated by feeding the trained model with each modality individually.
FOSCTTM
The FOSCTTM64 is used to measure the alignment of values between different modality embeddings. Let $y_{m_1, m_2}^{\mathrm{FOSCTTM}}$ be the FOSCTTM for a modality pair $\{m_1, m_2\}$; it is defined as
$$\begin{aligned} y_{m_1, m_2}^{\mathrm{FOSCTTM}} &= \frac{1}{2N}\left( \sum_i \frac{N_i^{m_1}}{N} + \sum_i \frac{N_i^{m_2}}{N} \right) \\ N_i^{m_1} &= \left| \left\{ j \mid \| \boldsymbol{e}_i^{m_1} - \boldsymbol{e}_j^{m_2} \|_2 < \| \boldsymbol{e}_i^{m_1} - \boldsymbol{e}_i^{m_2} \|_2 \right\} \right| \\ N_i^{m_2} &= \left| \left\{ j \mid \| \boldsymbol{e}_i^{m_2} - \boldsymbol{e}_j^{m_1} \|_2 < \| \boldsymbol{e}_i^{m_2} - \boldsymbol{e}_i^{m_1} \|_2 \right\} \right| \end{aligned}$$
(34)
where $N$ is the number of cells, $i$ and $j$ are cell indices, and $\boldsymbol{e}_i^{m_1}$ and $\boldsymbol{e}_i^{m_2}$ are the embeddings of cell $i$ in modalities $m_1$ and $m_2$, respectively. $N_i^{m_1}$ is the number of cells in modality $m_2$ that are closer to $\boldsymbol{e}_i^{m_1}$ than $\boldsymbol{e}_i^{m_2}$ is to $\boldsymbol{e}_i^{m_1}$, and similarly for $N_i^{m_2}$. We first obtain the embeddings of the individual modalities, calculate the FOSCTTM values for each modality pair, average these values and subtract the average from 1 to obtain a final FOSCTTM score. Higher FOSCTTM scores indicate better modality alignment.
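For one modality pair, Eq. (34) can be computed with NumPy broadcasting as sketched below (the function name is ours):

```python
import numpy as np

def foscttm(e1: np.ndarray, e2: np.ndarray) -> float:
    """FOSCTTM for one modality pair (Eq. (34)); e1 and e2 are the (N, d)
    embeddings of the same N cells in modalities m1 and m2. The final score
    reported in the text is 1 minus the average over all modality pairs."""
    d = np.linalg.norm(e1[:, None, :] - e2[None, :, :], axis=2)  # pairwise distances
    true_match = np.diag(d)                      # ||e_i^{m1} - e_i^{m2}||
    n1 = (d < true_match[:, None]).sum(axis=1)   # N_i^{m1}: m2 cells closer than the match
    n2 = (d < true_match[None, :]).sum(axis=0)   # N_i^{m2}: m1 cells closer than the match
    N = len(e1)
    return float((n1 / N).sum() + (n2 / N).sum()) / (2 * N)
```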
Label transfer F1
The label transfer F1 is used to measure the alignment of cell types between different modality embeddings. This can be achieved by testing whether cell-type labels can be transferred from one modality to another without any bias. For each pair of modalities, we first build a kNN graph between their embeddings and then transfer labels from one modality to the other based on the nearest neighbors. The transferred labels are compared to the original labels using the micro F1 score, which is defined as the label transfer F1. We take the F1 score averaged over all comparison pairs as the final label transfer F1 score.
ATAC AUROC
The ATAC AUROC is used to measure the alignment of different modalities in the ATAC feature space. It has been previously used to evaluate the quality of ATAC predictions65. For each method to be evaluated, we first use it to convert the different modality combinations that do not contain ATAC into ATAC features, calculate the AUROC of each converted result by taking the true ATAC features as the ground truth and finally take the average of these AUROCs as the final score. Taking MIDAS as an example, if ATAC, RNA and ADT data are involved, the evaluation is based on the combinations {RNA}, {ADT} and {RNA, ADT}. For each combination, we feed the data into the trained model to generate the imputed data of all modalities {ATAC, RNA, ADT} (see 'Imputation for missing modalities and features'), where the generated ATAC features are used for AUROC calculation.
RNA Pearson’s r
The RNA Pearson's r value is used to measure the alignment of different modalities in the RNA feature space. For each method to be evaluated, we first use it to convert the different modality combinations that do not contain RNA into RNA features, calculate the Pearson's r value between each converted result and the true RNA features and finally take the average of these Pearson's r values as the final score.
ADT Pearson’s r
The ADT Pearson’s r value is used to measure the alignment of different modalities in the ADT feature space. The calculation of the ADT Pearson’s r value is similar to that of the RNA Pearson’s r value.
Biological conservation metrics
The biological conservation metrics comprise the normalized MI (NMI; $y_{\mathrm{embed}}^{\mathrm{NMI}}$ and $y_{\mathrm{feat}}^{\mathrm{NMI}}$), the adjusted Rand index (ARI; $y_{\mathrm{embed}}^{\mathrm{ARI}}$ and $y_{\mathrm{feat}}^{\mathrm{ARI}}$), the isolated label F1 ($y_{\mathrm{embed}}^{\mathrm{ilF1}}$ and $y_{\mathrm{feat}}^{\mathrm{ilF1}}$) and the graph cell-type LISI (cLISI; $y_{\mathrm{embed}}^{\mathrm{cLISI}}$ and $y_{\mathrm{feat}}^{\mathrm{cLISI}}$), where $y_{\mathrm{embed}}^{\mathrm{NMI}}$, $y_{\mathrm{embed}}^{\mathrm{ARI}}$, $y_{\mathrm{embed}}^{\mathrm{ilF1}}$ and $y_{\mathrm{embed}}^{\mathrm{cLISI}}$ are defined in the embedding space, and $y_{\mathrm{feat}}^{\mathrm{NMI}}$, $y_{\mathrm{feat}}^{\mathrm{ARI}}$, $y_{\mathrm{feat}}^{\mathrm{ilF1}}$ and $y_{\mathrm{feat}}^{\mathrm{cLISI}}$ are defined in the feature space.
NMI
The NMI is used to measure the similarity between two clustering results, namely the predefined cell-type labels and the clustering result obtained from the embeddings or the graph. Optimized Louvain clustering is used here according to scIB. The NMI scores are scaled to [0, 1], where 0 and 1 correspond to uncorrelated clustering and a perfect match, respectively.
ARI
The ARI also measures the overlap of two clustering results. The Rand index (RI66) considers not only cell pairs that are assigned in the same clusters but also ones in different clusters in the predicted (Louvain clustering) and true (cell-type) clusters. The ARI corrects the RI for randomly correct labels. An ARI of 1 represents a perfect match, and 0 represents random labeling.
Isolated label F1
scIB proposes the isolated label F1 score to evaluate integration performance, specifically focusing on cells whose label is shared by few batches. Cell labels present in the fewest batches are identified as isolated labels. The F1 score measuring the clustering performance on isolated labels is defined as the isolated label F1 score. It reflects how well the isolated labels separate from other cell identities, ranging from 0 to 1, where 1 means that all the isolated label cells and no others are grouped into one cluster.
Graph cLISI
The graph cLISI is similar to the graph iLISI but focuses on cell-type labels rather than batch labels. Unlike iLISI that highlights the mixing of groups, cLISI values the separation of groups61. The graph-adjusted cLISI is scaled to [0, 1], with a value of 0 corresponding to low cell-type separation and a value of 1 corresponding to strong cell-type separation.
Overall scores
scIB
We compute the scIB overall score using the batch correction and biological conservation metrics defined either on the embedding space (for algorithms generating embeddings or graphs) or on the feature space (for algorithms generating batch-corrected features). Following Luecken et al.43, the overall score $y$ is the sum of the averaged batch correction metric $y^{\mathrm{batch}}$ weighted by 0.4 and the averaged biological conservation metric $y^{\mathrm{bio}}$ weighted by 0.6,
$$\begin{aligned} y^{\mathrm{batch}} &= (y_{\omega}^{\mathrm{iLISI}} + y_{\omega}^{\mathrm{gc}} + y_{\omega}^{\mathrm{kBET}})/3 \\ y^{\mathrm{bio}} &= (y_{\omega}^{\mathrm{NMI}} + y_{\omega}^{\mathrm{ARI}} + y_{\omega}^{\mathrm{ilF1}} + y_{\omega}^{\mathrm{cLISI}})/4 \\ y &= 0.4 \cdot y^{\mathrm{batch}} + 0.6 \cdot y^{\mathrm{bio}} \end{aligned}$$
(35)
where ω = embed for embedding or graph outputs, and ω = feat for feature outputs.
scMIB
As an extension of scIB, the scMIB overall score $y$ is computed from the batch correction, modality alignment and biological conservation metrics defined on both the embedding and feature spaces. It is the sum of the averaged batch correction metric $y^{\mathrm{batch}}$ weighted by 0.3, the averaged modality alignment metric $y^{\mathrm{mod}}$ weighted by 0.3 and the averaged biological conservation metric $y^{\mathrm{bio}}$ weighted by 0.4:
$$\begin{aligned} y^{\mathrm{batch}} &= (y_{\mathrm{embed}}^{\mathrm{iLISI}} + y_{\mathrm{embed}}^{\mathrm{gc}} + y_{\mathrm{embed}}^{\mathrm{kBET}} + y_{\mathrm{feat}}^{\mathrm{iLISI}} + y_{\mathrm{feat}}^{\mathrm{gc}} + y_{\mathrm{feat}}^{\mathrm{kBET}})/6 \\ y^{\mathrm{mod}} &= (y^{\mathrm{ASW}} + y^{\mathrm{FOSCTTM}} + y^{\mathrm{ltF1}} + y^{\mathrm{AUROC}} + y^{\mathrm{RNAr}} + y^{\mathrm{ADTr}})/6 \\ y^{\mathrm{bio}} &= (y_{\mathrm{embed}}^{\mathrm{NMI}} + y_{\mathrm{embed}}^{\mathrm{ARI}} + y_{\mathrm{embed}}^{\mathrm{ilF1}} + y_{\mathrm{embed}}^{\mathrm{cLISI}} + y_{\mathrm{feat}}^{\mathrm{NMI}} + y_{\mathrm{feat}}^{\mathrm{ARI}} + y_{\mathrm{feat}}^{\mathrm{ilF1}} + y_{\mathrm{feat}}^{\mathrm{cLISI}})/8 \\ y &= 0.3 \cdot y^{\mathrm{batch}} + 0.3 \cdot y^{\mathrm{mod}} + 0.4 \cdot y^{\mathrm{bio}} \end{aligned}$$
(36)
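Eq. (36) is a plain weighted average of the three metric groups; a trivial helper (the name is ours) makes the arithmetic explicit:

```python
def scmib_overall(batch_metrics, mod_metrics, bio_metrics):
    """Eq. (36): average each group of metrics, then take the weighted sum.
    Expects the six batch, six modality and eight biology metric values."""
    y_batch = sum(batch_metrics) / len(batch_metrics)
    y_mod = sum(mod_metrics) / len(mod_metrics)
    y_bio = sum(bio_metrics) / len(bio_metrics)
    return 0.3 * y_batch + 0.3 * y_mod + 0.4 * y_bio
```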
Datasets
All datasets of human PBMCs were publicly available (Supplementary Table 1). Count matrices of gene unique molecular identifiers (UMIs), ATAC fragments and ADTs were downloaded for data analysis.
DOGMA dataset
The DOGMA dataset contains four batches profiled by DOGMA-seq, which measures RNA, ATAC and ADT data simultaneously. Trimodal data from this dataset were obtained from Gene Expression Omnibus (GEO)67 under accession ID GSE166188 (ref. 3).
TEA dataset
The TEA dataset contains five batches profiled by TEA-seq, which measures RNA, ATAC and ADT data simultaneously. Trimodal data from these batches were obtained from GEO under accession ID GSE158013 (ref. 4).
TEA Multiome dataset
The TEA Multiome dataset measuring paired RNA and ATAC data was obtained from GEO under accession ID GSE158013 (ref. 4). This dataset contains two batches profiled by 10x Chromium Single Cell Multiome ATAC + Gene Expression.
10x Multiome dataset
The 10x Multiome dataset measuring paired RNA and ATAC data was collected from 10x Genomics (https://www.10xgenomics.com/resources/datasets/)68,69,70,71.
ASAP dataset
The ASAP dataset was obtained from GEO under accession ID GSE156473 (ref. 3). Two batches profiled by ASAP-seq are used, which include ATAC and ADT data.
ASAP CITE dataset
The ASAP CITE dataset was obtained from GEO under accession ID GSE156473 (ref. 3). Two batches profiled by CITE-seq are used, which include RNA and ADT data.
WNN CITE dataset
The WNN CITE dataset measuring paired RNA and ADT data was obtained from https://atlas.fredhutch.org/nygc/multimodal-pbmc (ref. 15). This dataset was profiled by CITE-seq. We selected the eight PBMC batches generated before administration of the HIV vaccine for integration.
BMMC mosaic dataset
The BMMC mosaic dataset included three batches. The ICA batch measuring RNA data was obtained from https://www.dropbox.com/s/xe5tithw1xjxrfs/ica_bone_marrow.h5?dl=0 (ref. 72), where the first batch (‘MantonBM1’) of the original data is used. The ASAP batch measuring ADT and ATAC data was obtained from GEO under accession ID GSE156477 (ref. 3). The CITE batch measuring RNA and ADT data was obtained from GEO under accession ID GSE128639 (ref. 13).
Data preprocessing
The count matrices of RNA and ADT were processed via Seurat. The ATAC fragment files were processed using Signac, and peaks were called via the Python package MACS2 (ref. 73; v2.2.7.1). We performed quality control separately for each batch. Briefly, metrics of detected gene number per cell, total UMI number, percentage of mitochondrial RNA reads, total protein tag number, total fragment number, transcription start site score and nucleosome signal were evaluated. We manually checked the distributions of these metrics and set customized criteria to filter low-quality cells in each batch. The number of cells that passed quality control in each batch is shown in Supplementary Table 1.
For each batch, we adopted common normalization strategies for RNA, ADT and ATAC data, respectively. Specifically, for RNA data, UMI count matrices were normalized and log transformed using the NormalizeData function in Seurat; for ADT data, tag count matrices were normalized with the centered log-ratio (CLR) transform, also via Seurat's NormalizeData function; and for ATAC data, fragment matrices were normalized using term frequency–inverse document frequency (TF-IDF) via the RunTFIDF function in Signac.
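For readers unfamiliar with these transforms, the plain-NumPy sketch below illustrates all three. It only approximates the Seurat/Signac functions named above, as both tools offer several variants (for example, the CLR margin and the TF-IDF weighting scheme differ between implementations).

```python
import numpy as np

def log_normalize(counts: np.ndarray, scale: float = 1e4) -> np.ndarray:
    """RNA: per-cell library-size normalization followed by log1p
    (approximates Seurat's 'LogNormalize')."""
    return np.log1p(counts / counts.sum(axis=1, keepdims=True) * scale)

def clr_normalize(counts: np.ndarray) -> np.ndarray:
    """ADT: centered log-ratio across proteins within each cell
    (one common CLR variant; Seurat supports several margins)."""
    logged = np.log1p(counts)
    return logged - logged.mean(axis=1, keepdims=True)

def tfidf_normalize(counts: np.ndarray, scale: float = 1e4) -> np.ndarray:
    """ATAC: TF-IDF weighting of a cells-by-peaks matrix
    (approximates Signac's RunTFIDF default; variants exist)."""
    tf = counts / counts.sum(axis=1, keepdims=True)
    idf = counts.shape[0] / (1.0 + counts.sum(axis=0, keepdims=True))
    return np.log1p(tf * idf * scale)
```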
To integrate batches profiled by various technologies, we created a union of features for RNA, ADT and ATAC data, respectively. For RNA data, we first removed low-frequency genes based on gene occurrence frequency across all batches. We then selected 4,000 highly variable genes in each batch using the FindVariableFeatures function with default parameters. The union of these highly variable genes was ranked using the SelectIntegrationFeatures function, and the top 4,000 genes were selected. We also retained genes encoding the proteins targeted by the antibodies. For ADT data, the union of antibodies across all batches was retained for data integration. For ATAC data, we used the reduce function in Signac to merge all intersecting peaks across batches and then recalculated the fragment counts in the merged peaks. The merged peaks were used for data integration.
The input data for MIDAS are UMI counts for RNA, tag counts for ADT and binarized fragment counts for ATAC. For each modality, the union of features from all batches is used, and counts of missing features are set to 0. Binary feature masks are generated accordingly, with 1 and 0 denoting present and missing features, respectively.
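A minimal sketch of this union-and-mask construction follows, assuming each batch's counts are held in a pandas DataFrame with feature names as columns; the helper name is ours, not MIDAS code.

```python
import numpy as np
import pandas as pd

def to_union_space(counts: pd.DataFrame, union_features: list):
    """Hypothetical helper: expand a batch's cell-by-feature counts to the
    union feature space, zero-filling missing features, and build the
    binary feature mask (1 = present in this batch, 0 = missing)."""
    aligned = counts.reindex(columns=union_features, fill_value=0)
    mask = np.isin(union_features, counts.columns).astype(np.uint8)
    return aligned.to_numpy(), mask

# Usage: suppose batch 1 measured proteins CD3 and CD4 only.
batch1_adt = pd.DataFrame([[5, 2], [0, 7]], columns=["CD3", "CD4"])
union_adt = ["CD3", "CD4", "CD8"]
x, m = to_union_space(batch1_adt, union_adt)  # x zero-fills CD8; m == [1, 1, 0]
```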
Implementation of MIDAS
We implement the MIDAS architecture using PyTorch74. The sizes of the shared hidden layers for different modality encoders are set to 1,024–128, whereas the sizes of the shared hidden layers for different modality decoders are set to 128–1,024. Additionally, the sizes of the biological state and technical noise latent variables are set to 32 and 2, respectively (refer to Supplementary Fig. 1 and Supplementary Table 12 for details). Each hidden layer is constructed using four PyTorch modules: Linear, LayerNorm, Mish and Dropout. The input and output layers have different sizes depending on the datasets used (refer to Supplementary Table 13 for details). To effectively reduce the number of model parameters, similar to Wu et al.65, the input and reconstruction layers for the ATAC modality are both split into 22 independent, fully connected layers based on the genomic regions of different human chromosomes (excluding sex chromosomes).
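To make the layer construction concrete, here is a hedged PyTorch sketch of one hidden block and of the chromosome-split ATAC input layer; the class, function and argument names are illustrative, not MIDAS's actual code, and the per-chromosome peak counts would come from the merged peak set.

```python
import torch
import torch.nn as nn

def hidden_block(d_in: int, d_out: int, p_drop: float = 0.2) -> nn.Sequential:
    # One hidden layer as described: Linear -> LayerNorm -> Mish -> Dropout.
    return nn.Sequential(
        nn.Linear(d_in, d_out),
        nn.LayerNorm(d_out),
        nn.Mish(),
        nn.Dropout(p_drop),
    )

class ChromosomeSplitInput(nn.Module):
    """Sketch of the split ATAC input layer: one independent fully
    connected layer per autosome (chr1-chr22), outputs concatenated."""
    def __init__(self, peaks_per_chrom: list, d_out_per_chrom: int):
        super().__init__()
        self.splits = list(peaks_per_chrom)
        self.fcs = nn.ModuleList(
            nn.Linear(p, d_out_per_chrom) for p in self.splits
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        chunks = torch.split(x, self.splits, dim=1)  # slice peaks by chromosome
        return torch.cat([fc(c) for fc, c in zip(self.fcs, chunks)], dim=1)
```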
To train MIDAS, we set the modality alignment loss weight (α) to 50, the technical IB loss weight (βs) to 30, the biological IB loss weight (βx) to 4 and the technical noise likelihood loss weight (γ) to 1,000. The number of updates (K) of the batch classifiers $\{r, \{r^m\}_{m \in \mathcal{M}}\}$ in each iteration is set to 3. We split the dataset into training and validation sets in a 95:5 ratio. The minibatch size is set to 256, and we use the AdamW75 optimizer with a learning rate of 10−4 for stochastic gradient descent. We train the model for a maximum of 2,000 epochs, with early stopping to terminate training. The dropout rate for all hidden layers is set to 0.2. All the hyperparameter settings for MIDAS training are listed in Supplementary Table 14.
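A schematic PyTorch sketch of these optimization settings follows. Here `model`, the data loaders, the loss assembly and `evaluate` are assumptions, and the patience value is illustrative, as the paper states only that early stopping is used.

```python
import torch

# `model`, `train_loader`, `val_loader` and `evaluate` are assumed to exist;
# `model.loss` is a schematic stand-in for the combined MIDAS objective.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

best_val, patience, bad_epochs = float("inf"), 50, 0
for epoch in range(2000):  # maximum of 2,000 epochs
    model.train()
    for minibatch in train_loader:  # minibatch size 256
        optimizer.zero_grad()
        loss = model.loss(minibatch)
        loss.backward()
        optimizer.step()
    val_loss = evaluate(model, val_loader)  # validation on the held-out 5%
    if val_loss < best_val:
        best_val, bad_epochs = val_loss, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:  # early stopping
            break
```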
Implementation of comparing methods
We compared MIDAS with 19 recent methods on different trimodal and bimodal integration tasks (see Supplementary Table 3 for an overview). If a method cannot handle missing features for a certain modality, we used the feature intersection of different batches of that modality for integration. For a fair comparison, we set the size of the low-dimensional representations generated by each method to be 32, the same as that of the biological states inferred by MIDAS. For other settings of each method, if not specified, their default values were used. For trimodal rectangular integration tasks, because few methods are directly applicable to ATAC, RNA and ADT trimodal data, we decomposed the rectangular integration into two steps, that is, batch correction for each modality independently and modality fusion for all batch-corrected modalities. We then combined different batch correction and modality fusion methods to achieve rectangular integration, resulting in nine different strategies in total.
Methods compared in trimodal rectangular integration tasks
BBKNN+average
The Python package BBKNN76 (v1.5.1) is used for batch correction (embedding space), and graph averaging is used for modality fusion. For each batch, we use functions from Seurat to perform dimensionality reduction on the count data. We first use RunTFIDF and RunSVD functions to obtain the low-dimensional representation of ATAC data and then use NormalizeData, ScaleData and RunPCA functions to obtain the low-dimensional representations of RNA and ADT data, respectively. For the obtained low-dimensional representation of each modality, we use the bbknn function of the Scanpy77 (v1.9.1) Python package to remove the batch effect and obtain a similarity graph. Finally, we average the similarity graphs of different modalities to obtain the output.
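A hedged Scanpy sketch of the per-modality BBKNN runs and the final graph averaging is shown below; the `.obsm` keys are hypothetical, and the per-modality low-dimensional representations are assumed to be precomputed as described above.

```python
import scanpy as sc

# Per-modality BBKNN on precomputed representations, then graph averaging.
# `adata` holds all batches, with a "batch" column in .obs.
graphs = []
for rep in ["X_lsi_atac", "X_pca_rna", "X_pca_adt"]:  # hypothetical .obsm keys
    sc.external.pp.bbknn(adata, batch_key="batch", use_rep=rep)
    graphs.append(adata.obsp["connectivities"].copy())

# Average the three batch-corrected similarity graphs into the fused output.
adata.obsp["connectivities"] = sum(graphs) / len(graphs)
```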
Harmony+WNN
The R package Harmony61 (v0.1.1) is used for batch correction (embedding space), and the WNN algorithm15 of the Seurat package is used for modality fusion. We use the same processing method as BBKNN+average to obtain low-dimensional representations of different batches of ATAC, RNA and ADT data, respectively. For the obtained low-dimensional representation of each modality, we use the RunHarmony function of the Harmony package to remove batch effects. We then use Seurat’s FindMultiModalNeighbors function, that is, the WNN algorithm, to fuse the low-dimensional representations of different modalities to obtain the graph output.
LIGER+WNN
The R package LIGER78 (v1.0.0) is used for batch correction (embedding space), and WNN is used for modality fusion. For each batch, we use Seurat's RunTFIDF and ScaleData functions for ATAC data normalization and the NormalizeData and ScaleData functions for RNA and ADT data normalization. For each modality, we then use the RunOptimizeALS and RunQuantileNorm functions of the LIGER package for dimensionality reduction and batch effect removal. Finally, we use Seurat's FindMultiModalNeighbors function (the WNN algorithm) to fuse the low-dimensional representations of different modalities and obtain the graph output.
MOFA+
The R package MOFA+24 (v1.4.0) is used for simultaneous batch correction (embedding space) and modality fusion. We first use the same processing method as LIGER+WNN to normalize each modality separately and then use the run_mofa and get_factors functions of the MOFA+ package to perform batch effect removal and modality fusion jointly on the normalized data, obtaining the low-dimensional representation output.
PCA+WNN
Singular value decomposition is used for the dimensionality reduction of ATAC data, and principal component analysis is used for the dimensionality reduction of RNA and ADT data; no batch correction is applied. WNN is then used for modality fusion. We use the same processing method as BBKNN+average to obtain low-dimensional representations of different batches of ATAC, RNA and ADT data, respectively. We then use Seurat's FindMultiModalNeighbors function (the WNN algorithm) to fuse the low-dimensional representations of different modalities and obtain the graph output.
Scanorama-embed+WNN
The Python package Scanorama78 (v1.7.2) is used for batch correction (embedding space), and WNN is used for modality fusion. For each modality, we use the integrate function from the Scanorama package for dimensionality reduction and batch effect removal. We then use Seurat's FindMultiModalNeighbors function (the WNN algorithm) to fuse the low-dimensional representations of different modalities and obtain the graph output.
Scanorama-feat+WNN
Scanorama is used for batch correction (feature space), and WNN is used for modality fusion. For each modality, we perform batch correction using the correct function of the Scanorama package. For the batch-corrected count data, we then use the PCA+WNN strategy to get the graph output.
Seurat-CCA+WNN
Seurat’s CCA13 is used for batch correction (feature space), and WNN is used for modality fusion. For each modality, we use Seurat’s FindIntegrationAnchors function (reduction = ‘cca’), that is, the CCA algorithm, to anchor different batches and use its IntegrateData function to correct batch effects. For the batch-corrected count data, we then use the PCA+WNN strategy to get the graph output.
Seurat-RPCA+WNN
Seurat’s RPCA13 is used for batch correction (feature space), and WNN is used for modality fusion. It uses the same strategy as Seurat-CCA+WNN, except that the FindIntegrationAnchors function is applied with reduction = ‘rpca’.
Methods compared in trimodal mosaic integration tasks
Multigrate
The Python package Multigrate29 (v0.0.2) is available at https://github.com/theislab/multigrate. For data inputs, we took the intersection of genes in scRNA-seq data and proteins in ADT data across different batches. We processed the data using the default method of Multigrate. The values of parameters KL and integ were set to 0.1 and 3,000, respectively.
scMoMaT
The Python package scMoMaT27 (v0.2.0) is designed to integrate multimodal mosaic data. The code is available at https://github.com/PeterZZQ/scMoMaT. We take the same preprocessed data as MIDAS. For each modality, because scMoMaT does not handle missing features, we only use the intersected features of different batches of the preprocessed data for integration. We set the minibatch size to 0.1 × N for training, where N is the number of cells.
scVAEIT
The Python package scVAEIT26 is designed to integrate multimodal mosaic data. The code is available at https://github.com/jaydu1/scVAEIT. After filtering low-quality cells and features as MIDAS does, we size-normalized and log-normalized the counts of genes and proteins separately, and binarized the peaks by setting all nonzero values to 1.
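A brief NumPy sketch of this preprocessing, with `rna_counts`, `adt_counts` and `atac_counts` as hypothetical cell-by-feature arrays:

```python
import numpy as np

def size_log_normalize(counts: np.ndarray, scale: float = 1e4) -> np.ndarray:
    # Per-cell size normalization followed by log1p.
    return np.log1p(counts / counts.sum(axis=1, keepdims=True) * scale)

rna_in = size_log_normalize(rna_counts)         # genes, normalized separately
adt_in = size_log_normalize(adt_counts)         # proteins, normalized separately
atac_in = (atac_counts > 0).astype(np.float32)  # binarized peaks
```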
StabMap
The R package StabMap28 (v0.1.8) is designed to integrate single-cell data with non-overlapping features. The code is available at https://github.com/MarioniLab/StabMap. To select suitable highly variable features, we set different parameters for different modalities (mean > 0.01 and P ≤ 0.05 for RNA; mean > 0.25 and P ≤ 0.05 for ATAC; mean > 0.01 and P ≤ 0.1 for ADT). In the case of diagonal integration, because there are no shared features between modalities, we convert the ATAC regions into their nearest genes using the ClosestFeature function in Signac and convert the ADT proteins into their corresponding genes. To obtain more shared features between modalities in diagonal integration, we also relax the criteria for highly variable features (mean > 0.01 and P ≤ 0.1 for RNA; mean > 0.25 and P ≤ 0.05 for ATAC; all features for ADT). In diagonal integration, we choose the RNA batch as the reference set; in other cases, we choose the batch with the largest number of modalities as the reference set.
Methods compared in bimodal (ATAC and RNA) integration tasks
Cobolt
The Python package Cobolt21 (v1.0.1) is designed to integrate bimodal mosaic data from ATAC and RNA data. The code is available at https://github.com/epurdom/cobolt. We take the same preprocessed data as MIDAS and retain the intersected features of each modality across batches. So that Cobolt can read the data, we store the preprocessed data of each modality in each batch as a SingleData object. We set the learning rate to 5 × 10−4.
MultiVI
MultiVI22 is designed to integrate bimodal mosaic data from ATAC and RNA data. The code is integrated into the Python package scvi-tools (v1.0.0), which is available at https://github.com/scverse/scvi-tools. We take the same preprocessed data as MIDAS. For each modality, we also retain intersected features of different batches. In the model setup, we use batch_key to specify the cell modality and use categorical_covariate_keys to specify the cell batch.
uniPort
The Python package uniPort23 (v1.2.2) is designed to integrate heterogeneous single-cell bimodal data. The code is available at https://github.com/caokai1073/uniPort. Because uniPort supports horizontal, vertical and diagonal integration, we combine all three integration methods to achieve our tasks.
GLUE
The Python package GLUE25 (v0.3.2) is designed for integrating unpaired multimodal data (for example, scRNA-seq data, scATAC-seq data and snmC-seq data) using graph-linked unified embeddings. Due to GLUE's inability to handle trimodal integration with ADT data, we limited the evaluation to bimodal ATAC and RNA integration tasks. The code is available at https://github.com/gao-lab/GLUE, and the GTF file used in the experiments can be obtained from https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_43/gencode.v43.annotation.gtf.gz. To remove batch effects, we set use_batch = ‘batch’ in the experiments.
Methods compared in bimodal (RNA and ADT) integration tasks
totalVI
totalVI18 is designed to integrate bimodal mosaic data from RNA and ADT data. The code is integrated into the Python package scvi-tools (v1.0.0). Because totalVI does not handle missing genes, we took the intersection of genes in the RNA data across the input batches; for the ADT data, the union of proteins from all batches was used.
sciPENN
The Python package sciPENN19 (v1.0.0) is designed to integrate bimodal data from RNA and ADT data and is available at https://github.com/jlakkis/sciPENN. Because sciPENN cannot handle missing genes, we took the intersection of RNA features and the union of ADT features for different input batches.
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.