Main
A wide range of human diseases are associated with diverse genetic alterations that may be responsible for initiating, promoting or otherwise modifying the course of a given disease. These alterations can be quite complex; for instance, cancer genomes typically contain a repertoire of single nucleotide variants (SNVs) and large-scale copy number alterations that can impact many genes in different ways depending on the type of alteration, gene function and biological context. While tumor genotype is a well-established determinant of disease initiation, progression and therapy responses, the functional impact conferred by the thousands of unique mutations observed in human tumors remains poorly understood. This presents a major challenge to precision medicine efforts that aim to tailor cancer therapies to patients suffering from cancers harboring specific genetic lesions. Beyond the clinic, understanding the impact that diverse types of mutations have on different residues and protein domains would improve our fundamental understanding of gene and protein function (Fig. 1a).
Fig. 1: High-throughput design and evaluation of a TP53 prime editing sensor library.
a, Schematic of our overall approach. We aim to engineer variants observed in patients with high throughput to perform functional screens in diverse contexts, elucidating variant functions to improve our ability to stratify and treat patients. b, Schematic of the sensor framework, which links each pegRNA to its editing outcome at the endogenous locus. c, We used PEGG to generate a TP53 prime editing sensor library targeting >1,000 cancer-associated TP53 variants with a median of 30 pegRNAs per variant. d, Heatmap visualization of the pegRNAs included in the TP53 sensor library, which includes SNVs, indels and silent substitutions. e, Correlation between editing at the sensor and endogenous locus in eight TP53-targeting pegRNA-sensor pairs at day 3 (D3) and day 7 (D7) posttransduction. f, Schematic of the screening protocol. The prime editing sensor library is transduced into cells constitutively expressing PEmax, and screening is performed in the presence or absence of the small molecule Nutlin-3. g, The average correct editing percentage among all pegRNAs in the library (left) or when considering only the most efficient pegRNA for each variant (right) at various timepoints in both conditions for pegRNA-sensor pairs with at least 100 sequencing reads; n = 3 biologically independent replicates. Data are presented as mean values with a 95% confidence interval. h, Rank plot of the correct editing percentage of the most efficient pegRNA per variant, as assayed at the sensor locus, at each timepoint. Source data and code to reproduce this figure can be found at https://github.com/samgould2/p53-prime-editing-sensor/blob/main/figure1.ipynb.
AD1/2, activation domain 1/2; BlastR, blasticidin selection marker; CTD, C-terminal domain; DEL, deletions; EIF1α, eukaryotic initiation factor 1 alpha; INS, insertions; nCas9, nicking Cas9; NLS, nuclear localization signal; PRR, proline rich region; Puro, puromycin selection marker; P2A, peptide 2A; RT, reverse transcriptase; STOP, U6 polyT termination sequence; tevo, tevopreQ1; U6, U6 promoter.
Until recently, approaches for studying genetic variants have been limited to low-throughput, homology-directed repair (HDR)-based methods or high-throughput, nonphysiological gene overexpression systems1,2,3,4,5,6,7. While powerful, the former approach lacks scalability and generality due to the requirements of HDR and its limitation primarily to actively dividing cells. Gene overexpression systems have fewer requirements and are scalable, but fail to physiologically recapitulate the biology driven by these variants due to the absence of endogenous gene regulation mechanisms, many of which are not known for genes of interest. The recent development of precision genome editing tools, including base editing and prime editing, allows variants to be modeled in their native, endogenous genomic context with increased editing efficiency and theoretically higher throughput8,9,10.
Prime editing10 can be used to generate effectively any type of small mutation, including all SNVs and small insertions and deletions (indels). Prime editors are directed to engineer a mutation of interest by the instructions encoded in a prime editing guide RNA (pegRNA), which contains both a protospacer (the ‘search’ sequence) and a 3′ extension sequence (the ‘replace’ sequence that dictates the mutation installed at the site). The modular search-and-replace ability of prime editing has been leveraged to interrogate endogenous variants in high-throughput methods11,12,13. In these approaches, libraries of pegRNAs are delivered transiently or stably to cells expressing prime editors, and the fitness of variants is assessed by determining the relative distribution of endogenous alleles and/or pegRNAs. While powerful, these approaches have important limitations for screening applications, including reliance on a small number of variant-specific pegRNAs with unknown editing performance, inability to quantitatively assess endogenous genome editing at scale, and potential overrepresentation of undesired indels due to using PE3, a prime editor system that uses an additional guide RNA that nicks the nonedited strand to increase editing efficiency10.
With these challenges in mind, we sought to develop an integrative computational and experimental framework for high-throughput design, screening and deconvolution of pegRNA libraries to interrogate a diverse spectrum of genetic variants. This includes pairing each pegRNA with a variant-specific synthetic ‘sensor’ site14 that recapitulates the native architecture of the endogenous target locus. This sensor-based approach links pegRNA identity to editing outcomes for simultaneous high-throughput quantification of pegRNA editing activity and empirical calibration of screening data.
We chose the p53 transcription factor as a prototype to test this approach for investigating the biological impact of specific genetic variants. Notably, TP53 is the most frequently mutated gene in cancer and exhibits extensive allelic variation, leading to the generation of altered proteins that can produce functionally distinct phenotypes. Whether distinct variants of TP53 (and other genes) encode proteins with differing functional activities that influence cancer phenotypes remains controversial and technically challenging to investigate, particularly at scale. Several studies have used orthogonal cDNA-based exogenous overexpression systems to probe the fitness of p53 variants in human, mouse and yeast systems6,7,15,16. However, given the artificial nature of these screens, which rely on expression of variants at supraphysiological levels, we hypothesized that these strategies could misrepresent one or more phenotypes associated with p53 variants. Artifacts that stem from exogenous overexpression systems could be particularly relevant when studying proteins like p53 because p53 functions as a tetramer whose expression and degradation are tightly controlled by the cell17,18,19. Thus, we reasoned that alterations to the stoichiometric balance of p53 via overexpression could lead to erroneous conclusions about the effects of particular p53 variants, including misclassifying certain variants as noncausal or otherwise benign.
To tackle this question, we generated and screened a library of >28,000 pegRNAs targeting >1,000 TP53 variants observed across >40,000 cancer patients20, the largest set of endogenous TP53 variants studied so far. We included SNVs, insertions and deletions observed in patients, putative neutral silent substitutions as controls and a panel of random indels to increase the functional search space. These experiments identified alleles that impact p53 function in mechanistically diverse ways. We discovered that certain types of endogenous variants, particularly those found in the p53 oligomerization domain (OD), display opposite phenotypes when tested with exogenous overexpression systems. Collectively, these results highlight the physiological importance of gene dosage in shaping native protein stoichiometry and protein–protein interactions, and establish a powerful computational and experimental framework for studying diverse types of genetic variants at scale. To ensure widespread accessibility of this resource for the scientific community, we provide a publicly available Python package, Prime Editing Guide Generator (PEGG) (https://pegg.readthedocs.io/en/latest/), as a tool to generate prime editing sensor libraries.
Results
High-throughput design of prime editing sensor libraries
A principal limitation of using prime editing to systematically investigate genetic variants is the inherent variability in editing efficiency among different pegRNAs10,21,22,23. A number of computational tools for pegRNA design have been developed24,25,26,27,28,29,30,31,32,33, including machine-learning algorithms that can nominate sets of pegRNAs predicted to produce high efficiency edits. However, even pegRNAs generated by these predictive algorithms require extensive experimental validation, and their editing activity is not guaranteed to correlate strongly across different cell types. We hypothesized that coupling pegRNAs with ‘sensors’—artificial copies of their endogenous target sites—would allow us to systematically identify high efficiency pegRNAs while controlling for the confounding effects of variable editing efficiency in a screening context (Fig. 1b).
Synthetic sensor-like target sites have been used previously by our group and others to control for base editing gRNA editing efficiencies while defining the relative fitness of variants in genetic screens14,34. Several studies have applied a similar strategy to both base and prime editing technologies to identify features of efficient gRNAs or pegRNAs and train predictive algorithms21,32,33,35,36,37. However, this approach has yet to be applied for high-throughput phenotypic screening of endogenous genetic variants with prime editing, probably due in part to the lower editing efficiency of prime editing relative to base editing. We reasoned that a sensor-based prime editing screening approach could be powerful to discriminate bona fide endogenous variants from undesired editing outcomes that enrich or deplete in a screen. Moreover, the sensor approach would theoretically overcome the limitations of assessing variants at different genetic sites in parallel by eliminating the need to sequence several endogenous loci.
To test this approach, we first needed to build a computational tool capable of designing and ranking pegRNAs for thousands of genetic variants, while automatically generating a paired sensor site. To address this unmet need, we built and publicly released PEGG (Extended Data Fig. 1a)—a Python package that enables high-throughput design of prime editing sensor libraries38 (available at https://pegg.readthedocs.io/en/latest/). PEGG is compatible with a range of mutation input formats, including all of the datasets on the cBioPortal, ClinVar identifiers and custom mutation inputs39,40.
We chose the TP53 tumor suppressor gene as a prototype to establish and credential a scalable prime editing sensor-based screening approach for a number of reasons. First, TP53 is the most frequently mutated gene in human cancer, with ~50% of patients suffering from tumors harboring a mutation within the TP53 gene while the rest often inactivate the p53 pathway through other mechanisms. Second, thousands of unique TP53 mutations have been identified in cancer patients, including eight or so ‘hotspot’ alleles in specific residues that exhibit the highest mutational frequencies19. Although p53 has been studied for decades, there have been few systematic studies, and those have been hampered by reliance on artificial overexpression of mutant p53 proteins, unrepresentative cell lines and/or a limited spectrum of mutations evaluated6,7,15,16. These and other studies have sparked controversy in the field over whether any mutant p53 proteins are endowed with activities that go beyond LOF or dominant negative activity to achieve GOF or neomorphic status. These are important questions that extend beyond TP53 because mutant GOF proteins generated by cancer-associated variants, and the phenotypes they produce, could represent attractive therapeutic targets. Finally, prime editing sensor-based screening could be scaled up and broadly deployed to identify causal genetic variants implicated in cancer and other diseases with a strong genetic association.
With the above goals in mind, we first sought to generate a library of pegRNAs targeting TP53 variants. To generate this library, we selected variants from the MSK-IMPACT database, which uses deep exon sequencing of patient tumor samples to identify cancer-associated variants20. From this database of over 40,000 patients, we chose all observed SNVs in p53, as well as frequently observed insertions and deletions, along with a collection of random indels (Extended Data Fig. 1b). We reasoned that including several pegRNAs with different protospacers and combinations of pegRNA properties for each variant would allow us to scan the pegRNA design space more thoroughly to identify highly efficient guides for robust statistical analysis of variant phenotypes. To accomplish this, we used PEGG to produce 30 pegRNA designs per variant (for pegRNAs with a sufficient number of accessible PAM sequences) with varying reverse transcription template (RTT) (10–30 nucleotides) and primer binding site (PBS) lengths (10–15 nucleotides) coupled to canonical ‘NGG’ protospacers. The generated pegRNA designs were ranked based on a composite ‘PEGG score’ that integrates literature best practices for pegRNA design (Extended Data Fig. 1a and Supplementary Table 1).
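The design space described above can be illustrated with a minimal sketch. This is not PEGG's implementation or API: the function names, the 20-nucleotide protospacer length and the 5-nucleotide RTT step size are assumptions made for illustration. Only the PBS range (10–15 nucleotides), the RTT range (10–30 nucleotides) and the 'NGG' PAM requirement come from the text.

```python
from itertools import product

PROTOSPACER_LEN = 20  # assumed standard SpCas9 protospacer length


def find_protospacers(seq):
    """Return (protospacer, pam_start) pairs for every NGG PAM on this strand
    that has a full-length protospacer immediately 5' of it."""
    seq = seq.upper()
    hits = []
    for pam_start in range(PROTOSPACER_LEN, len(seq) - 2):
        # pam_start indexes the 'N'; the next two bases must be 'GG'
        if seq[pam_start + 1:pam_start + 3] == "GG":
            protospacer = seq[pam_start - PROTOSPACER_LEN:pam_start]
            hits.append((protospacer, pam_start))
    return hits


def design_grid(pbs_lengths=range(10, 16), rtt_lengths=range(10, 31, 5)):
    """Enumerate (PBS length, RTT length) combinations to try per protospacer.

    The ranges follow the text; the RTT step of 5 is an illustrative choice.
    """
    return list(product(pbs_lengths, rtt_lengths))


# Hypothetical usage on a toy sequence (not a real TP53 sequence)
toy_locus = "ACGT" * 5 + "TGG" + "ACACA"
candidates = find_protospacers(toy_locus)
grid = design_grid()
```

In a real design tool, each candidate protospacer would be combined with each (PBS, RTT) pair, checked to confirm that the RTT spans the intended edit, and then scored and ranked. Those steps are omitted here.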
PEGG also generated silent substitution variants as neutral internal controls for the screen, and we filtered pegRNAs to exclude protospacers with an MIT specificity score below 50 to reduce the probability of off-target editing41 (Extended Data Fig. 1e). In addition, these pegRNA designs included an epegRNA motif, tevopreQ1, an RNA pseudoknot located at the 3′ end of the pegRNA that improves editing by preventing degradation of the guide22. Even after these relatively stringent filtration steps, we were able to generate pegRNA designs for more than 95% of the input variants, resulting in a library of >28,000 pegRNAs (Fig. 1c,d and Extended Data Fig. 1c,d). Each pegRNA in the library is also paired with a 60-nucleotide long variant-specific synthetic ‘sensor’ that is generated by PEGG and included in the final oligonucleotide design. Every sensor is designed to recapitulate the native endogenous target locus, thereby linking pegRNA identity to editing outcomes (Fig. 1b).
To test the efficacy of using the sensor as a readout of editing at the endogenous locus, we randomly selected eight TP53 variant-specific pegRNA sensors generated during the process of library preparation. We generated lentivirus for each of these prime editing sensor constructs and performed separate transductions into cells expressing PEmax. At 3 days (D3) and 7 days (D7) posttransduction, we harvested genomic DNA and amplified both the pegRNA–sensor cassette and the endogenous locus targeted by each pegRNA. Analysis of editing at the sensor and endogenous locus revealed a very high correlation between the sensors and endogenous sites (Spearman correlation ≥0.9; Fig. 1e). In general, the prime editing sensor seems to slightly overestimate the editing activity at the endogenous locus, probably in part due to differences in locus chromatin accessibility42, but the ranking of pegRNA editing efficiencies is largely preserved, validating our sensor-based approach.
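The sensor-versus-endogenous comparison rests on a rank correlation, which can be sketched in plain Python as follows. The function names are hypothetical and the editing percentages are made-up placeholders, not measurements from this study. The code computes Spearman's coefficient as the Pearson correlation of tie-averaged ranks.

```python
def average_ranks(values):
    """Return 1-based ranks, assigning tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)


# Placeholder editing percentages for eight pegRNA-sensor pairs (illustrative)
sensor_pct = [5.0, 12.0, 30.0, 1.0, 8.0, 22.0, 15.0, 3.0]
endogenous_pct = [4.0, 10.0, 25.0, 0.5, 6.0, 18.0, 13.0, 2.0]
rho = spearman(sensor_pct, endogenous_pct)
```

Because Spearman's coefficient depends only on ranks, a sensor that uniformly overestimates endogenous editing, as described above, can still yield a coefficient near 1 as long as the ordering of pegRNAs is preserved.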
High-throughput interrogation of TP53 variants
Next, we screened our library of variants in TP53 wild type (WT) A549 lung adenocarcinoma cells stably expressing PEmax21. To measure the prime editing activity of this cell line, we generated and transduced these cells with a modified all-in-one lentiviral version of the fluorescence-based PEAR reporter43, validating that the cells displayed strong editing activity (Extended Data Fig. 2a). We then introduced the lentiviral TP53 sensor library into these cells at a low multiplicity of infection and in triplicate while ensuring a library representation of >1,000× at every step of the screen. At 4 days posttransduction (D4), we split the populations into untreated or Nutlin-3-treatment arms (Fig. 1f). Nutlin-3 is a small molecule that inhibits MDM2 to activate the p53 pathway, which can be used to select for TP53 mutations that promote bypass of p53-dependent cell cycle arrest and apoptosis44. We hypothesized that this treatment group may increase the signal-to-noise ratio between TP53 variants with putative loss-of-function (LOF) or gain-of-function (GOF) activities and benign variants. We allowed the screen to progress for 34 days (D34), harvesting cell pellets from each replicate and treatment arm at several timepoints (Extended Data Fig. 2b). Genomic DNA extracted from each sample was used to amplify the pegRNA–sensor cassettes, which were subjected to next-generation sequencing (NGS) to simultaneously assess enrichment/depletion of pegRNAs and their editing activity and outcomes at the sensor target site (Extended Data Fig. 2c,d).
The average editing efficiency among all pegRNAs in the library increased in a time-dependent manner, peaking at ~8% in the final timepoint. In general, we observed low indel rates and strong correlation in sensor editing among replicates (Fig. 1g and Extended Data Fig. 3a–d). Strikingly, selecting only the most efficient pegRNA design for each variant led to a twofold increase in the average editing efficiency, highlighting the utility of the sensor for systematic empirical identification of high efficiency pegRNAs (Fig. 1g and Extended Data Fig. 3e–g). Cells with higher editing efficiency also exhibited stronger Nutlin-3 bypass in the Nutlin-3-treatment arm (Fig. 1g). Based on the assessment of editing at the sensor locus, we were able to identify active pegRNAs (≥2% editing efficiency) for more than half of the TP53 variants included in the library. This includes highly efficient pegRNAs that install the desired edit with over 20% efficiency for more than 20% of the variants (Fig. 1h). These validated pegRNAs could be further engineered with silent mutations that evade mismatch repair to boost overall editing efficiency21.
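Selecting the most efficient pegRNA per variant and counting variants with an active pegRNA can be sketched as below. The record layout, identifiers and numbers are hypothetical placeholders. Only the ≥2% activity threshold is taken from the text.

```python
def best_pegrna_per_variant(records):
    """Map each variant to its (pegRNA id, editing %) with the highest
    sensor-measured correct editing percentage.

    `records` is an iterable of (variant, pegrna_id, edit_pct) tuples.
    """
    best = {}
    for variant, pegrna_id, edit_pct in records:
        if variant not in best or edit_pct > best[variant][1]:
            best[variant] = (pegrna_id, edit_pct)
    return best


def fraction_with_active(best, threshold=2.0):
    """Fraction of variants whose best pegRNA meets the activity threshold."""
    active = sum(1 for _, pct in best.values() if pct >= threshold)
    return active / len(best)


# Made-up sensor measurements for three variants (illustrative only)
toy_records = [
    ("R175H", "peg_1", 0.4),
    ("R175H", "peg_2", 21.0),
    ("R248Q", "peg_3", 1.1),
    ("R248Q", "peg_4", 0.2),
    ("R273H", "peg_5", 6.5),
]
best = best_pegrna_per_variant(toy_records)
active_fraction = fraction_with_active(best)
```

Averaging over the values in `best` rather than over all records is what produces the higher mean efficiency of the "most efficient pegRNA per variant" view described above.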
The size and diversity of this library also allowed us to examine features of highly efficient pegRNAs that broadly recapitulated previous observations32,33,37,45. Correlation analysis between various pegRNA features and editing efficiency across all timepoints identified the estimated on-target activity of the protospacer (as predicted by Rule Set 2)46 as the single largest determinant of prime editing efficiency (Fig. 2a). In addition, the distance between the edit and the nick introduced by nCas9 was correlated negatively with editing efficiency, while the length of the postedit homology was correlated positively with editing efficiency (Fig. 2a). Thus, edits closer to the nick and with larger postedit homology were more efficient, consistent with previous findings32,33,37,45.
Fig. 2: Identification of features of highly efficient pegRNAs.
a, Spearman correlations between various features of pegRNA design and correct editing percentage, assessed for all pegRNA-sensor pairs with sufficient reads. Each dot represents a separate replicate/timepoint. For Doench 2016 score, see ref. 46. b, Relationship between PEGG score and average correct editing percentage at each timepoint and condition is increasing monotonically. c, Representative example of the correlation between PEGG score and editing efficiency for day 25 replicate 1 (D25-REP1) (Untreated). d, Visualization of the protospacer bias in editing efficiency. The number of pegRNAs generated per protospacer at each TP53 exon on the plus (+) or minus (−) strand (top) and the average editing efficiency at each of these protospacers at day 34 (D34) of the untreated condition (bottom). e, Average editing efficiency for SNV-generating pegRNAs in the library as a function of distance to the nick generated by PEmax and PBS length. The location of the ‘NGG’ PAM sequence is highlighted in blue. Protospacer disrupting (locations +1 to +3) and PAM-disrupting variants (locations +5 and +6) tend to be more efficient. f, Feature importance of 20 random forest models trained separately to predict pegRNA efficiency. Each dot represents a different model. Data are presented as mean values with a 95% confidence interval. Source data and code to reproduce this figure can be found at https://github.com/samgould2/p53-prime-editing-sensor/blob/main/figure2.ipynb. NUT, Nutlin-3-treated.
Notably, the PEGG Score, which is a weighted linear combination of pegRNA features based on literature best practices, correlated more strongly with prime editing efficiency than any other single feature, achieving a Spearman correlation of up to 0.4 (Fig. 2a–c). Although this correlation is modest relative to published predictive models32,33,37, the PEGG score is a simple, unbiased and cell type/organism-agnostic prediction of pegRNA activity that could complement machine-learning-based predictions of prime editing activity, which may vary due to training on particular cell types.
To further analyze the differences in prime editing activity among the 173 protospacers spanning the TP53 locus, we visualized the number of pegRNAs that utilized each protospacer and the average editing efficiency at each protospacer (Fig. 2d). This analysis suggests that only a subset of protospacers can be used to generate high efficiency pegRNAs, while other protospacers retain little-to-no editing activity. We also found that pegRNAs that introduce edits that disrupt the protospacer or PAM sequence tend to be more efficient (Fig. 2e). Relative to the nick created by nCas9, SNVs introduced at the +1–3 position, which mutate the protospacer, and at the +5–6 position, which mutate the guanine bases in the NGG PAM, display increased editing activity. In contrast, edits introduced at the +4 position, corresponding to the ‘N’ in the ‘NGG’ PAM sequence, display reduced editing, probably due to their failure to disrupt the PAM sequence (Fig. 2e).
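The position convention used above can be made explicit with a short sketch. The function and its labels are hypothetical. The sketch assumes that offsets are counted 3′ of the nick on the PAM-containing strand, with the nick three nucleotides upstream of the PAM, so that +1 to +3 fall in the protospacer, +4 is the 'N' of the PAM and +5 to +6 are the two guanines.

```python
def classify_snv_offset(offset):
    """Classify an SNV by its position relative to the nCas9 nick.

    `offset` is 1-based: +1 is the first base 3' of the nick.
    """
    if offset < 1:
        raise ValueError("offset must be >= 1 (edits lie 3' of the nick)")
    if offset <= 3:
        return "protospacer-disrupting"
    if offset == 4:
        return "PAM N position (PAM retained)"
    if offset <= 6:
        return "PAM-disrupting"
    return "downstream of PAM"


# Label the first eight positions after the nick
labels = {pos: classify_snv_offset(pos) for pos in range(1, 9)}
```

Under this convention, an SNV at +4 changes a base that 'NGG' recognition does not constrain, which is consistent with the reduced editing reported for that position.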
Finally, we trained a random forest regressor to predict pegRNA efficiency (Extended Data Fig. 4a). Even with a restricted set of features, this algorithm was able to predict pegRNA activity with a Spearman correlation of ~0.6, comparable with other, more complex algorithms used to predict PE activity32,33,37 (Extended Data Fig. 4b). Analysis of the relative feature importance of this random forest model was again consistent with previous findings, and highlighted the GC content of the PBS as another important determinant of editing not identified with simple correlation analysis (Fig. 2f). These results demonstrate that large-scale, gene-specific prime editing sensor screening datasets can also provide insight into the determinants of high efficiency prime editing, even though these libraries were not designed with that objective in mind.
Sensor-based calibration identifies pathogenic TP53 variants
To assess the relative fitness conferred by engineered TP53 variants, we used the MAGeCK pipeline to normalize read counts among replicates and quantify the log2 fold change (LFC) and false discovery rate (FDR) of pegRNAs in the library47. While the LFC in pegRNA counts was highly correlated in replicates from the untreated and Nutlin-3-treated arms of the screen, respectively, the correlation among replicates between the two conditions was modest, suggesting that treatment-dependent biological effects were occurring (Extended Data Figs. 3a and 5). We then used the sensor target site as a quantitative proxy for editing efficiency at the endogenous locus to systematically filter pegRNAs based on their empirical editing efficiency and precision (Fig. 3a and Extended Data Fig. 6a–d). As expected, the number of significantly enriched or depleted pegRNAs in ‘sensor-calibrated’ datasets decreased as we increased the editing activity threshold (Fig. 3b). These results demonstrate that our sensor-based approach allows empirical removal of pegRNAs that exhibit potentially spurious enrichment or depletion, and low and/or undesired editing activity, retaining pegRNAs that are more likely to introduce the variants of interest with high efficiency and precision. Based on these results, we decided to focus our statistical analyses on a dataset composed of pegRNAs with ≥10% editing efficiency to minimize the confounding effects of imprecise editing (Fig. 3c–g).
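The sensor-calibrated filtering step can be sketched as follows. This is an illustrative reconstruction, not the authors' pipeline and not part of MAGeCK. The record fields, the toy values and the FDR cutoff passed in the example are assumptions, and the per-pegRNA FDR values are assumed to have been computed beforehand.

```python
def sensor_calibrate(pegrnas, min_edit_pct):
    """Keep pegRNAs whose sensor-measured correct editing meets the threshold."""
    return [p for p in pegrnas if p["edit_pct"] >= min_edit_pct]


def count_significant(pegrnas, fdr_cutoff):
    """Count pegRNAs whose precomputed FDR falls below the cutoff."""
    return sum(1 for p in pegrnas if p["fdr"] < fdr_cutoff)


def significant_by_threshold(pegrnas, thresholds, fdr_cutoff):
    """Number of significant pegRNAs remaining at each editing threshold."""
    return {
        t: count_significant(sensor_calibrate(pegrnas, t), fdr_cutoff)
        for t in thresholds
    }


# Made-up screen results (illustrative only)
toy_screen = [
    {"id": "peg_a", "edit_pct": 0.5, "fdr": 0.01},   # significant, barely edits
    {"id": "peg_b", "edit_pct": 3.0, "fdr": 0.02},
    {"id": "peg_c", "edit_pct": 12.0, "fdr": 0.001},
    {"id": "peg_d", "edit_pct": 25.0, "fdr": 0.6},   # edits well, not significant
]
# The 0.05 cutoff is an illustrative choice, not a value taken from the paper
counts = significant_by_threshold(toy_screen, thresholds=[0, 2, 10], fdr_cutoff=0.05)
```

In this toy example, `peg_a` appears significant but shows almost no editing at its sensor, so any threshold above its editing rate removes it. This mirrors the way raising the threshold reduces the number of enriched or depleted pegRNAs.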
Fig. 3: High-throughput prime editing sensor screens identify pathogenic TP53 variants.
a, Schematic of the sensor-calibrated filtration approach, where the editing rate of a pegRNA is determined by the sensor locus and pegRNAs below a given editing threshold are filtered. b, Number of significantly enriching or depleting pegRNAs (FDR
Copyright for syndicated content belongs to the linked Source : Nature.com – https://www.nature.com/articles/s41587-024-02172-9