Main Body

Introduction

Next-generation sequencing (NGS ) technology has fundamentally altered how scientists identify the genetic basis of disease. Rather than sequencing and interrogating one gene at a time, clinicians and researchers now routinely sequence at a genome-wide level and, in turn, face a new problem: isolating the critical disease-causing, pathogenic variants from millions of others. While most variants are easily excluded upon the basis of frequency and gene biology, there are many variants of unknown/uncertain significance (VUS ) for which it is not evident whether a variant is pathogenic or benign.
The term VUS is an exceptionally loaded with definitions and guidelines coming from the American College of Medical Genetics (ACMG ) (Richards et al., 2018), Association for Molecular Pathology (AMP ) (M. M. Li et al., 2017), and International Agency for Research on Cancer (IARC ) (Plon et al., 2008). VUS have been the subject of entire special issues of this journal, Human Mutation , such as in 2008. The precise meaning of VUS depends upon context. For example, in clinical testing entire variant classes (such as synonymous variants) are presumed likely benign at baseline, ensuring VUS are less frequently reported. Still, VUS can take a significant psychological and emotional toll on patients and their families when reported and may lead to confusion and clinical mismanagement. Importantly, as we will discuss, VUS affect minority populations disproportionately (Culver et al., 2013; Gelfman et al., 2017; Maurano et al., 2012; Park et al., 2018; Qian et al., 2021; Vaz-Drago et al., 2017). Therefore, VUS are an issue with far-reaching impact, both in research and clinical reporting. At a simple level, the sheer numbers of intronic, synonymous, and other deep variants contribute to their overall predictive value as a variant class. Additionally, it is well-recognized that many important disease-causing variants fall into these neglected “opaque” areas of variant interpretation.
There is an overwhelming need for better tools capable of elucidating the functional consequences of rare molecular events and, thereby, illuminating how genetic variation impacts biology and disease pathogenesis. Analysis of DNA-level information alone in the context of clinical phenotypes is unlikely to clarify this information given rare events’ limited sample size and lack of statistical power. Compared to genomic analysis – which outlines the repertoire of functions encoded in DNA that a cell could  perform – transcriptomic analysis enables a deeper understanding of how a cell is actually  functioning. In other words, RNA expression profiling provides critical insight into how genotype translates into phenotype (Mantione et al., 2014).
Here we will highlight how gene expression profiling can be used to reveal the presence of deep intronic and splicing variants. We will review the emerging role of RNA sequencing as a tool to differentiate VUS as pathogenic vs. benign, especially in populations of diverse races, ethnicities, and genetic ancestries as individuals of non-European ancestry have significantly higher rates of VUS (Petrovski & Goldstein, 2016). Finally, we will explore the ways in which transcriptomic analysis can be translated into clinical diagnostics for cancer (i.e., detection of causative variants, both germline and somatic) and how it can be applied to address cancer disparities.

Leveraging the transcriptome

The term “RNA-seq” was coined in 2008 and, around this same time, several landmark studies hinted toward the potential diagnostic utility of RNA sequencing data using short-read NGS. One major effort was the ENCODE Project Consortium, which stated: “to understand the human genome… and the ways in which its defects can give rise to disease, we need a more transparent view of the information itencodes ” (Birney et al., 2007). An understanding of human disease necessitates an understanding of the link between genomic variation and transcription; this was the goal of the Genotype-Tissue Expression (GTEx ) project started in 2010 (Consortium, 2013). Analogous to the 1000 Genomes Project, which aimed to provide a catalogue of common variation, GTEx aimed to catalogue genotype and gene-expression correlations across dozens of tissues and hundreds (eventually thousands) of individuals.
An improved ability to predict the consequences of genetic variation on transcription looms as one of the largest opportunities and biggest challenges for addressing VUS. RNA analysis allows for the appreciation of phenomena like differential expression, allele-specific expression, alternative splicing, isoform switching, and many other events that would not be readily apparent in DNA data. Current approaches are largely limited to annotation of variant classes that are both rare and predictable for transcript disruption, such as nonsense mutations that lead to loss of transcript expression. More common variant classes are confounded by their higher prevalence and lower accuracy in predicting deleterious impact to gene expression.
Germline variants and somatic mutations can impact RNA species in a variety of ways, and they are often grouped into categories according to functional impact (i.e. High, Moderate, Low/Modifier) using annotation ontology software such as ANNOVAR (Wang et al., 2010; Yang & Wang, 2015) and SnpEff (Cingolani et al., 2012). Variants falling into the High  functional impact category include nonsense variants introducing premature stop-codons, stop-loss, 5’ untranslated region (UTR ) premature start codons, and start-loss variants. Many of these lead to lowered transcription, but also full removal of transcripts through nonsense-mediated decay (NMD ) (Chakravorty & Hegde, 2018). Another relatively rare variant class includes splice-site disrupting variants within the +/-2 base pair (bp ) region of the exon-intron boundaries, which can impact splicing and result in exon skipping, exon usage, mutually exclusive/inclusive exons, and/or intron retention (Woolfe et al., 2010). Variants within theModerate category typically are predicted to impact amino acid composition, including missense amino changes from non-synonymous variants and in-frame insertions/deletions.
The remainder of ontology variants are more frequently found per genome and are often pragmatically filtered out due to their low predictive value. These include synonymous variants, regulatory region variants, 3-prime (3’) UTR variants, 5-prime (5’) UTR variants, and intronic variants. It is essential to highlight splice region variants proximal to exon boundaries. Indeed, as shown in Figure 1A , variant alleles and exonic position well beyond the 2bp exon donor/acceptor window impact function. Variants annotated as missense and synonymous can alter splicing, exposing so-called cryptic splice-sites and splice enhancing or silencing elements (Woolfe et al., 2010). Moreover, deeper within the intron, variants as far out as 18 bp can impact spliceosome binding at a key motif referred to as the lariat junction.
Many of these low impact variants are found in higher numbers per genome and any given variant has a low probability of being functional. Still, there are numerous examples whereby variants drive disease via cryptic splice-sites (ex: HBB , IVS2+705T>G for β-thalasamia7), UTRs (ex: GJB1 in Charcot-Marie-Tooth neuropathy), and other deep intronic variants (ex: COL4A5IVS6+1873G>A in Alport Syndrome) (Dobkin et al., 1983; King et al., 2002). Frequently, the mechanism of impact is evident in transcriptome data. Indeed, novel RNA functions and phenomena have been elucidated that extend far beyond the traditional protein-coding role described by the “central dogma.” We are also beginning to better understand the ways in which messenger RNA (mRNA ) can be processed – chiefly by alternative splicing, RNA editing, and crosstalk between the two processes – to create the great variety that characterizes the human transcriptome (Tang et al., 2020). Examples of unique phenomena into which gene expression data provides insight, and which have significant disease implications, include splicing mutations, deep intronic truncating mutations leading to NMD and loss of mRNA, and allele-specific expression (ASE ).

Aberrant Splicing Mutations Beyond 2bp

Variants disrupting RNA splicing occur throughout the gene and, without RNA-based functional evidence, can be annotated into less obvious categories including as synonymous, intronic, and UTR variants. Functionally, splicing occurs via a complex of proteins and small nuclear RNAs called the spliceosome that form complementary RNA-RNA complexes with target RNAs (Anna & Monika, 2018). The spliceosome catalyzes the splicing of pre-mRNA into mRNA and circular lariat RNAs, the latter of which are destroyed (Talhouarne & Gall, 2018). Variants at the canonical boundary motifs between exons and introns are GT(at the 5’ end) and AG (at the 3’ end; Figure 1A ) are generally prioritized but, as shown in this figure, variants upstream and downstream can disrupt function at a lower certainty/frequency (Anna & Monika, 2018). Adding to the mechanistic complexity is the existence of a variety of spliceosomes (“major” and “minor”) as well as noncanonical — aka “cryptic” or “pseudo” — splice site sequences (e.g. GC-AG and AT-AC ). Which splice site is used largely depends upon the presence of cis-acting regulatory elements (splicing enhancers vs. suppressors), as well as the physical structures of a branch site and polypyrimidine tract (these sequences bind spliceosome proteins) (Anna & Monika, 2018).
In order to minimize false positive rate, candidate variant lists derived from DNA sequencing typically only take into considerationcanonical changes to boundary motifs (as mentioned earlier: 5’-GT and 3’-AG ) or variants that lie within 100 base pairs of canonical splice sites (that is to say not deep intronic variants). This approach is pragmatic, as these variants are both highly penetrant and rare leading to overall high precision (or positive predictive value). Without additional functional data, the large number of variants further down- or up-stream lead to overall low precision.
RNA-seq provides a direct functional measure of splicing, allowing for fewer variants, with higher inherent accuracy and precision, to be considered. Furthermore, it provides a direct measure of spliced events. As shown in Figure 1B , there are at least 7 classes of splicing events including: (1) exon skipping, (2) mutually-exclusive exons (i.e. a coordinated set of splicing events where, as a result, only one of two exon is retained), (3) exon scrambling, (4) intron retention, (5) alternative promoters and terminators, (6) alternative donor sites and (7) alternative acceptor sites (aka alternative 5′ and 3′ splice sites). It also allows for consideration of variants that are typically filtered out including synonymous variants, UTR variants, and so by supplementation with transcriptomic data. (Chen & Weiss, 2015; Pohl et al., 2013; Shi et al., 2018)
The emergence of RNA-seq in conjunction with germline DNA sequencing has led to optimization, development, and training of new bioinformatic tools for splice-site prediction and splice-site detection. Splice-site prediction focuses on using advanced algorithms and artificial intelligence (AI ) methods to predict from DNA variants alone whether a variant impact splicing. Pragmatically, these approaches suffer from the sheer number of variants and have higher utility when RNA-seq data is also readily available, reducing the overall search space. A useful metric when it comes to assessing alternative splicing events is “PSI ” or percent spliced in; also known as the exon-inclusion ratio, PSI indicates how often a given exon occurs in all isoforms of the gene that contains said exon (Tanner et al., 2021). Splice variant prediction tools typically employ one of two strategies: adaptive models or random forest analysis. The essential difference between these two methods is whether one uses large databases from which non-pathogenic vs. pathogenic splice variants can be trained. Methods like those employed by Liu et al (Liu et al., 2016) utilize random forests to generate probability scores that can be used to estimate the potential of a 3-12bp window within the lariat RNA junction.DeepSplice is an example of a tool that utilizes deep convolutional neural networks, as well as paired events to reduce false positives (Zhang et al., 2018). SpliceAI uses a deep neural network to model mRNA splicing from noncoding sites, yielding a 10% rate of pathogenic variant discovery in neurodevelopmental disorders (Jaganathan et al., 2019; Sanders et al., 2020). These are just a few emerging algorithms and, in practice, databases serve as vehicles for querying calculated predictive scores, such as with dbNSFP (Liu et al., 2020).
Other informatic algorithms directly utilize RNA-seq data in their analysis, such as LeafCutterMD (Jenkinson et al., 2020) andFRASER (Find RAre Splicing Events in RNA-seq (Mertes et al., 2021). SpliceSeq is an example of a tool that utilizes “splice graph” models (Ryan et al., 2012). Tools like SpliceV facilitate discovery and visualization of splicing events (Ungerleider & Flemington, 2019). Underlying these approaches lies layers of additional nuance with variation in alignment strategies (such as variant-aware aligners) and assembly (Hong et al., 2018). As methods rapidly evolve, implemented approaches vary substantially across and within labs, which can adversely impact database resources. Still, it is worth noting that such splicing analysis not only has diagnostic but also therapeutic utility; for instance, tumors with identified splicing variants could be targeted with therapies that inhibit the spliceosome, splicing regulatory proteins, or aberrant splicing products (Lee & Abdel-Wahab, 2016; Scotti & Swanson, 2016).

Indirect Insights From Allele-Specific Expression & Nonsense-Mediated Decay

Another indirect phenomenon into which RNA-seq can provide insight (and which DNA-only analysis cannot appreciate) is allele-specific expression (ASE ). The initial definition of ASE was a purely germline one: preferential expression of one parental allele in a heterozygous individual (Shao et al., 2019). Fundamentally with RNA-seq, ASE requires heterozygous proxy variants, often single nucleotide polymorphism (SNP s), that allow phasing (or at least quantification) of the relative expression between maternally- and paternally-inherited chromosomes (Figure 2 ). These proxy-SNPs include common and rare synonymous, UTR, and missense variants. Importantly, ASE is the end-result and, in practice, one is frequently inferring loss of a pathogenic allele by monogenic expression of the wild-type allele. Processes such as NMD will actively remove transcripts containing pathogenic mutations.
Interestingly, detection of ASE can lead to discoveries of VUS and cancer-related genes. One study in individuals with hereditary pancreatic cancer identified a heterozygous SNP in BRCA2(rs144848) displaying ASE in RNA; the variant was a VUS and the mechanism behind it was determined to be a truncating mutation leading to NMD (Tan et al., 2008). NMD is the process whereby a trimeric complex of proteins degrades mRNA that would otherwise result in potentially-pathogenic truncated proteins; it has been posited that NMD also plays a role in the regulation of normal mRNA expression, as well as regulation of alternative splicing via degradation of splice variants with premature termination codons (Brogna & Wen, 2009). One common mechanism in the pathogenesis of cancer is a nonsense, somatic mutation in a tumor suppressor that leads to NMD and, by means of haploinsufficiency or dominant negative effects, loss of function.
One key historical example leading to ASE involves measuring X-chromosome inactivation or “skewing” (a physiological process normal in females) and genomic imprinting to gain insight into the preferential selection of variants along the X-chromosome (Shao et al., 2019). In genetic females, allele-specific expression also reflects the loss of transcript expression from one allele due to X-inactivation, an important methylation-driven “dosage compensation mechanism” (Shvetsova et al., 2019). Historically, X-skewing has been detected using the HUMARA  assay, which takes advantage of differential methylation at the androgen receptor locus. However, RNA-seq along expressed X-chromosome genes provides additional insights. One can directly observe skewing directionality (ex: 95% skewing in X and preferential expression of wild-type alleles) (Szelinger et al., 2014).
Recognizing that there is natural variability in skewing in healthy females, information regarding X-inactivation is less discriminatory but still useful in unexpected ways. For example, in assessing a candidate variant within the X-linked TAF1 in a boy, our group leveraged RNA-seq data for the boy’s healthy mother which showed extreme X-skewing towards the protective wild-type allele and away from our candidate VUS (Hurst et al., 2018). In other words, RNA-seq was used to evaluate a model where X-skewing protected the mother. Further sequencing showed that the de novo event founded with her and was likely selected, given a history of unsuccessful pregnancies by the boy’s parents. This example has generalizability and is relevant to other X-linked disorders, including Alport, Charcot–Marie–Tooth, Fabry, Fanconi, Fragile X, Hunter, Rett, and Wiskott–Aldrich syndromes (Migeon, 2020).

Outlier Analysis as a Solution to Power Problems

Case-control studies are feasible for common variants but are far more difficult to accomplish for rare VUS where the controls overwhelmingly outnumber the cases. A number of solutions have been put forth to solve this problem. These include likelihood ratio tests, burden tests with genetic scores, adaptive burden tests with data-adaptive weights/thresholds, variance-component tests, exponential combination tests, normal transformation with trait “winsorization” (to find a balance between Type I error and statistical power), or a combination thereof (Auer et al., 2016; Li et al., 2021).
Alternatively, one key solution to address the inherent lack of statistical power when it comes to analyzing rare variants is to usegene expression outlier analysis. Rare variants are often associated with extremes of expression, whether over- or under-expression (X. Li et al., 2017). Variants can be interpreted through careful integration of DNA data with RNA-seq data from other patients or from public resources. Cummings et al. (2017) were one of the first to demonstrate the fact that consideration of such “outlier” variants in RNA-seq data results in an improved diagnostic rate (35% in that study, specifically) in patients for whom a molecular diagnosis cannot be made from DNA-data alone.(Cummings et al., 2017) Gene expression outlier analysis has often been employed in the field of cancer genomics to identify cancer drivers for a specific subset of cancer types or cancer outliers (Alshalalfa et al., 2012; Mori et al., 2013). Fundamentally, outlier analysis does not actually identify a mechanism but rather a gene that is fundamentally different from others expressed in a cohort. As will be discussed later, there is a fundamental presumption that the comparison cohort is relevant.