INTRODUCTION
Rare variants in genes involved in genetic disease exact a high toll of disability and premature death worldwide. For example, some variants in the CFTR gene cause cystic fibrosis (Strausbaugh & Davis, 2007), variants in the HBB gene can cause sickle cell disease (Kato et al., 2018), and variants in the LDLR gene cause familial hypercholesterolemia (Defesche et al., 2017), all diseases that impose heavy burdens on the health and quality of life of patients and their families. Although each genetic disorder is individually rare, there are around eight thousand genes in which single variants can lead to genetic disease (Amberger, Bocchini, Scott, & Hamosh, 2020), so that genetic diseases are collectively common, affecting more than 300 million people worldwide. One of the main concerns with genetic diseases is diagnostic delay. For rare diseases, 80% of which have a known genetic cause, the delay until a correct diagnosis is reached averages 4.8 years (Evans, 2018) but can be as long as 30 years (Gainotti et al., 2018), causing an additional burden of stress for medical practitioners, patients, and their families.
The process of genetic diagnosis aims to correctly identify the genetic variant that is causing a specific disease. This is a complex process that takes into account multiple data sources including, but not limited to, gene–phenotype associations, allele frequencies in a population relevant to the patient, the inheritance pattern of the disease, functional studies of suspected variants, and computational predictions (Richards et al., 2015). Until recently, the process was based on gene panels or chromosomal arrays that included a limited number of variants known to be pathogenic and associated with particular diseases (Fogel, Satya-Murti, & Cohen, 2016; Miller et al., 2010). With this approach, variants not included in the assay cannot be detected. In the last ten years, the introduction of high-throughput sequencing technologies (Whole Exome and Whole Genome Sequencing) has increased the yield of variants detected in a single test and has demonstrated superior clinical and diagnostic utility compared with the former first-line tests for many diseases (Clark et al., 2018). However, owing to the complexity of the process, the higher yield of detected variants has not been matched by a proportional increase in variant interpretation capability, resulting in an explosion of variants of uncertain significance (VUS). In fact, the number of variants classified as VUS has increased exponentially in the last few years, and the majority of clinically interpreted variants are currently VUS (Weile & Roth, 2018). This problem is even more prevalent among underrepresented minorities than in Caucasian populations, as there are fewer genomic and clinical studies with patients from these populations (Walsh et al., 2019).
Ideally, VUS are reclassified into a more informative category (pathogenic, likely pathogenic, likely benign, or benign), but achieving this goal requires ascertaining new information on the variant through experimental or population studies, which take time and consume resources. One way to prioritize for further study the VUS with a higher probability of being pathogenic (i.e., disease-causing) is to use computational predictive tools. These tools are models that estimate the probability that a given variant is deleterious or pathogenic based on information about its evolutionary conservation, its effect on protein structure or function (if it is a coding variant), or its effect on relevant features of the DNA sequence (e.g., splice sites, regulatory sites, and protein–DNA binding sites, among others). The most commonly used tools (Ghosh, Oak, & Plon, 2017) yield scores ranging from 0 to 1; some of these scores reflect a probability value, but not all are calibrated to reflect a true probability. Some tools, such as CADD (Rentzsch et al., 2019), also yield Phred-scaled scores, from which probability scores can be obtained using calibration formulas. With these probability scores, researchers and clinicians can prioritize VUS with a higher probability of being pathogenic, and can potentially guide clinical decision making for these variants when additional evidence is lacking. However, currently used predictors have several shortcomings. First, most predictors are designed for missense variants, leaving out an important proportion of the variants currently classified as VUS, which have other consequence types (Figure 1). Some frameworks that do cover other variant types (e.g., MutationSVM) require a separate tool for each variant type. Additionally, tools for missense variants tend to overestimate the pathogenicity of benign variants. Finally, while other tools perform better as classifiers of pathogenic vs. non-pathogenic variants, the probability distributions they produce for VUS do not reflect the probability thresholds suggested by the ACMG for classifying variants into the four remaining categories.
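The relation between a Phred-scaled score and the quantity it encodes is fixed by the Phred definition, so the conversion mentioned above can be sketched minimally as follows. The mapping from calibrated probabilities to ACMG-style categories uses illustrative thresholds in the spirit of the Bayesian framework of Tavtigian et al. (2018); the exact cut-offs and the function names are assumptions for illustration, not values taken from this work.

```python
def phred_to_fraction(q: float) -> float:
    """Invert the Phred scaling Q = -10 * log10(p).

    For rank-based scores such as CADD's, the result is the fraction of
    all scored variants ranked at least as deleterious (e.g. Q = 20
    corresponds to the top 1%), not a calibrated pathogenicity
    probability; a tool-specific calibration is still required.
    """
    return 10 ** (-q / 10)


def acmg_category(p: float) -> str:
    """Map a calibrated pathogenicity probability to an ACMG-style tier.

    Thresholds are illustrative assumptions following Tavtigian et al.
    (2018), not values defined in this paper.
    """
    if p >= 0.99:
        return "Pathogenic"
    if p >= 0.90:
        return "Likely pathogenic"
    if p > 0.10:
        return "Uncertain significance"
    if p >= 0.001:
        return "Likely benign"
    return "Benign"
```

For example, a Phred-scaled score of 20 corresponds to a fraction of 0.01, and a calibrated probability of 0.95 would fall in the likely pathogenic tier under these assumed thresholds.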
Here, we present a comparison of three machine learning models (Random Forest, Support Vector Machine, and a Five-Layer Perceptron) in a one-for-all approach, meaning that each model can correctly prioritize variants of different consequence types. To increase their predictive power and interpretability, we merged the ACMG Benign and Likely Benign categories into a single Benign category, and the ACMG Pathogenic and Likely Pathogenic categories into a single Pathogenic category. To avoid circularity bias, we trained our models using conservation scores that did not include clinical interpretation data. Additionally, we demonstrate that including allele frequencies increases the predictive power of the models. To assess the performance of the resulting models for prioritization of the VUS population, we benchmarked them against currently used predictors on a set of variants that had been classified as VUS in the last three years but had been reclassified into the remaining categories as of August 2020 in ClinVar (Landrum et al., 2017), showing superior performance across different variant consequence types.
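The label-merging step described above amounts to a mapping from the five-tier clinical assertions to two training labels, with VUS held out for prioritization. The sketch below is a hypothetical illustration of that merge; the dictionary and function names are our own, not taken from the authors' code.

```python
from typing import Optional

# Hypothetical sketch of the category merge described in the text:
# Benign + Likely Benign -> Benign; Pathogenic + Likely Pathogenic -> Pathogenic.
MERGED_LABELS = {
    "Benign": "Benign",
    "Likely benign": "Benign",
    "Pathogenic": "Pathogenic",
    "Likely pathogenic": "Pathogenic",
}


def merge_label(clinical_significance: str) -> Optional[str]:
    """Return the merged binary training label, or None for assertions
    (such as 'Uncertain significance') that are not used as training
    labels and are instead held out for prioritization."""
    return MERGED_LABELS.get(clinical_significance)
```

Under this scheme, only variants with a merged label enter the training set, while those returning None form the evaluation population for prioritization.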