INTRODUCTION
Rare variants in genes involved in genetic disease exact a heavy toll
of disability and premature death worldwide. For example, some variants
in the CFTR gene cause cystic fibrosis (Strausbaugh & Davis,
2007), variants in the HBB gene can cause sickle cell disease
(Kato et al., 2018), and variants in the LDLR gene cause familial
hypercholesterolemia (Defesche et al., 2017); all of these diseases
place heavy burdens on the health and quality of life of patients and
their families. Despite the low frequency of each individual genetic
disorder, there are around eight thousand genes in which single variants
can lead to genetic disease (Amberger, Bocchini, Scott, & Hamosh, 2020),
resulting in a high total frequency of genetic diseases, which affect
more than 300 million people worldwide. One of the main concerns in
genetic disease is diagnostic delay. For rare diseases, 80% of which
have a known genetic cause, the delay until a correct diagnosis averages
4.8 years (Evans, 2018) but can be as long as 30 years (Gainotti
et al., 2018), adding a burden of stress for medical practitioners,
patients, and their families.
The process of genetic diagnosis aims to correctly identify the genetic
variant that is causing a specific disease. This is a complex process
that takes into account multiple data sources including, but not
limited to, gene–phenotype associations, allele frequencies in a
population relevant to the patient, the inheritance pattern of the
disease, functional studies of suspected variants, and computational
predictions (Richards et al., 2015). Until recently, the process was
based on gene panels or chromosomal arrays that included a limited
number of variants known to be pathogenic and associated with
particular diseases (Fogel, Satya-Murti, & Cohen, 2016; Miller et al.,
2010). With this approach, variants that are not included in the assay
cannot be detected. In the last ten years, the introduction of
high-throughput sequencing technologies (Whole Exome and Whole Genome
Sequencing) has increased the yield of variants detected in a single
test and has demonstrated clinical and diagnostic utility superior to
that of the formerly used first-line tests for many diseases (Clark et
al., 2018). However, owing to the complexity of the process, the higher
yield of detected variants has not been matched by a proportional
increase in variant interpretation capabilities, resulting in an
explosion of variants of uncertain significance (VUS). In fact, the
number of variants classified as VUS has increased exponentially in the
last few years, and the majority of clinically interpreted variants are
currently VUS (Weile & Roth, 2018). This problem is even more prevalent
among “underrepresented minorities” compared to Caucasian populations,
as there are fewer genomic and clinical studies with patients from
these populations (Walsh et al., 2019).
Ideally, VUS are reclassified into a more informative category
(pathogenic, likely pathogenic, likely benign, or benign), but
achieving this goal requires ascertaining new information about the
variant through experimental or population studies, which take time
and consume resources. One way to prioritize VUS with a higher
probability of being pathogenic (i.e., disease-causing) for further
studies is to use computational predictive tools. Computational
predictive tools are models that estimate the probability that a given
variant is deleterious or pathogenic based on information about its
evolutionary conservation, its effect on protein structure or function
(if it is a coding variant), or its effect on relevant features of the
DNA sequence (e.g., splice sites, regulatory sites, and protein–DNA
binding sites, among others). The most commonly used tools (Ghosh, Oak,
& Plon, 2017) yield scores ranging from 0 to 1; some of these scores
reflect a probability, but not all are calibrated to represent a true
probability. Some tools, such as CADD (Rentzsch et al., 2019), yield
phred-scaled scores instead, from which probability scores can be
obtained using calibration formulas. Using these probability scores,
researchers and clinicians can prioritize VUS with a higher probability
of being pathogenic, and the scores can potentially guide clinical
decision-making for these variants when additional evidence is lacking.
However, currently used predictors have several shortcomings. First,
most predictors are designed for missense variants, leaving out an
important proportion of the variants currently classified as VUS, which
span other consequence types (Figure 1). Some frameworks that handle
other variant types (e.g., MutationSVM) provide separate tools for each
variant type. Additionally, tools for missense variants tend to
overestimate the pathogenicity of benign variants. Finally, although
other tools perform better as classifiers of pathogenic vs.
non-pathogenic variants, the probability distributions they produce for
VUS do not reflect the probability thresholds suggested by the ACMG for
variant classification into the four remaining categories.
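The score-based prioritization described above can be sketched in a few lines of Python. This is an illustration, not any specific tool's implementation: the variant identifiers, scores, and the 0.9 cutoff are hypothetical, and the phred inversion shown is the generic transform Q = -10·log10(x) (for CADD, x is a rank fraction among all possible substitutions, not a calibrated probability of pathogenicity).

```python
import math

def phred_to_fraction(q: float) -> float:
    """Invert the generic phred transform Q = -10 * log10(x).

    Note: for rank-based tools such as CADD this recovers a rank
    fraction, not a calibrated pathogenicity probability; a separate
    calibration formula is needed for the latter.
    """
    return 10 ** (-q / 10)

def prioritize(variants, threshold=0.9):
    """Keep variants whose (assumed calibrated) pathogenicity
    probability meets `threshold`, sorted most probable first.

    `variants` is a list of (variant_id, probability) pairs;
    both the IDs and the threshold here are illustrative.
    """
    hits = [v for v in variants if v[1] >= threshold]
    return sorted(hits, key=lambda v: v[1], reverse=True)

# Hypothetical example scores (not real predictions):
scores = [("var_A", 0.97), ("var_B", 0.42), ("var_C", 0.93)]
print(prioritize(scores))  # [('var_A', 0.97), ('var_C', 0.93)]
```

A real pipeline would replace the hypothetical probabilities with calibrated outputs of a predictor; the point of the sketch is only that prioritization reduces to thresholding and ranking once scores are on a probability scale.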
Here, we present a comparison of three machine learning models (Random
Forest, Support Vector Machine, and a Five-Layer Perceptron) in a
one-for-all approach, meaning that each model can correctly prioritize
variants of different consequence types. To increase their predictive
power and interpretability, we merged the ACMG Benign and Likely Benign
categories into a single Benign category, and the ACMG Pathogenic and
Likely Pathogenic categories into a single Pathogenic category. To
avoid circularity bias, we trained our models using conservation scores
that did not include clinical interpretation data. Additionally, we
demonstrate that including allele frequencies increases the predictive
power of the models. To assess the performance of the resulting models
for prioritization of the VUS population, we benchmarked them against
currently used predictors using a set of variants that had been
classified as VUS in the last three years but were reclassified into
the remaining categories as of August 2020 in ClinVar (Landrum et al.,
2017), showing superior performance across different variant
consequence types.
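The category-merging step described above can be sketched as a simple label mapping. The dictionary keys mirror ClinVar's clinical-significance strings, and the function name and return convention (0/1/None) are illustrative choices, not the paper's actual code.

```python
# Collapse the ACMG five-tier scheme into binary training labels,
# as described in the text; keys follow ClinVar's
# clinical-significance strings.
MERGE = {
    "Benign": 0,
    "Likely benign": 0,
    "Pathogenic": 1,
    "Likely pathogenic": 1,
    # "Uncertain significance" is deliberately absent: VUS are held
    # out of training and are the variants the models later prioritize.
}

def to_binary(clinvar_label: str):
    """Return 0 (Benign), 1 (Pathogenic), or None for VUS/other labels."""
    return MERGE.get(clinvar_label)

labels = ["Likely pathogenic", "Benign", "Uncertain significance"]
print([to_binary(l) for l in labels])  # [1, 0, None]
```

Returning None for unmapped labels makes the held-out VUS set easy to separate from the binary training set with a single filter.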