ABSTRACT
The growing use of next-generation sequencing technologies in genetic
diagnosis has produced an exponential increase in the number of Variants
of Uncertain Significance (VUS). In this manuscript we compare three
machine learning methods to classify VUS as Pathogenic or Non-pathogenic,
implementing a Random Forest (RF), a Support
Vector Machine (SVM), and a Multilayer Perceptron (MLP). To train the
models, we extracted 82,463 high-quality variants from ClinVar, using 9
conservation scores, the loss of function tool and allele frequencies
as features.
For the RF and SVM models, hyperparameters were tuned using grid search
with cross-validation. The three models were tested on a set of
5,537 variants that had been classified as VUS at some point during the
previous three years but had been reclassified in August 2020. The three models
achieved higher accuracy on this set than the benchmarked tools.
The RF-based model yielded the best performance across different variant
types and was used to create VusPrize, an open-source software tool for
prioritization of variants of uncertain significance. We believe that
our model can improve the process of genetic diagnosis in research and
clinical settings.