Machine learning models to improve classification of VUS
We explored three different machine learning strategies to classify
variants that are currently assigned as variants of uncertain
significance (VUS) by standard variant interpretation pipelines.
We built three models for VUS pathogenicity prediction, based on a
Random Forest (RF), a Support Vector Machine (SVM), and a five-layer
Multilayer Perceptron (MLP), and measured their performance on a set of
variants previously classified as VUS but since reclassified into one of
the other categories in ClinVar with a review status of at least two
stars. This set includes
5,537 variants representative of the main variant consequence types
(Figure 1a), including 2,008 (36.3%) missense variants, 1,844
(33.3%) synonymous variants, 349 (6.3%) intron variants,
475 (8.6%) splice variants, 340 (6.32%) non-coding mRNA
variants, 69 (1.25%) coding INDEL variants, 151 (2.73%) intergenic
variants, and 290 (5.22%) variants of other types (5-prime UTR variants, 3-prime
UTR variants, upstream gene variants, downstream gene variants, TF
binding site variants, and nonsense variants). As measured by the area
under the Receiver Operating Characteristic curve (AUROC), all three
models outperform the best-performing benchmarked tool (CADD, with an
AUROC of 0.92), with an AUROC of 0.97 for the RF- and MLP-based models
and an AUROC of 0.96 for the SVM-based model (Figure 1b).
Additionally, the three models were trained separately with and without
1000 Genomes Project (1KG) global allele frequencies to compare their
performance on the original test set. For all three models, including
the 1KG global allele frequencies increased performance as measured by
the AUROC (see Supplementary Figure S1).
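
For readers less familiar with the metric, the area under the ROC curve equals the probability that a randomly chosen pathogenic variant receives a higher score than a randomly chosen benign one, counting ties as one half. The sketch below computes it with the rank-sum (Mann-Whitney U) identity in plain Python. It is an illustration only, not the evaluation code used in this study: the function name `auroc`, the 1/0 label encoding, and the toy scores are assumptions made for the example.

```python
from typing import Sequence


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via the rank-sum (Mann-Whitney U) identity.

    labels: 1 for pathogenic, 0 for benign (assumed encoding for this sketch).
    scores: model outputs, where higher means more likely pathogenic.
    """
    if len(labels) != len(scores):
        raise ValueError("labels and scores must have the same length")
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative label")

    # Sort indices by score, then give tied scores their average 1-based rank.
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1

    # U statistic of the positive class, normalised by the number of pairs.
    rank_sum_pos = sum(r for r, y in zip(ranks, labels) if y == 1)
    u_pos = rank_sum_pos - n_pos * (n_pos + 1) / 2.0
    return u_pos / (n_pos * n_neg)


# Toy data only (not from the study): the same four variants scored by a
# hypothetical model trained without and with an allele-frequency feature.
toy_labels = [0, 0, 1, 1]
toy_scores_without_af = [0.1, 0.4, 0.35, 0.8]
toy_scores_with_af = [0.1, 0.3, 0.35, 0.8]

auc_without_af = auroc(toy_labels, toy_scores_without_af)
auc_with_af = auroc(toy_labels, toy_scores_with_af)
print(f"without AF: {auc_without_af:.2f}, with AF: {auc_with_af:.2f}")
```

In the toy example, one benign variant outranks one pathogenic variant in the first score set, so three of the four pathogenic-benign pairs are ordered correctly; in the second set all four pairs are ordered correctly. Comparing a model trained with and without a feature then amounts to comparing the two resulting values on the same test set.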