Machine learning models to improve classification of VUS
We explored three machine learning strategies for classifying variants that standard variant interpretation pipelines currently report as variants of uncertain significance (VUS).
We built three models for VUS pathogenicity prediction, based on a random forest (RF), a support vector machine (SVM), and a five-layer multilayer perceptron (MLP), and measured their performance on a set of variants previously classified as VUS but since reclassified into one of the other ClinVar categories with a review status of at least two stars. This set comprises 5,537 variants representative of the main variant consequence types (Figure 1a): 2,008 (36.3%) missense variants, 1,844 (33.3%) synonymous variants, 349 (6.3%) intron variants, 475 (8.6%) splice variants, 340 (6.32%) non-coding mRNA variants, 69 (1.25%) coding INDEL variants, 151 (2.73%) intergenic variants, and 290 (5.22%) variants of other types (5-prime UTR variants, 3-prime UTR variants, upstream gene variants, downstream gene variants, TF binding site variants, and nonsense variants).
As measured by the area under the receiver operating characteristic curve (AUC), all three models outperform the best-performing benchmarked tool (CADD, with an AUC of 0.92), reaching an AUC of 0.97 for the RF- and MLP-based models and an AUC of 0.96 for the SVM-based model (Figure 1b). Additionally, each of the three models was trained both with and without 1000 Genomes Project (1KG) global allele frequencies, and the two versions were compared on the original test set. For every model, including the 1KG global allele frequencies increased the AUC (see Supplementary Figure S1).
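For readers unfamiliar with the metric, the AUC comparison described above can be illustrated with a minimal sketch. It computes the area under the ROC curve from the rank-sum (Mann-Whitney U) identity and applies it to two sets of scores for the same test labels, standing in for a model trained without and with allele frequencies. The function name `auroc`, the label encoding (1 = pathogenic, 0 = benign), and all labels and scores are hypothetical and invented purely for illustration; they are not the study's data, results, or implementation.

```python
from typing import Sequence


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via the rank-sum (Mann-Whitney U) identity.

    labels: 1 = pathogenic, 0 = benign (illustrative encoding, an assumption).
    scores: higher values mean "more likely pathogenic".
    """
    if len(labels) != len(scores):
        raise ValueError("labels and scores must have the same length")
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative label")

    # Rank the scores in ascending order; tied scores share their average rank.
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1

    # U counts the (positive, negative) pairs that are ordered correctly,
    # with ties counted as one half.
    pos_rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    u = pos_rank_sum - n_pos * (n_pos + 1) / 2
    return u / (n_pos * n_neg)


# Made-up test labels and model scores (NOT data from the study).
labels = [1, 1, 1, 0, 0, 0, 1, 0]
scores_without_af = [0.9, 0.6, 0.4, 0.5, 0.2, 0.1, 0.8, 0.7]
scores_with_af = [0.9, 0.7, 0.45, 0.4, 0.2, 0.1, 0.8, 0.5]

print(f"AUC without allele frequencies: {auroc(labels, scores_without_af):.4f}")
print(f"AUC with allele frequencies:    {auroc(labels, scores_with_af):.4f}")
```

Because the AUC equals the probability that a randomly chosen pathogenic variant scores higher than a randomly chosen benign one, it does not depend on any classification threshold, which makes it convenient for comparing tools whose scores are on different scales.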