Variant feature selection
As candidate attributes for model training, we considered 24 variant
features, including splice site predictors, conservation scores,
deleteriousness/pathogenicity scores, allele frequency, and consequence
type, from the Ensembl Variant Effect Predictor (McLaren, et al., 2016;
Zerbino, et al., 2018). Features showing high Pearson correlation with
other features were removed. Additionally, features derived from models
trained on clinical significance data were discarded to avoid
circularity bias in our model estimation phase. Initially, the features
used for
training were: ada score, codon degeneracy score, integrated fitness
conservation score, BLOSUM62 score, Eigen score, phyloP score, GERP
score, SIFT score, the Loss of Function tool score, the global allele
frequencies from the 1000 Genomes Project, and the variant consequence
type encoded as binary dummy variables. Clinical
significance was used as the training label, encoded as 1 for
pathogenic and 0 for benign. To correct for class imbalance (2/3 benign
vs. 1/3 pathogenic variants), we randomly undersampled benign variants
to match the number of pathogenic variants. After testing the models'
performance on
the ex-VUS set, the models were retrained following the procedure
described above, adding the CADD phred score (retrieved from Ensembl
Variant
Effect Predictor) as a feature for the variants.
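
The preprocessing steps described above (correlation-based feature
removal, label and consequence encoding, and random undersampling of the
benign class) can be illustrated with a minimal sketch. It is not the
original implementation: the text does not state which software was
used, so the example relies only on the Python standard library. The
0.9 correlation threshold, the greedy keep-first strategy, the random
seed, the function names, and the toy feature values are all
assumptions made for illustration.

```python
import math
import random

# Label encoding stated in the text: 1 = pathogenic, 0 = benign.
LABELS = {"pathogenic": 1, "benign": 0}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    # A constant column has no defined correlation; treat it as 0.
    return cov / (sx * sy) if sx and sy else 0.0


def drop_correlated(columns, threshold=0.9):
    """Greedily keep a feature only if |r| <= threshold against every
    feature already kept. The threshold value is an assumption."""
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


def one_hot(values):
    """Encode a categorical column as one binary dummy column per category."""
    return {c: [1 if v == c else 0 for v in values] for c in sorted(set(values))}


def undersample(labels, seed=0):
    """Return sorted row indices in which benign rows (label 0) are randomly
    reduced to the number of pathogenic rows (label 1)."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    if len(neg) > len(pos):
        neg = rng.sample(neg, len(pos))
    return sorted(pos + neg)


# Toy data (hypothetical values, not taken from the study).
columns = {
    "phylop": [0.1, 0.5, 0.9, 0.3, 0.7, 0.2],
    "phylop_x2": [0.2, 1.0, 1.8, 0.6, 1.4, 0.4],  # redundant copy of phylop
    "sift": [0.9, 0.1, 0.4, 0.8, 0.05, 0.6],
}
clinical = ["benign", "benign", "pathogenic", "benign", "benign", "pathogenic"]

y = [LABELS[c] for c in clinical]      # 2/3 benign, 1/3 pathogenic
selected = drop_correlated(columns)    # removes the redundant feature
balanced_rows = undersample(y)         # equal numbers of both classes
```

In this sketch the first feature of each highly correlated pair is the
one retained; which member of a pair was kept in the original analysis
is not stated in the text.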