Parameter tuning for different machine learning models
The dataset of 76,926 variants was split into a training set and a test set with an 80:20 ratio. Using the scikit-learn library for Python 3, we trained a Random Forest model and a Support Vector Machine model with an RBF kernel. Hyperparameters were tuned via grid search with cross-validation, optimizing the area under the ROC curve (AUC). For the Random Forest, we tuned the maximum depth, the split criterion, and the number of estimators; for the Support Vector Machine, we tuned the C and gamma values. Additionally, a Five-Layer Perceptron was trained with a batch size of 50 for 25 epochs using the Keras library for Python 3 with a TensorFlow backend. ReLU was chosen as the activation function for the hidden layers and a sigmoid as the activation function for the output layer. To assess model performance, we plotted the ROC curves and calculated their AUCs. We compared models trained with and without the 1000 Genomes Project global allele frequencies.
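The tuning procedure described above can be sketched with scikit-learn's GridSearchCV. This is a minimal illustration on synthetic data, not the study's pipeline: the parameter grids, random seeds, and fold count are assumptions, and the real feature matrix of 76,926 annotated variants is replaced by a generated stand-in (the Keras perceptron is omitted here for brevity).

```python
# Hedged sketch: grid search with cross-validation optimizing ROC AUC,
# for a Random Forest and an RBF-kernel SVM, as described in the text.
# Grids and data are illustrative assumptions, not the study's actual values.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the annotated variant matrix.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# 80:20 train/test split, as in the text.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0)

# Random Forest: tune maximum depth, split criterion, number of estimators.
rf_grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"max_depth": [5, 10, None],
                "criterion": ["gini", "entropy"],
                "n_estimators": [100, 200]},
    scoring="roc_auc",
    cv=5)
rf_grid.fit(X_train, y_train)

# SVM with RBF kernel: tune the C and gamma values.
svm_grid = GridSearchCV(
    SVC(kernel="rbf", probability=True, random_state=0),
    param_grid={"C": [0.1, 1, 10], "gamma": ["scale", 0.01, 0.1]},
    scoring="roc_auc",
    cv=5)
svm_grid.fit(X_train, y_train)

print(rf_grid.best_params_)
print(svm_grid.best_params_)
```

`scoring="roc_auc"` makes the grid search select the parameter combination with the highest cross-validated AUC, matching the optimization criterion stated above.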
Finally, we further tested the resulting models on the set of 5,537 ex-VUS and compared their performance against the scores of commonly used prediction tools (retrieved from Ensembl VEP). First, we tested the models on the whole set of variants irrespective of consequence type. Then, for a fairer comparison with tools that yield scores only for specific consequence types (such as missense variants), we plotted ROC curves and calculated AUCs for the same models on the corresponding subsets of variants.
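The evaluation step, scoring a trained model on the whole held-out set and then on a consequence-restricted subset, can be sketched as follows. This is a schematic example on synthetic data: the model, seeds, and the subset mask are assumptions standing in for the real ex-VUS set and its consequence annotations.

```python
# Hedged sketch: ROC curve and AUC on a full test set, then AUC recomputed on
# a subset of variants, by analogy with the missense-only comparison in the text.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the variant data and a trained model.
X, y = make_classification(n_samples=400, n_features=8, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.20, random_state=1)
model = RandomForestClassifier(random_state=1).fit(X_tr, y_tr)

# Predicted probabilities for the positive class.
probs = model.predict_proba(X_te)[:, 1]

# Points for plotting the ROC curve, plus the AUC on the full test set.
fpr, tpr, _ = roc_curve(y_te, probs)
auc_all = roc_auc_score(y_te, probs)

# Hypothetical consequence-type mask: in the study, variants would instead be
# filtered to one consequence type (e.g. missense) before recomputing the AUC.
mask = np.arange(len(y_te)) % 2 == 0
auc_subset = roc_auc_score(y_te[mask], probs[mask])

print(round(auc_all, 3), round(auc_subset, 3))
```

Restricting both labels and scores through the same mask ensures the model is judged on exactly the variants a consequence-specific tool can score, which is the basis of the fairer comparison described above.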