Parameter tuning for different machine learning models
The dataset of 76,926 variants was split into a training set and a test set
with an 80:20 ratio. Using the scikit-learn library for Python 3, we
trained a model based on a Random Forest and a model based on a Support
Vector Machine with an RBF kernel. Hyperparameters were tuned using a
grid search with cross-validation, optimizing the area under the
ROC curve. The hyperparameters tuned for the Random Forest were the
maximum depth, the split criterion, and the number of estimators. The
hyperparameters tuned for the Support Vector Machine were the C and
gamma values. Additionally, a Five-Layer Perceptron was trained
using a batch size of 50 and 25 epochs with the Keras library for Python 3
and a TensorFlow backend. ReLU was chosen as the activation function
for the hidden layers and a sigmoid as the activation function for the
output layer.
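The grid search with cross-validation described above could be sketched as follows. The parameter grids, the random seed, and the stand-in data are illustrative assumptions, not the values used in the study:

```python
# Sketch of grid-search hyperparameter tuning optimizing ROC AUC.
# Parameter grids and data are illustrative assumptions only.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Stand-in data; the study used 76,926 annotated variants.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)  # 80:20 split

# Random Forest: maximum depth, split criterion, number of estimators.
rf_grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"max_depth": [5, 10, None],
                "criterion": ["gini", "entropy"],
                "n_estimators": [50, 100]},
    scoring="roc_auc", cv=5)
rf_grid.fit(X_train, y_train)

# SVM with RBF kernel: C and gamma.
svm_grid = GridSearchCV(
    SVC(kernel="rbf"),
    param_grid={"C": [0.1, 1, 10], "gamma": ["scale", 0.01, 0.1]},
    scoring="roc_auc", cv=5)
svm_grid.fit(X_train, y_train)

print(rf_grid.best_params_, svm_grid.best_params_)
```

With `scoring="roc_auc"`, scikit-learn evaluates each fold via the classifier's `decision_function` or class probabilities, so no separate probability calibration is required for the search itself.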
To assess model performance, we plotted the ROC curves and calculated
the areas under them (AUCs). We compared the models trained including
and excluding the 1000 Genomes Project global allele frequencies.
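The with/without-allele-frequency comparison might be sketched as below. The feature layout (allele frequency as the last column) and the data are assumptions for illustration:

```python
# Sketch: compare test-set AUC of models trained with and without an
# allele-frequency feature. Data and feature layout are assumptions.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, n_features=10, random_state=1)
# Assume the last column holds the 1000 Genomes global allele frequency.
feature_sets = {"with AF": X, "without AF": X[:, :-1]}

aucs = {}
for name, features in feature_sets.items():
    X_tr, X_te, y_tr, y_te = train_test_split(
        features, y, test_size=0.2, random_state=1)
    clf = RandomForestClassifier(n_estimators=100, random_state=1)
    clf.fit(X_tr, y_tr)
    # AUC from the predicted probability of the positive class.
    aucs[name] = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])

for name, auc in aucs.items():
    print(f"{name}: AUC = {auc:.3f}")
```

The same held-out split is reused for both feature sets so that the AUC difference reflects the feature change rather than sampling variation.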
Finally, we further tested the resulting models on the set of 5,537
ex-VUS and compared their performance against the scores of
commonly used prediction tools (retrieved from Ensembl VEP). First, we
tested the models on the whole set of variants irrespective of their
consequence type. Then, to make a fairer assessment against tools that
yield scores only for specific consequence types (such as missense
variants), we plotted the ROC curves and calculated the AUCs for the
same models but on the corresponding subsets of variants.
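The consequence-specific comparison could be sketched as follows. The table layout, column names, and scores are hypothetical stand-ins for the ex-VUS set and the VEP-retrieved tool scores:

```python
# Sketch: AUC on all variants vs. the missense-only subset, so tools
# scoring only missense variants are compared on equal footing.
# Column names and data are hypothetical.
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(2)
n = 300
df = pd.DataFrame({
    "label": rng.integers(0, 2, n),          # 1 = pathogenic
    "consequence": rng.choice(["missense", "synonymous"], n),
    "model_score": rng.random(n),            # our model's score
    "tool_score": rng.random(n),             # e.g. a VEP-retrieved score
})

results = {}
subsets = {"all": df, "missense": df[df["consequence"] == "missense"]}
for subset_name, d in subsets.items():
    for col in ["model_score", "tool_score"]:
        results[(subset_name, col)] = roc_auc_score(d["label"], d[col])
        print(subset_name, col, round(results[(subset_name, col)], 3))
```

Restricting both our model and the external tool to the same subset avoids penalizing tools that simply do not emit scores for other consequence types.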