2.6 Model training and validation
After
feature selection and visualisation (including potential
reclassification of behaviour types), the user can train a supervised
machine learning model (XGBoost in this package) on the selected, most
relevant features through the function train_model. The construction and
evaluation of a supervised machine learning model usually comprises
three steps: (i) hyperparameter tuning by cross-validation, (ii) model
training with the optimal hyperparameter set, and (iii) evaluation of
model performance through validation on a test dataset. train_model is a
wrapper function that uses the relevant functions from the “caret”
package to conduct these three steps automatically.
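These three steps can be sketched with “caret” directly (a minimal illustration of the general workflow, not the package's actual implementation; the objects features, labels, my_grid, test_features and test_labels are hypothetical placeholders):

```r
library(caret)

# (i) hyperparameter tuning by repeated cross-validation
ctrl  <- trainControl(method = "repeatedcv", number = 5, repeats = 2)
tuned <- train(x = features, y = labels, method = "xgbTree",
               trControl = ctrl, tuneGrid = my_grid)

# (ii) caret refits the final model on the full training data with the
#      optimal hyperparameter set, available as:
tuned$bestTune

# (iii) evaluate performance on the held-out test set
pred <- predict(tuned, newdata = test_features)
confusionMatrix(data = pred, reference = test_labels)
```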
Four
arguments of train_model control the training and validation process.
The features used for model building are specified through “df”, which
in the following example is set to “selection$features[1:6]” (i.e. the
first six features retained by the feature selection procedure). The
“vec_label” argument passes a vector of behaviour types. The
“hyper_choice” argument determines how the hyperparameter set is chosen
and has two options: “defaults” lets XGBoost use its default
hyperparameters (nrounds = 10), while “tune” runs repeated
cross-validations to find the best set (the candidate hyperparameter
values inside this function are based on our previous experience with a
range of different ACC datasets (Hui et al., in prep) and are set at:
nrounds = c(5, 10, 50, 100), max_depth = c(2, 3, 4, 5, 6), eta =
c(0.01, 0.1, 0.2, 0.3), gamma = c(0, 0.1, 0.5), colsample_bytree = 1,
min_child_weight = 1, subsample = 1). Finally, “train_ratio” determines
the proportion of the data used to train the model, the remainder being
used for model validation.
The
ultimate output consists of four parts. The first is a confusion
matrix depicting how well the final behaviour classification model
predicts the different behaviours, based on the validation part of the
dataset only (i.e. 25% of the dataset in our stork example, using a
train_ratio of 0.75). In this table, the observed behaviours are
organised in columns and the predicted behaviours in rows, so the
correct predictions appear on the diagonal and all wrong predictions
off the diagonal. The overall performance statistics are presented
next; their meaning is explained in detail at
<https://topepo.github.io/caret/measuring-performance.html>.
The third part of the output, statistics by class, presents a range of
performance statistics for the individual behavioural categories, which
are explained in detail at the same page.
Finally, the importance of the various features in the behaviour
classification model is presented.
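The first three parts of this output correspond to what “caret”’s confusionMatrix() reports, and its components can be inspected as follows (a generic caret illustration with made-up behaviour labels, not a call into train_model’s return value):

```r
library(caret)

# toy observed and predicted behaviour vectors
obs  <- factor(c("fly", "walk", "stand", "walk", "stand"))
pred <- factor(c("fly", "walk", "walk",  "walk", "stand"),
               levels = levels(obs))

cm <- confusionMatrix(data = pred, reference = obs)
cm$table      # confusion matrix (predicted in rows, observed in columns)
cm$overall    # overall performance statistics (accuracy, kappa, ...)
cm$byClass    # statistics by class (sensitivity, specificity, ...)
```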
Another
way of calculating and visualising the performance of the behavioural
classification model uses cross-validation through the function
plot_confusion_matrix. Here the entire dataset is randomly partitioned
into five parts. In five consecutive steps, each of the five parts
serves once as the validation set, while the remaining four parts are
used for model training. This procedure thus resembles a five-fold
repetition of the “classification model training and validation” with a
train_ratio of 0.8, except that the dataset is divided systematically,
so that each point in the dataset is used exactly once for validation
(see function createFolds in “caret” for more details). After all five
training and validation rounds, every behavioural observation therefore
has an associated predicted behaviour; these predictions are stored in
the data frame returned by plot_confusion_matrix, in addition to a plot
of the confusion table (Fig. 7).
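The five-fold partitioning described above can be sketched with “caret”’s createFolds (a simplified illustration under the assumption of pre-existing features and labels objects; the package’s internal implementation may differ, e.g. in how hyperparameters are handled within each fold):

```r
library(caret)

# five mutually exclusive validation sets, each a vector of row indices
folds <- createFolds(labels, k = 5)
predicted <- factor(rep(NA, length(labels)), levels = levels(labels))

for (idx in folds) {
  # train on the other four folds, predict the held-out fifth
  fit <- train(x = features[-idx, ], y = labels[-idx], method = "xgbTree")
  predicted[idx] <- predict(fit, newdata = features[idx, ])
}

# every observation now has an associated predicted behaviour
table(observed = labels, predicted = predicted)
```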