2.4 Self-Simulated Learning Artificial Intelligence (SSLAI) Analysis
An algorithm for comparing machine learning (ML) classification methods was developed in Python 3 using the sklearn 3.2 library [50]. Ten ML classification methods were employed: K-Neighbors Classifier, Support Vector Machine (SVM), Nu-Support Vector Classification (NuSVC), Decision Tree Classifier, Random Forest Classifier, AdaBoost Classifier, Gradient Boosting Classifier, Gaussian Naive Bayes, Linear Discriminant Analysis, and Quadratic Discriminant Analysis. A basic description of each ML method used is included in Table 1.
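As an illustration, the ten methods can be instantiated directly from scikit-learn. The sketch below is not the authors' code: default hyperparameters are an assumption (the text does not report them), and probability estimates are enabled for the two SVM variants because the Log Loss comparison described next requires predicted probabilities.

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC, NuSVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (RandomForestClassifier, AdaBoostClassifier,
                              GradientBoostingClassifier)
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                           QuadraticDiscriminantAnalysis)

# The ten ML classification methods compared by the algorithm.
# Default hyperparameters are an assumption; probability=True lets the
# SVM variants return the class probabilities needed for Log Loss.
CLASSIFIERS = [
    KNeighborsClassifier(),
    SVC(probability=True),
    NuSVC(probability=True),
    DecisionTreeClassifier(),
    RandomForestClassifier(),
    AdaBoostClassifier(),
    GradientBoostingClassifier(),
    GaussianNB(),
    LinearDiscriminantAnalysis(),
    QuadraticDiscriminantAnalysis(),
]
```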
The algorithm selected the most efficient ML method for predicting the data, namely the one with the highest accuracy and the lowest Log Loss value (defined below). To accomplish this, the algorithm was supplied with the simulated training data and the real test data; it then trained each of the ten methods, calculated the accuracy and the Log Loss, and compared the results.
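A minimal sketch of this train-and-compare step is given below. The function name, the data arguments, and the ranking rule (highest accuracy first, ties broken by lowest Log Loss) are assumptions for illustration; the text does not specify how the two criteria are combined.

```python
from sklearn.metrics import accuracy_score, log_loss

def select_best_method(classifiers, X_train, y_train, X_test, y_test):
    """Train each method on the simulated data, score it on the real
    test data, and return the scores ranked by accuracy and Log Loss."""
    results = []
    for clf in classifiers:
        clf.fit(X_train, y_train)        # simulated training data
        acc = accuracy_score(y_test, clf.predict(X_test))
        ll = log_loss(y_test, clf.predict_proba(X_test))
        results.append((clf.__class__.__name__, acc, ll))
    # Highest accuracy first; ties broken by lowest Log Loss (assumption).
    results.sort(key=lambda r: (-r[1], r[2]))
    return results
```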
The scikit-learn package [50] provides open-source implementations of efficient AI techniques for the Python programming language. These implementations are accessible to non-experts in ML and apply to various scientific disciplines. The calibration model was developed with simulated training data. The simulated training data are linear combinations of each spectrum of neat HE with each spectrum of neat soil, with the intensities of the two components weighted by complementary percentages. External evaluation of the predictions was accomplished using HE/soil mixtures; these data were not used in the calibration model. A schematic representation of the SSLAI analysis is shown in Fig. 1.
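The following sketch shows one way such linear combinations could be generated. The function name, the mixing fractions, and the labeling scheme (mixtures labeled as containing HE, neat soil as the negative class) are assumptions; the text states only that the spectra are combined with percentage-weighted intensities.

```python
import numpy as np

def simulate_training_data(he_spectra, soil_spectra,
                           fractions=np.linspace(0.1, 0.9, 9)):
    """Build simulated spectra as percentage-weighted linear combinations
    of every neat HE spectrum with every neat soil spectrum."""
    X, y = [], []
    for he in he_spectra:                 # each row: one neat HE spectrum
        for soil in soil_spectra:         # each row: one neat soil spectrum
            for p in fractions:
                X.append(p * he + (1.0 - p) * soil)  # intensity compensation
                y.append("HE")            # hypothetical class label
    for soil in soil_spectra:
        X.append(soil)                    # neat soil as negative class
        y.append("soil")                  # (assumption)
    return np.asarray(X), np.asarray(y)
```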
The parameters used to evaluate the performance of the classification model were recall, Log Loss, precision, F1-score, weighted average, support, and accuracy. In binary classification, recall of the positive class is also known as “sensitivity,” and recall of the negative class as “specificity.” The Log Loss function is used in (multinomial) logistic regression and extensions of it, such as neural networks, and is defined as the negative log-likelihood of the true labels given a probabilistic classifier’s predictions. Log Loss is defined only for two or more labels. For a single sample with true label $y_t \in \{0, 1\}$ and estimated probability $y_p$ that $y_t = 1$, the Log Loss is given by Eq. (1):
$-\log P(y_t \mid y_p) = -\bigl(y_t \log(y_p) + (1 - y_t)\log(1 - y_p)\bigr)$  (1)
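In scikit-learn these quantities are available directly; the sketch below assumes a fitted classifier `clf` and hypothetical test arrays `X_test` and `y_test`. `classification_report` returns precision, recall, F1-score, support, accuracy, and the weighted averages, while `log_loss` implements Eq. (1) averaged over the test samples.

```python
from sklearn.metrics import classification_report, log_loss

def evaluate_model(clf, X_test, y_test):
    """Print the per-class metrics and the Log Loss of a fitted classifier."""
    y_pred = clf.predict(X_test)        # hard class predictions
    y_prob = clf.predict_proba(X_test)  # probabilities required by Eq. (1)
    # Precision, recall, F1-score, support, accuracy, weighted averages.
    print(classification_report(y_test, y_pred))
    print("Log Loss:", log_loss(y_test, y_prob))
```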