4. Discussion
Fitting Allometric Regression Models
The above-ground biomass was strongly correlated with DBH, indicating that DBH is the most influential factor affecting tree biomass. Height is the second most important factor, also showing a strong correlation with biomass, while wood density was only weakly correlated with biomass. Thus, provided there is no multicollinearity between DBH and H, the combination of these two variables gives the best regression model. Conducting a correlation test between the response and explanatory variables is an important step in selecting appropriate explanatory variables and developing the best-fit regression model (Maraseni et al., 2005).
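The correlation screening described above can be sketched as follows. This is an illustrative example only: the variable names (`dbh`, `height`, `wd`, `agb`) and all values are invented, not the study's measurements; the synthetic biomass is generated so that DBH dominates, height contributes, and wood density adds little, mirroring the pattern reported here.

```python
import numpy as np

# Hypothetical tree data (assumed values, not the study's measurements).
rng = np.random.default_rng(42)
n = 60
dbh = rng.uniform(10, 80, n)                          # diameter at breast height, cm
height = 1.3 + 2.5 * dbh**0.6 + rng.normal(0, 2, n)   # total height, m
wd = rng.normal(0.6, 0.05, n)                         # wood density, g/cm^3

# Biomass driven mainly by DBH and height; wood density varies little.
agb = 0.05 * wd * (dbh**2 * height) ** 0.95 * rng.lognormal(0, 0.1, n)

# Pearson correlation of each candidate predictor with biomass
for name, x in [("DBH", dbh), ("H", height), ("WD", wd)]:
    r = np.corrcoef(x, agb)[0, 1]
    print(f"r(AGB, {name}) = {r:.3f}")
```

With data generated this way, the printed coefficients reproduce the qualitative ranking in the text: DBH strongest, height strong, wood density weak.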
Model fitting for Pouteria adolfi-friederici involved choosing the best model among the three statistically significant candidate models. The third, complete model (Model six) was selected over the first two nested models since the p-values of the coefficients of the complete model are significant (p < 0.05). In the case of nested models, a statistical test can be used to test one model against the other. The null hypothesis of this test is that θ = 0, i.e. the additional terms are not significant, which can also be expressed as the nested model being preferable to the complete model. If the p-value of this test falls below the significance level (typically 5%), the null hypothesis is rejected, i.e. the complete model is best. Conversely, if the p-value is above the significance threshold, the nested model is considered best (Picard et al., 2012). Since the best-fit model selected for P. adolfi-friederici involves two predictor variables, the absence of multicollinearity was also checked using the variance inflation factor (VIF). The VIF obtained was 1.8 for each predictor variable, which confirms that multicollinearity is not a problem in this model. A strong correlation between independent variables causes strong multicollinearity, by which the true effect of the estimated regression coefficients would be lost. The general rule of thumb is that VIFs exceeding 4 warrant further investigation and VIFs exceeding 10 are signs of serious multicollinearity requiring correction, while VIFs below 4 indicate the absence of a multicollinearity problem (Belsley et al., 1980). The adjusted R-squared of the model is 0.9511 (overall p-value < 0.001), showing that 95.11% of the variability in above-ground biomass is explained by the model.
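The VIF check follows directly from its definition, VIF_j = 1 / (1 − R_j²), where R_j² comes from regressing predictor j on the remaining predictors. A minimal numpy sketch, with hypothetical DBH and height values standing in for the study's data:

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X (predictors only,
    no intercept column). VIF_j = 1 / (1 - R_j^2), where R_j^2 is from
    regressing column j on the other columns with an intercept."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = []
    for j in range(p):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ beta
        r2 = 1 - resid.var() / y.var()
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

# Hypothetical DBH and height values (assumed, for illustration):
# moderately correlated predictors give VIFs below the cutoff of 4.
rng = np.random.default_rng(0)
dbh = rng.uniform(10, 80, 50)
height = 5 + 0.4 * dbh + rng.normal(0, 8, 50)
print(vif(np.column_stack([dbh, height])))
```

Note that with exactly two predictors the two VIFs are identical, since each equals 1 / (1 − r²) for the same pairwise correlation r.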
The ANOVA test conducted also shows that the overall allometric equation is statistically significant (F = 283.042, p < 0.001).

Model Validation
The best-fit regression models developed were validated against a series of assumptions, the most important being that the residuals are independent, follow a normal distribution, and have constant variance (homoscedasticity). Violation of these assumptions may result in biased parameter estimates and Type I errors (Quinn & Keough, 2002; Picard et al., 2012). The independence assumption was tested with the Durbin-Watson statistic, which ranges between 0 and 4; the residuals are considered uncorrelated (independent) if the statistic lies between 1.5 and 2.5 (Field, 2009). The Durbin-Watson statistic of the selected model is less than 2.5, indicating that its residuals are uncorrelated; the independence assumption is therefore met in this study. The independence assumption should be investigated prior to any interpretation of a multiple regression analysis, as its violation could hold critical implications (Stevens, 2009). Even a slight violation should be taken seriously, as it can greatly increase the risk of a Type I error, making the risk of falsely rejecting the null hypothesis several times greater than the error level assumed for the test (Stevens, 2009). The assumptions of normality and homoscedasticity were tested by inspecting a quantile-quantile (Q-Q) plot and a plot of the residuals against the fitted values. The residuals plotted against the fitted values of the best-fit model are randomly distributed around the horizontal line representing a residual error of zero; no distinct trend in the distribution of points is observed, which confirms the homoscedasticity assumption of the model.
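The Durbin-Watson statistic used above is simple to compute from the residual series: the sum of squared successive differences divided by the residual sum of squares. A short sketch with assumed residual series (not the study's data):

```python
import numpy as np

def durbin_watson(residuals):
    """Durbin-Watson statistic: sum of squared successive differences of
    the residuals divided by their sum of squares. Values near 2 indicate
    no first-order autocorrelation; 0 and 4 are the extremes."""
    e = np.asarray(residuals, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

rng = np.random.default_rng(1)

# Uncorrelated residuals: statistic lands near 2
independent = rng.normal(0, 1, 200)
print(durbin_watson(independent))

# Positively autocorrelated (AR(1)) residuals push the statistic toward 0
ar = np.zeros(200)
for t in range(1, 200):
    ar[t] = 0.9 * ar[t - 1] + rng.normal(0, 1)
print(durbin_watson(ar))
```

For a first-order autocorrelation ρ̂, the statistic is approximately 2(1 − ρ̂), which is why values near 2 correspond to independence.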
The standard Q-Q plot of the best-fit models also suggests that the residual errors are normally distributed.
Screening for normality is an important early step when conducting multiple regression, as the residuals are assumed to be normally distributed (Stevens, 2009; Tabachnick & Fidell, 2006). Non-normal distributions that are positively or negatively skewed, have large kurtosis, or contain extreme outliers can distort the obtained significance levels of the analysis by biasing the standard errors (Osborne & Waters, 2002).
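The skewness, kurtosis, and Q-Q screening described above can be computed directly; a minimal sketch using only numpy and the standard library, with assumed illustrative residuals (not the study's):

```python
import numpy as np
from statistics import NormalDist

def skew_excess_kurtosis(x):
    """Sample skewness and excess kurtosis; both are approximately zero
    for normally distributed residuals."""
    z = (np.asarray(x, float) - np.mean(x)) / np.std(x)
    return float(np.mean(z**3)), float(np.mean(z**4) - 3.0)

def qq_points(residuals):
    """Coordinates of a normal Q-Q plot: sorted residuals against standard
    normal quantiles at probabilities (i + 0.5) / n. Points close to a
    straight line indicate approximately normal residuals."""
    e = np.sort(np.asarray(residuals, float))
    n = len(e)
    theo = np.array([NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)])
    return theo, e

# Assumed residuals: normal draws stay close to the Q-Q line,
# while a positively skewed sample departs from it.
rng = np.random.default_rng(5)
normal_resid = rng.normal(0, 1, 300)
skewed_resid = rng.exponential(1, 300)
print(skew_excess_kurtosis(normal_resid))
print(skew_excess_kurtosis(skewed_resid))
```

A quick numerical summary of Q-Q linearity is the correlation between the two returned coordinate vectors, which is close to 1 for normal residuals.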
The assumption of homoscedasticity means that the variance of errors is equal and constant across all levels of the variables (Osborne & Waters, 2002; Stevens, 2009). Homoscedasticity is related to the assumption of normality because when the assumption of normality is met, the relationship between the variables is homoscedastic (Tabachnick & Fidell, 2006). Heteroscedasticity occurs when the variance of errors differs at different values of the independent variables (Osborne & Waters, 2002). Slight heteroscedasticity has little effect on significance tests; when it is marked, however, it can lead to serious distortions of the findings and seriously weaken the analysis, increasing the possibility of a Type I error for small sample sizes (Osborne & Waters, 2002).
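Beyond visual inspection, a common formal check is a Breusch-Pagan-style test. The sketch below uses the studentized (Koenker) form, regressing the squared residuals on the fitted values, with LM = n·R² referred to a chi-squared distribution with one degree of freedom; the data are assumed for illustration and are not the study's:

```python
import math
import numpy as np

def breusch_pagan(fitted, residuals):
    """Simplified (studentized) Breusch-Pagan test: regress squared
    residuals on the fitted values; LM = n * R^2 is compared against a
    chi-squared distribution with 1 degree of freedom. Small p-values
    signal heteroscedasticity."""
    f = np.asarray(fitted, float)
    u2 = np.asarray(residuals, float) ** 2
    A = np.column_stack([np.ones_like(f), f])
    beta, *_ = np.linalg.lstsq(A, u2, rcond=None)
    r2 = 1 - np.var(u2 - A @ beta) / np.var(u2)
    lm = len(f) * r2
    # chi-squared(1) survival function via the error function
    p = 1 - math.erf(math.sqrt(lm / 2))
    return lm, p

rng = np.random.default_rng(7)
x = rng.uniform(0, 10, 300)
fit = 2 + 3 * x
homo = rng.normal(0, 1, 300)            # constant error variance
hetero = rng.normal(0, 1, 300) * x      # spread grows with the fitted values
print(breusch_pagan(fit, homo)[1])      # large p: no evidence of heteroscedasticity
print(breusch_pagan(fit, hetero)[1])    # very small p: heteroscedasticity detected
```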
The scale-location plot for all models shows the square root of the standardized residuals (roughly, the square root of the relative error) as a function of the fitted values. Again, there is no obvious trend in this plot, which further confirms the absence of heteroscedasticity. Finally, the plot in the lower right (Figure 3) shows each point’s leverage, a measure of its importance in determining the regression result. Superimposed on the plot are contour lines of Cook’s distance, another measure of each observation’s influence on the regression. Smaller distances mean that removing the observation has little effect on the regression results; distances larger than 1 are suspicious and suggest a possible outlier or a poor model. In this study, the selected model exhibits Cook’s distances of less than 1, which confirms the absence of outliers.
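The leverage and Cook's distance values behind this plot come directly from the hat matrix of the OLS fit. A minimal numpy sketch, with assumed data standing in for the study's (a clean linear relationship, so every Cook's distance stays below the cutoff of 1):

```python
import numpy as np

def cooks_distance(X, y):
    """Leverage (hat-matrix diagonal) and Cook's distance for an OLS fit.
    D_i = e_i^2 / (p * s^2) * h_ii / (1 - h_ii)^2, where p is the number
    of model parameters and s^2 the residual mean square."""
    X = np.asarray(X, float)
    y = np.asarray(y, float)
    n, p = X.shape
    H = X @ np.linalg.inv(X.T @ X) @ X.T   # hat matrix
    h = np.diag(H)                         # leverage of each observation
    e = y - H @ y                          # residuals
    s2 = e @ e / (n - p)                   # residual mean square
    d = e**2 / (p * s2) * h / (1 - h) ** 2
    return h, d

# Hypothetical, well-behaved data (assumed for illustration)
rng = np.random.default_rng(3)
x = rng.uniform(10, 80, 40)
y = 1.5 + 0.8 * x + rng.normal(0, 2, 40)
X = np.column_stack([np.ones_like(x), x])   # intercept + slope design
h, d = cooks_distance(X, y)
print(d.max())   # well below 1: no influential outliers
```

A useful sanity check is that the leverages sum to the number of fitted parameters (here 2), since that sum is the trace of the hat matrix.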