Model Description
Genetic and environmental information were combined to construct trait
models using a penalised linear-mixed model framework with a LASSO-type
penalisation (Tibshirani, 1996) as implemented in the LMM-Lasso package
(Rakitsch et al., 2013). Regularisation through LASSO-type penalisation
reduces the risk of overfitting caused by the large number of predictors.
This linear-mixed model takes the form
$y = \sum_j X_j \beta_j + u + \varepsilon$, where $y$ is a vector of
individual trait values, $X$ is a matrix of daily minimum and maximum
temperatures whose columns $X_j$ have corresponding fixed effects
$\beta_j$, $u$ is the random effect of the genetic similarity between
pairs of individuals, and $\varepsilon$ is the vector of residuals. The
random effect $u$ is unobserved but assumed to be normally distributed
with $u \sim N(0, \sigma_g^2 K)$, where $K$ is the empirically computed
GSM and $\sigma_g^2$ is the variance explained by the genetic
similarity. The residual vector $\varepsilon$ is also normally
distributed, $\varepsilon \sim N(0, \sigma_e^2 I)$, where $I$ is the
identity matrix and $\sigma_e^2$ is the residual variance.
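Integrating out the random effect gives the marginal form of this model, which may make the role of the two variance components easier to see. The first line follows from the stated distributions, assuming $u$ and $\varepsilon$ are independent. The second line is only a sketch of a LASSO-penalised objective for the fixed effects: it assumes the variance components are held fixed, and $\lambda$ (a symbol not used in the text above) denotes the penalty weight. The estimation procedure actually used is that of the LMM-Lasso package.

```latex
\begin{aligned}
y &\sim N\!\left(\textstyle\sum_j X_j \beta_j,\; \sigma_g^2 K + \sigma_e^2 I\right), \\
\hat{\beta} &= \arg\min_{\beta}\; \tfrac{1}{2}\,(y - X\beta)^{\top}
  \left(\sigma_g^2 K + \sigma_e^2 I\right)^{-1}(y - X\beta)
  + \lambda \lVert \beta \rVert_1 .
\end{aligned}
```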
The initial model considered genetic and environmental variation
additively and independently (‘G+E model’), such that predicted reaction
norms across environments were identical for all genotypes. In order to
account for the non-linear influence of GxE on climate response, we
computed ADMIXTURE proportions (Alexander & Lange, 2011) for each plant
using k = 4 ancestral populations, which was found to be optimal
(Appendix S4). ADMIXTURE proportions were used to generate additional
predictors $X_{\mathrm{ADMIXTURE}}$. For $n$ genotypes and $r$
environmental variables, $X_{\mathrm{ADMIXTURE}}$ is the column-wise
Khatri-Rao product
$X_{\mathrm{ADMIXTURE}} = (F^{T} \otimes R^{T})^{T}$, where $F$ is the
$n \times k$ matrix of ADMIXTURE proportions and $R$ is the
$n \times r$ matrix of environmental predictors, so that each row of
$X_{\mathrm{ADMIXTURE}}$ is the Kronecker product of the corresponding
rows of $F$ and $R$. This produces an $n \times kr$ matrix of additional
predictors whose values are unique for each genotype-environment
combination. These predictors were included alongside the minimum and
maximum daily temperature (i.e. $R$) in the design matrix $X'$ to create
the ‘GxE model’, which takes the form
$y = \sum_j X'_j \beta_j + u + \varepsilon$. We also created ‘G only’
and ‘E only’ models to determine the relative contribution of each
component to prediction accuracy. These models are identical to the G+E
model but use a column vector of ones as $X$ and a square identity
matrix as $K$, respectively.
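The construction of the interaction predictors can be illustrated with a minimal sketch in plain Python. The function names and the toy values below are hypothetical and are not taken from the original analysis; the toy example uses k = 2 ancestral populations rather than the k = 4 used in the study, purely to keep the output small.

```python
def row_wise_khatri_rao(F, R):
    """Return the n x (k*r) matrix whose i-th row is kron(F[i], R[i]).

    This equals the transposed column-wise Khatri-Rao product of F^T and R^T.
    """
    if len(F) != len(R):
        raise ValueError("F and R must have the same number of rows")
    return [[f * e for f in f_row for e in r_row]
            for f_row, r_row in zip(F, R)]


def gxe_design_matrix(F, R):
    """Place the interaction predictors alongside the environmental predictors R."""
    X_admixture = row_wise_khatri_rao(F, R)
    return [list(r_row) + x_row for r_row, x_row in zip(R, X_admixture)]


# Toy data (hypothetical): n = 2 plants, k = 2 populations, r = 2 temperatures.
F = [[1.0, 0.0],   # plant 1 derives entirely from population 1
     [0.5, 0.5]]   # plant 2 is an even mix of both populations
R = [[10.0, 20.0],
     [12.0, 22.0]]

X_admixture = row_wise_khatri_rao(F, R)  # n x (k*r) interaction predictors
X_prime = gxe_design_matrix(F, R)        # n x (r + k*r) design matrix
```

Each environmental variable is scaled by each ancestry proportion, so two plants in the same environment receive different interaction predictors whenever their ADMIXTURE proportions differ.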
Assessing Model Performance
Internal Validation
Model performance was assessed through a random 10-fold cross validation
(‘internal validation’) with 9 folds of the data used to train the model
and the 10th fold used to test it. This was repeated
10 times, with each fold acting as the testing set once. Overall model
performance was quantified using the root mean square error (RMSE) as a
measure of error and the coefficient of determination between observed
and predicted values ($r^2$) as a measure of
accuracy.
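The internal validation procedure can be sketched as follows. This is an illustration under stated assumptions, not the original pipeline: `fit_predict` is a hypothetical callback standing in for model training and prediction, $r^2$ is taken to be the squared Pearson correlation between observed and predicted values, and the scores are computed on predictions pooled across folds. The text does not specify either of the last two choices.

```python
import math
import random


def k_fold_indices(n, k=10, seed=0):
    """Randomly partition range(n) into k folds of near-equal size."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def rmse(observed, predicted):
    """Root mean square error between observed and predicted values."""
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def r_squared(observed, predicted):
    """Squared Pearson correlation (an assumed definition of r^2)."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov ** 2 / (var_o * var_p)


def cross_validate(y, fit_predict, k=10, seed=0):
    """Collect one out-of-fold prediction per observation, then score them.

    fit_predict(train_idx, test_idx) is a hypothetical callback that trains
    on train_idx and returns predictions for test_idx in the same order.
    """
    predictions = [None] * len(y)
    for test_idx in k_fold_indices(len(y), k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(y)) if i not in held_out]
        for i, p in zip(test_idx, fit_predict(train_idx, test_idx)):
            predictions[i] = p
    return rmse(y, predictions), r_squared(y, predictions)


# Toy check with a stand-in "model" that overpredicts every value by one unit.
y = [float(i) for i in range(20)]


def biased_model(train_idx, test_idx):
    return [y[i] + 1.0 for i in test_idx]


score_rmse, score_r2 = cross_validate(y, biased_model)
```

The toy model shows why both metrics are reported: a constant bias leaves the correlation-based $r^2$ untouched while the RMSE registers the error.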
External Validation
External validation followed an ‘environmental blocking’ validation
strategy (Roberts et al., 2017) designed to assess out-of-sample
prediction accuracy. This involved training models on six plantings and
testing on the seventh to mimic validation on independent data. Results
of environmental blocking were also used to determine the effect of
different training set compositions on model performance.
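The environmental blocking splits can be sketched as a leave-one-planting-out loop. The planting labels below are hypothetical placeholders; only the count of seven plantings comes from the text.

```python
def leave_one_planting_out(planting_labels):
    """Yield (held_out, train_idx, test_idx) once for each distinct planting."""
    for held_out in sorted(set(planting_labels)):
        train_idx = [i for i, p in enumerate(planting_labels) if p != held_out]
        test_idx = [i for i, p in enumerate(planting_labels) if p == held_out]
        yield held_out, train_idx, test_idx


# Hypothetical labels: one entry per observation, seven plantings in total.
plantings = ["P1", "P1", "P2", "P3", "P3", "P4", "P5", "P6", "P7", "P7"]
splits = list(leave_one_planting_out(plantings))
```

Because every observation from the held-out planting is excluded from training, the test environment is never seen during fitting, unlike in random cross validation, where observations from every planting appear in each training set.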
Finally, we performed an empirical external validation using data from
an independent experiment. Korves and colleagues (2007) performed a
planting of A. thaliana in Rhode Island, USA in Spring 2003 (RS)
for which median DTB was reported. RS is geographically (North America
vs. Europe) and temporally (2003 vs. 2006-2007) distant from the
plantings in our data set, making it a novel environment. We predicted
DTB in RS for 77 genotypes using a model trained on 100% of our data,
with 2-metre air temperature records sourced from DAYMET (Thornton et
al., 2016) as the environmental predictors.