Explaining when plants flower with machine learning
There is now little doubt that machine learning (ML), in combination with remote sensing, will provide a step change in predictive skill for models of the earth system. For example, over the last few years, a wave of ML-based weather forecasting models has emerged, the best of which are comparable to current operational systems. As predictive applications mature, focus is turning to what we can learn from these models: data-driven scientific discovery.
Learning from ML is inherently more complicated than learning from traditional earth system models. Traditional models are composed of sets of equations that represent what we know about the physics, chemistry, and biology of the earth system. This makes it easier to disentangle the causal chains leading from one state to another, which is useful for formulating and testing hypotheses about how the system works. ML-based models, on the other hand, are often composed of a complex web of non-linear functions, which may or may not reflect physical laws.
Explainable AI (XAI) is the process of trying to extract understanding from this network of non-linear functions. The underlying premise of XAI is that by understanding how high-performing black-box ML models make their predictions, we can discover more about the processes they are predicting. One way of doing this, in both classification and regression tasks, is to rank the importance of the input features used for prediction and to use these rankings to make inferences about what drives dynamical systems.
We took this approach to a problem in the biosphere – trying to understand the environmental determinants of flowering time in plants. We used a range of meteorological and genetic information to develop a set of machine learning models (Random Forest, XGBoost and SVM) that can accurately predict when plants will flower. We then employed a technique called permutation importance to rank the influence each input feature had on predictive performance. Permutation importance works by randomly shuffling the values of one input feature at a time, which breaks its relationship with the target, and measuring how much the model error increases as a result.
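To make the idea concrete, here is a minimal sketch of permutation importance in plain Python. It is an illustration only, not the code used in the paper: the toy model, the synthetic data and the feature names (`feature_a`, `feature_b`, `feature_c`) are all hypothetical, and mean squared error is assumed as the error measure.

```python
import random


def mean_squared_error(y_true, y_pred):
    """Average squared difference between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def permutation_importance(predict, X, y, n_repeats=10, seed=0):
    """Mean increase in error when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = mean_squared_error(y, [predict(row) for row in X])
    importances = []
    for j in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            # Shuffle column j only, leaving every other feature untouched.
            column = [row[j] for row in X]
            rng.shuffle(column)
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            error = mean_squared_error(y, [predict(row) for row in X_perm])
            increases.append(error - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances


# Hypothetical toy data: the target depends strongly on the first feature,
# weakly on the second, and not at all on the third.
data_rng = random.Random(42)
X = [[data_rng.uniform(0, 1) for _ in range(3)] for _ in range(200)]
y = [5.0 * a + 1.0 * b for a, b, _ in X]


def toy_model(row):
    """Stand-in for a fitted model; it ignores the third feature."""
    return 5.0 * row[0] + 1.0 * row[1]


feature_names = ["feature_a", "feature_b", "feature_c"]
scores = permutation_importance(toy_model, X, y)
ranking = sorted(zip(feature_names, scores), key=lambda p: p[1], reverse=True)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Shuffling a feature the model relies on heavily degrades its predictions the most, so that feature ranks highest, while a feature the model ignores gets an importance of zero. In practice the same procedure would be applied to a fitted model on held-out data, and the shuffle repeated several times to average out randomness.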
The figure above, from our paper ‘A new framework for predicting and understanding flowering time for crop breeding’, shows the permutation importance assigned to each input feature for all three models. The x-axis lists the input features, from most to least important, and the y-axis shows their relative importance. Thermal time (a measure of accumulated temperature) was the most important determinant of when a plant flowers in all the models, with cumulative evaporation the second most important.
While all three models agree on the two most important predictors, it’s immediately clear that the two tree-based algorithms (Random Forest and XGBoost) assign feature importance in similar ways and that the SVM uses information differently. For example, both tree-based algorithms derived the majority of their predictive power from thermal time, while the SVM derived similar predictive power from thermal time and cumulative evaporation. The SVM also derived skill from aspects of relative humidity, while the two tree-based models did not.
Our results suggest that models with contrasting mathematical structures can use information in different ways, and that feature importance may, in some cases, be dependent on the structure of the ML model used. This has important implications for using feature importance to generate hypotheses about the drivers of physical processes in natural systems. Clearly, it is important to use ML models from different mathematical families to be confident that our conclusions are not simply the product of the mathematical structure we are imposing on reality.
Dr Chetan Deva
Paper ref: Deva, C., Dixon, L., Urban, M., Ramirez‐Villegas, J., Droutsas, I. and Challinor, A., 2024. A new framework for predicting and understanding flowering time for crop breeding. Plants, People, Planet, 6(1), pp.197-209.