Chapter 5 Model formulation

The aim of supervised ML is to find a model \(\hat{Y} = f(X)\) so that \(\hat{Y}\) agrees well with observations \(Y\). We typically start with a research question where \(Y\) is given - naturally - by the problem we are addressing and we have a data set at hand where one or multiple predictors (or features) \(X\) are recorded along with \(Y\). From our data, we have information about how GPP (ecosystem-level photosynthesis) depends on set of abiotic factors, mostly meteorological measurements.

5.1 Formula notation

In R, it is common to use the formula notation to specify the target and predictor variables. You have probably encountered formulas before, e.g., for a linear regression using the lm() function. To specify a linear regression model for GPP_NT_VUT_REF with three predictors SW_F_IN, VPD_F, and TA_F, we write:

lm(GPP_NT_VUT_REF ~ SW_F_IN + VPD_F + TA_F, data = ddf)

5.2 The generic train()

Actually, the way we formulate a model is independent of the algorithm, or engine that takes care of fitting \(f(X)\). As mentioned in Chapter 4 the R package caret provides a unified interface for using different ML algorithms implemented in separate packages. In other words, it acts as a wrapper for multiple different model fitting, or ML algorithms. This has the advantage that it unifies the interface (the way arguments are provided). caret also provides implementations for a set of commonly used tools for data processing, model training, and evaluation. We’ll use caret for model training with the function train() (more on model training in Chapter 6). Note however, that using a specific algorithm, which is implemented in a specific package outside caret, also requires that the respective package be installed and loaded. Using caret for specifying the same linear regression model as above, the base-R lm() function, can be done with caret in a generalized form as:

library(caret)
train(
  form = GPP_NT_VUT_REF ~ SW_IN_F + VPD_F + TA_F, 
  data = ddf %>% drop_na(), 
  method = "lm"
)
## Linear Regression 
## 
## 6179 samples
##    3 predictor
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 6179, 6179, 6179, 6179, 6179, 6179, ... 
## Resampling results:
## 
##   RMSE      Rsquared   MAE     
##   1.599712  0.6646581  1.241035
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE

Of course, this is an overkill compared to just writing lm(...). But the advantage of the unified interface is that we can simply replace the method argument to use a different ML algorithm. For example, to use a random forest model implemented by the ranger package, we can write:

## do not run
train(
  form = GPP_NT_VUT_REF ~ SW_IN_F + VPD_F + TA_F, 
  data = ddf, 
  method = "ranger",
  ...
)

The ... hints at the fact that there are a few more arguments to be specified for training a random forest model with ranger. More on that in Chapter 6.

5.3 Recipes

The recipes package provides another way to specify the formula and pre-processing steps in one go and is compatible with caret’s train() function. For the same formula as above, and an example where the data ddf_train is to be centered and scaled, we can specify the “recipe” using the tidyverse-style pipe operator as:

library(recipes)
pp <- recipe(GPP_NT_VUT_REF ~ SW_IN_F + VPD_F + TA_F, data = ddf_train) %>% 
  step_center(all_numeric(), -all_outcomes()) %>%
  step_scale(all_numeric(), -all_outcomes())

The first line assigns roles to the different variables. GPP_NT_VUT_REF is an outcome (in “recipes speak”). Then, we used selectors to apply the recipe step to several variables at once. The first selector, all_numeric(), selects all variables that are either integers or real values. The second selector, -all_outcomes() removes any outcome (target) variables from this recipe step.

The object pp can then be supplied to train() as its first argument:

## do not run
train(
  pp, 
  data = ddf_train, 
  method = "ranger",
  ...
)

As seen above for the pre-processing example, this does not return a standardized version of the data frame ddf_train, but rather the information that allows us to apply the same standardization also to other data sets.