Chapter 8 Solutions
Now that you are familiar with the basic steps for supervised machine learning, you can get your hands on the data yourself and implement code for addressing the modelling task outlined in Chapter ??.
8.1 Reading and cleaning
Read the CSV file "./data/FLX_CH-Dav_FLUXNET2015_FULLSET_DD_1997-2014_1-3.csv", select all variables with names ending with "_F", the variables "TIMESTAMP", "GPP_NT_VUT_REF", and "NEE_VUT_REF_QC", and drop all variables that contain "JSB" in their name. Then convert the variable "TIMESTAMP" to a date-time object with the function ymd() from the lubridate package, and interpret all values of -9999 as missing values. Then, set all values of "GPP_NT_VUT_REF" to missing if the corresponding quality control variable indicates that less than 90% are measured data points. Finally, drop the variable "NEE_VUT_REF_QC" - we won't use it anymore.
library(tidyverse)
ddf <- read_csv("./data/FLX_CH-Dav_FLUXNET2015_FULLSET_DD_1997-2014_1-3.csv") %>%
  ## select only the variables we are interested in
  select(starts_with("TIMESTAMP"),
         ends_with("_F"),       # all meteorological variables
         GPP_NT_VUT_REF,
         NEE_VUT_REF_QC,
         -contains("JSB")) %>%
  ## convert to a nice date object
  mutate(TIMESTAMP = lubridate::ymd(TIMESTAMP)) %>%
  ## set all -9999 to NA
  na_if(-9999) %>%
  ## drop QC variables (no longer needed), except NEE_VUT_REF_QC
  select(-ends_with("_QC"), NEE_VUT_REF_QC) %>%
  ## set GPP to NA where the QC variable indicates less than 90% measured data
  mutate(GPP_NT_VUT_REF = ifelse(NEE_VUT_REF_QC < 0.9, NA, GPP_NT_VUT_REF)) %>%
  select(-NEE_VUT_REF_QC)
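As a quick sanity check (an addition here, not required by the exercise), you can verify that no -9999 values remain and see how many values of "GPP_NT_VUT_REF" were set to missing:

## sketch of a sanity check: count remaining -9999 values (should be zero)
sum(ddf == -9999, na.rm = TRUE)

## summarise the cleaned target variable; the NA count reflects the quality filter
summary(ddf$GPP_NT_VUT_REF)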
8.2 Data splitting
Split the data into a training and a testing set that contain 70% and 30% of the total available data, respectively.
library(rsample)
set.seed(1982) # for reproducibility
split <- initial_split(ddf, prop = 0.7)
ddf_train <- training(split)
ddf_test <- testing(split)
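A brief check (added here, not part of the exercise) confirms that the split has the intended proportions:

## sketch: proportions of rows in the training and testing sets (roughly 0.7 and 0.3)
nrow(ddf_train) / nrow(ddf)
nrow(ddf_test) / nrow(ddf)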
8.3 Linear model
8.3.1 Training
Fit a linear regression model using the base-R function lm()
and the training set. The target variable is "GPP_NT_VUT_REF"
, and predictor variables are all available meterological variables in the dataset. Answer the following questions:
- What is the \(R^2\) of predicted vs. observed
"GPP_NT_VUT_REF"
? - Is the linear regression slope significantly different from zero for all predictors?
- Is a linear regression model with “poor” predictors removed better supported by the data than a model including all predictors?
## fit linear regression model
linmod_baser <- lm(
  form = GPP_NT_VUT_REF ~ .,
  data = ddf_train %>%
    drop_na() %>%
    select(-TIMESTAMP)
)
## show summary information of the model
summary(linmod_baser)
##
## Call:
## lm(formula = GPP_NT_VUT_REF ~ ., data = ddf_train %>% drop_na() %>%
## select(-TIMESTAMP))
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.1941 -1.0393 -0.1299 0.8859 7.7205
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.9570704 3.4840107 -0.275 0.783558
## TA_F 0.1633211 0.0108181 15.097 < 2e-16 ***
## SW_IN_F 0.0137226 0.0003914 35.064 < 2e-16 ***
## LW_IN_F 0.0207311 0.0012040 17.219 < 2e-16 ***
## VPD_F -0.1420886 0.0195509 -7.268 4.35e-13 ***
## PA_F -0.0402727 0.0404675 -0.995 0.319704
## P_F -0.0214314 0.0046177 -4.641 3.57e-06 ***
## WS_F -0.1418356 0.0398616 -3.558 0.000378 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.549 on 4135 degrees of freedom
## Multiple R-squared: 0.6904, Adjusted R-squared: 0.6899
## F-statistic: 1317 on 7 and 4135 DF, p-value: < 2.2e-16
## Models can be compared, for example by the Bayesian Information Criterion (BIC).
## It penalizes complex models and is optimal (lowest) where the trade-off between
## explanatory power on the training set vs. number of predictors is best. The BIC
## tends to be more conservative than the AIC.
BIC(linmod_baser)
## [1] 15450.89
The variable PA_F was not significant in the linear model. Therefore, we won’t use it for the models below.
## Fit an lm model on the same data, but with PA_F removed.
linmod_baser_nopaf <- lm(
  form = GPP_NT_VUT_REF ~ .,
  data = ddf_train %>%
    drop_na() %>%
    select(-TIMESTAMP, -PA_F)
)
## The R-squared is slightly lower here (0.6903) than in the model with all predictors (0.6904)
summary(linmod_baser_nopaf)
##
## Call:
## lm(formula = GPP_NT_VUT_REF ~ ., data = ddf_train %>% drop_na() %>%
## select(-TIMESTAMP, -PA_F))
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.1849 -1.0390 -0.1306 0.8867 7.7594
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.4102425 0.3135013 -14.068 < 2e-16 ***
## TA_F 0.1595814 0.0101445 15.731 < 2e-16 ***
## SW_IN_F 0.0136923 0.0003902 35.093 < 2e-16 ***
## LW_IN_F 0.0210424 0.0011626 18.099 < 2e-16 ***
## VPD_F -0.1400240 0.0194405 -7.203 6.98e-13 ***
## P_F -0.0212661 0.0046147 -4.608 4.18e-06 ***
## WS_F -0.1343263 0.0391409 -3.432 0.000605 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.549 on 4136 degrees of freedom
## Multiple R-squared: 0.6903, Adjusted R-squared: 0.6899
## F-statistic: 1537 on 6 and 4136 DF, p-value: < 2.2e-16
## ... but the BIC is clearly lower, indicating that the model with PA_F removed is better.
BIC(linmod_baser_nopaf)
## [1] 15443.55
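Because the two linear models are nested, they can also be compared with an F-test; this check is an addition to the solution above, not part of the original exercise.

## sketch: F-test comparing the reduced model against the full model;
## a large p-value suggests that dropping PA_F does not degrade the fit
anova(linmod_baser_nopaf, linmod_baser)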
Use caret and the function train() for fitting the same linear regression model (with all predictors) on the same data. Does it yield identical results as using lm() directly? You will have to set the argument trControl accordingly to avoid resampling and instead fit the model on all the data in ddf_train. You can use summary() also on the object returned by the function train().
library(caret)
linmod_caret <- train(
  form = GPP_NT_VUT_REF ~ .,
  data = ddf_train %>%
    drop_na() %>%
    select(-TIMESTAMP),
  method = "lm",
  trControl = trainControl(method = "none")
)
summary(linmod_caret)
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.1941 -1.0393 -0.1299 0.8859 7.7205
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.9570704 3.4840107 -0.275 0.783558
## TA_F 0.1633211 0.0108181 15.097 < 2e-16 ***
## SW_IN_F 0.0137226 0.0003914 35.064 < 2e-16 ***
## LW_IN_F 0.0207311 0.0012040 17.219 < 2e-16 ***
## VPD_F -0.1420886 0.0195509 -7.268 4.35e-13 ***
## PA_F -0.0402727 0.0404675 -0.995 0.319704
## P_F -0.0214314 0.0046177 -4.641 3.57e-06 ***
## WS_F -0.1418356 0.0398616 -3.558 0.000378 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.549 on 4135 degrees of freedom
## Multiple R-squared: 0.6904, Adjusted R-squared: 0.6899
## F-statistic: 1317 on 7 and 4135 DF, p-value: < 2.2e-16
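To confirm that caret fits the same model as lm() directly, the coefficients of the two models can be compared; this check is an addition to the solution. The fitted lm object is stored in the finalModel element of the object returned by train().

## sketch: compare coefficients of the base-R fit and the caret fit
all.equal(coef(linmod_baser), coef(linmod_caret$finalModel))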
8.3.2 Prediction
With the model containing all predictors and fitted on ddf_train, make predictions using first ddf_train and then ddf_test. Compute the \(R^2\) and the root-mean-square error, and visualise modelled vs. observed values to evaluate both predictions.

Do you expect the linear regression model trained on ddf_train to predict substantially better on ddf_train than on ddf_test? Why (not)?

Hints:

- To calculate predictions, use the generic function predict() with the argument newdata = ....
- The \(R^2\) can be extracted from the model object as summary(model_object)$r.squared, or is (as the RMSE) given in the metrics data frame returned by metrics() from the yardstick library.
- For visualising the model performance, consider a scatterplot, or (better) a plot that reveals the density of overlapping points. (We’re plotting information from over 4000 data points here!)
library(patchwork)
library(yardstick)
## made into a function to reuse code below
eval_model <- function(mod, df_train, df_test){

  ## add predictions to the data frames
  df_train <- df_train %>%
    drop_na() %>%
    mutate(fitted = predict(mod, newdata = .))

  df_test <- df_test %>%
    drop_na() %>%
    mutate(fitted = predict(mod, newdata = .))

  ## get metrics tables
  metrics_train <- df_train %>%
    yardstick::metrics(GPP_NT_VUT_REF, fitted)

  metrics_test <- df_test %>%
    yardstick::metrics(GPP_NT_VUT_REF, fitted)

  ## extract values from metrics tables
  rmse_train <- metrics_train %>%
    filter(.metric == "rmse") %>%
    pull(.estimate)
  rsq_train <- metrics_train %>%
    filter(.metric == "rsq") %>%
    pull(.estimate)

  rmse_test <- metrics_test %>%
    filter(.metric == "rmse") %>%
    pull(.estimate)
  rsq_test <- metrics_test %>%
    filter(.metric == "rsq") %>%
    pull(.estimate)

  ## visualise with a hexagon binning and a customised color scale,
  ## adding information of metrics as sub-titles
  gg1 <- df_train %>%
    ggplot(aes(GPP_NT_VUT_REF, fitted)) +
    geom_hex() +
    scale_fill_gradientn(
      colours = colorRampPalette(c("gray65", "navy", "red", "yellow"))(5)) +
    geom_abline(slope = 1, intercept = 0, linetype = "dotted") +
    labs(subtitle = bquote(italic(R)^2 == .(format(rsq_train, digits = 2)) ~~
                           RMSE == .(format(rmse_train, digits = 3))),
         title = "Training set") +
    theme_classic()

  gg2 <- df_test %>%
    ggplot(aes(GPP_NT_VUT_REF, fitted)) +
    geom_hex() +
    scale_fill_gradientn(
      colours = colorRampPalette(c("gray65", "navy", "red", "yellow"))(5)) +
    geom_abline(slope = 1, intercept = 0, linetype = "dotted") +
    labs(subtitle = bquote(italic(R)^2 == .(format(rsq_test, digits = 2)) ~~
                           RMSE == .(format(rmse_test, digits = 3))),
         title = "Test set") +
    theme_classic()

  return(gg1 + gg2)
}
eval_model(mod = linmod_baser, df_train = ddf_train, df_test = ddf_test)
Here, the function eval_model() returned an object that is made up of two plots (return(gg1 + gg2) in the function definition). This combination of plots by + is enabled by the patchwork library. The individual plot objects (gg1 and gg2) are returned by the ggplot() functions. The visualisation here is a density plot of hexagonal bins. It shows the number of points inside each bin, encoded by the color (see the legend “count”). We want the highest density of points along the 1:1 line (the dotted line): predictions match observations perfectly for points lying on it. Alternatively, we could use a scatterplot to visualise the model evaluation, but a large number of points would then overlie each other. Since typical machine learning applications use large amounts of data, evaluation plots commonly face this problem of overlapping points, and density plots are a solution.

Metrics are given in the subtitles of the plots. Note that the \(R^2\) and the RMSE measure different aspects of model-data agreement: here, they measure the correlation (fraction of variation explained) and the average error, respectively. We should generally consider multiple metrics, measuring different aspects of the prediction-observation fit, to evaluate models.
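As a sketch of the scatterplot alternative mentioned above (not part of the original solution), overplotting can be mitigated with point transparency instead of hexagonal binning:

## sketch: modelled vs. observed values on the training set, using transparency
## (alpha) to reveal the density of overlapping points
ddf_train %>%
  drop_na() %>%
  mutate(fitted = predict(linmod_baser, newdata = .)) %>%
  ggplot(aes(GPP_NT_VUT_REF, fitted)) +
  geom_point(alpha = 0.1) +
  geom_abline(slope = 1, intercept = 0, linetype = "dotted") +
  theme_classic()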
8.4 KNN
8.4.1 Check data
ddf_train %>%
  mutate(split = "train") %>%
  bind_rows(ddf_test %>%
              mutate(split = "test")) %>%
  pivot_longer(cols = 2:9, names_to = "variable", values_to = "value") %>%
  ggplot(aes(x = value, y = ..density.., color = split)) +
  geom_density() +
  facet_wrap(~variable, scales = "free")
8.4.2 Training
Fit two KNN models on ddf_train (excluding "PA_F"), one with \(k = 2\) and one with \(k = 30\), both without resampling. Use the RMSE as the loss function. Center and scale data as part of the pre-processing and model formulation specification using the function recipe().
library(recipes)

## model formulation and preprocessing specification
myrecipe <- recipe(
  GPP_NT_VUT_REF ~ TA_F + SW_IN_F + LW_IN_F + VPD_F + P_F + WS_F,
  data = ddf_train %>% drop_na()) %>%
  step_center(all_numeric(), -all_outcomes()) %>%
  step_scale(all_numeric(), -all_outcomes())

## fit model with k = 2
mod_knn_k2 <- train(
  myrecipe,
  data = ddf_train %>%
    drop_na(),
  method = "knn",
  trControl = trainControl("none"),
  tuneGrid = data.frame(k = c(2)),
  metric = "RMSE"
)

## fit model with k = 30
mod_knn_k30 <- train(
  myrecipe,
  data = ddf_train %>%
    drop_na(),
  method = "knn",
  trControl = trainControl("none"),
  tuneGrid = data.frame(k = c(30)),
  metric = "RMSE"
)
8.4.3 Prediction
With the two models fitted above, predict "GPP_NT_VUT_REF" for both the training and the testing sets, and evaluate them as above (metrics and visualisation).

Which model do you expect to perform better on the training set, and which to perform better on the testing set? Why?
## with k = 2
eval_model(
  mod_knn_k2,
  df_train = ddf_train,
  df_test = ddf_test
)

## with k = 30
eval_model(
  mod_knn_k30,
  df_train = ddf_train,
  df_test = ddf_test
)
8.4.4 Sample hyperparameters
Train a KNN model with hyperparameter (\(k\)) tuned, and with five-fold cross validation, using the training set. As the loss function, use RMSE. Sample the following values for \(k\): 2, 5, 10, 15, 18, 20, 22, 24, 26, 30, 35, 40, 60, 100. Visualise the RMSE as a function of \(k\).
Hint:

- Cross-validation results can be visualised with plot(model_object) or ggplot(model_object).
mod_knn <- train(
  myrecipe,
  data = ddf_train %>%
    drop_na(),
  method = "knn",
  trControl = trainControl(method = "cv", number = 5),
  tuneGrid = data.frame(k = c(2, 5, 10, 15, 20, 25, 30, 35, 40, 60, 100, 200, 300)),
  metric = "RMSE"
)

ggplot(mod_knn)
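The value of \(k\) selected by the cross-validation and the corresponding resampling metrics can be read off the returned caret object (an added note, not part of the original solution):

## sketch: best hyperparameter value and full table of cross-validation results
mod_knn$bestTune
mod_knn$results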
8.5 Random forest
8.5.1 Training
Fit a random forest model on ddf_train with all predictors excluding "PA_F", using five-fold cross-validation. Use RMSE as the loss function.
Hints:

- Use the package ranger, which implements the random forest algorithm.
- See the caret documentation for information about the hyperparameters available for tuning with caret (a sketch of tuning them is shown after the solution output below).
- Set the argument savePredictions = "final" of the function trainControl().
library(ranger)
## no pre-processing necessary
myrecipe <- recipe(
  GPP_NT_VUT_REF ~ TA_F + SW_IN_F + LW_IN_F + VPD_F + P_F + WS_F,
  data = ddf_train %>%
    drop_na())

rf <- train(
  myrecipe,
  data = ddf_train %>%
    drop_na(),
  method = "ranger",
  trControl = trainControl(method = "cv", number = 5, savePredictions = "final"),
  tuneGrid = expand.grid(.mtry = floor(6 / 3),
                         .min.node.size = 5,
                         .splitrule = "variance"),
  metric = "RMSE",
  replace = FALSE,
  sample.fraction = 0.5,
  num.trees = 2000,  # high number is ok since no hyperparameter tuning is done
  seed = 1982        # for reproducibility
)

rf
## Random Forest
##
## 4143 samples
## 8 predictor
##
## Recipe steps:
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 3315, 3315, 3315, 3315, 3312
## Resampling results:
##
## RMSE Rsquared MAE
## 1.396752 0.7485248 1.055638
##
## Tuning parameter 'mtry' was held constant at a value of 2
## Tuning
## parameter 'splitrule' was held constant at a value of variance
##
## Tuning parameter 'min.node.size' was held constant at a value of 5
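As referenced in the hints above, the ranger hyperparameters mtry, min.node.size, and splitrule can also be tuned rather than held constant. The following is a minimal sketch of such a tuning run, added here for illustration; the grid values and the reduced number of trees are arbitrary choices.

## sketch: tune mtry and min.node.size over a small grid with five-fold
## cross-validation; fewer trees keep the runtime of this sketch moderate
rf_tuned <- train(
  myrecipe,
  data = ddf_train %>%
    drop_na(),
  method = "ranger",
  trControl = trainControl(method = "cv", number = 5),
  tuneGrid = expand.grid(.mtry = c(2, 3, 4, 6),
                         .min.node.size = c(5, 20),
                         .splitrule = "variance"),
  metric = "RMSE",
  num.trees = 500
)

## RMSE as a function of the sampled hyperparameter values
ggplot(rf_tuned)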
8.5.2 Prediction
Evaluate the trained model on the training and on the test set, giving metrics and a visualisation as above.
How are differences in performance to be interpreted? Compare the performances of linear regression, KNN, and random forest, considering the evaluation on the test set.
eval_model(rf, df_train = ddf_train, df_test = ddf_test)
Show the model performance (metrics and visualisation) on the validation sets of all cross-validation folds combined.
Do you expect it to be more similar to the model performance on the training set or the testing set in the evaluation above? Why?
metrics_train <- rf$pred %>%
  yardstick::metrics(obs, pred)

rmse_train <- metrics_train %>%
  filter(.metric == "rmse") %>%
  pull(.estimate)
rsq_train <- metrics_train %>%
  filter(.metric == "rsq") %>%
  pull(.estimate)

rf$pred %>%
  ggplot(aes(obs, pred)) +
  geom_hex() +
  scale_fill_gradientn(
    colours = colorRampPalette(c("gray65", "navy", "red", "yellow"))(5)) +
  geom_abline(slope = 1, intercept = 0, linetype = "dotted") +
  labs(subtitle = bquote(italic(R)^2 == .(format(rsq_train, digits = 2)) ~~
                         RMSE == .(format(rmse_train, digits = 3))),
       title = "Validation folds in training set") +
  theme_classic()