Chapter 7 Exercises
Now that you are familiar with the basic steps for supervised machine learning, you can get your hands on the data yourself and implement code to address the modelling task outlined in Chapter ??.
7.1 Reading and cleaning
Read the CSV file "./data/FLX_CH-Dav_FLUXNET2015_FULLSET_DD_1997-2014_1-3.csv", select all variables with names ending in "_F", the variables "TIMESTAMP", "GPP_NT_VUT_REF", and "NEE_VUT_REF_QC", and drop all variables that contain "JSB" in their name. Then convert the variable "TIMESTAMP" to a date-time object with the function ymd() from the lubridate package, and interpret all values of -9999 as missing values. Then, set all values of "GPP_NT_VUT_REF" to missing if the corresponding quality control variable indicates that less than 90% are measured data points. Finally, drop the variable "NEE_VUT_REF_QC" - we won’t use it anymore.
## write your code here
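A minimal sketch of one possible solution, assuming the tidyverse and lubridate packages are available; the object name ddf is an assumption used in the sketches below.

```r
library(tidyverse)
library(lubridate)

ddf <- read_csv("./data/FLX_CH-Dav_FLUXNET2015_FULLSET_DD_1997-2014_1-3.csv") |>
  # keep variables ending in "_F", plus timestamp, target, and quality control variable
  select(ends_with("_F"), TIMESTAMP, GPP_NT_VUT_REF, NEE_VUT_REF_QC) |>
  # drop all variables containing "JSB" in their name
  select(-contains("JSB")) |>
  # convert the timestamp to a date object
  mutate(TIMESTAMP = ymd(TIMESTAMP)) |>
  # interpret -9999 as missing values
  mutate(across(where(is.numeric), ~ na_if(., -9999))) |>
  # set GPP to missing where less than 90% of the data are measured
  mutate(GPP_NT_VUT_REF = ifelse(NEE_VUT_REF_QC < 0.9, NA, GPP_NT_VUT_REF)) |>
  # drop the quality control variable - not needed anymore
  select(-NEE_VUT_REF_QC)
```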
7.2 Data splitting
Split the data into a training and a testing set that contain 70% and 30% of the total available data, respectively.
## write your code here
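One possible sketch using the rsample package (the split could equally be done with caret::createDataPartition()); it assumes ddf from the previous step and introduces the names ddf_train and ddf_test used below.

```r
library(rsample)

set.seed(1982)  # make the random split reproducible
split <- initial_split(ddf, prop = 0.7)
ddf_train <- training(split)
ddf_test  <- testing(split)
```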
7.3 Linear model
7.3.1 Training
Fit a linear regression model using the base-R function lm() and the training set. The target variable is "GPP_NT_VUT_REF", and the predictor variables are all available meteorological variables in the dataset. Answer the following questions:
- What is the \(R^2\) of predicted vs. observed "GPP_NT_VUT_REF"?
- Is the linear regression slope significantly different from zero for all predictors?
- Is a linear regression model with “poor” predictors removed better supported by the data than a model including all predictors?
## write your code here
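A sketch of one way to approach this, assuming ddf_train from the previous step; the names df_lm, linmod, and linmod_reduced, and the use of AIC-based stepwise selection for the last question, are choices made here.

```r
# fit on complete cases only, so that the stepwise selection below
# compares models on the same observations
df_lm <- ddf_train |>
  dplyr::select(-TIMESTAMP) |>
  tidyr::drop_na()

linmod <- lm(GPP_NT_VUT_REF ~ ., data = df_lm)
summary(linmod)  # R-squared and significance of the individual slopes

# one way to check whether removing "poor" predictors is better supported
# by the data: stepwise selection by AIC, then compare the two models
linmod_reduced <- step(linmod, trace = FALSE)
AIC(linmod, linmod_reduced)
```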
Use caret and the function train() for fitting the same linear regression model (with all predictors) on the same data. Does it yield identical results as using lm() directly? You will have to set the argument trControl accordingly to avoid resampling and instead fit the model on all the data in ddf_train. You can use summary() also on the object returned by the function train().
## write your code here
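A sketch using caret, assuming df_lm from the previous sketch; trainControl(method = "none") fits the model once on all of the supplied data without resampling.

```r
library(caret)

linmod_caret <- train(
  GPP_NT_VUT_REF ~ .,
  data      = df_lm,
  method    = "lm",
  trControl = trainControl(method = "none")  # no resampling, fit on all data
)
summary(linmod_caret)  # summary() also works on the train() object
```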
7.3.2 Prediction
With the model containing all predictors and fitted on ddf_train, make predictions using first ddf_train and then ddf_test. Compute the \(R^2\) and the root-mean-square error, and visualise modelled vs. observed values to evaluate both predictions.
Do you expect the linear regression model trained on ddf_train to predict substantially better on ddf_train than on ddf_test? Why (not)?
Hints:
- To calculate predictions, use the generic function predict() with the argument newdata = ....
- The \(R^2\) can be extracted from the model object as summary(model_object)$r.squared, or is (as the RMSE) given in the metrics data frame returned by metrics() from the yardstick library.
- For visualising the model performance, consider a scatterplot, or (better) a plot that reveals the density of overlapping points. (We’re plotting information from over 4000 data points here!)
## write your code here
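A sketch of one possible evaluation, assuming linmod, ddf_train, and ddf_test from above; eval_model() is a helper introduced here (it drops incomplete rows, hard-codes the target variable, and draws a density scatterplot).

```r
library(yardstick)
library(ggplot2)

eval_model <- function(mod, df) {
  # keep complete cases so that predictions are defined for every row
  df <- tidyr::drop_na(df)
  df$fitted <- predict(mod, newdata = df)

  # RMSE, R-squared, and MAE from yardstick::metrics()
  print(metrics(df, truth = GPP_NT_VUT_REF, estimate = fitted))

  # modelled vs. observed as a density plot (geom_bin2d handles overplotting)
  ggplot(df, aes(x = fitted, y = GPP_NT_VUT_REF)) +
    geom_bin2d() +
    geom_abline(slope = 1, intercept = 0, linetype = "dotted") +
    labs(x = "Modelled GPP", y = "Observed GPP")
}

eval_model(linmod, ddf_train)
eval_model(linmod, ddf_test)
```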
7.4 KNN
7.4.1 Check data
## write your code here
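The check is open-ended; one simple option, assuming ddf_train from above, is to look at summary statistics and at the distribution of every numeric variable.

```r
summary(ddf_train)

# histograms of all numeric variables to spot suspicious distributions
ddf_train |>
  dplyr::select(where(is.numeric)) |>
  tidyr::pivot_longer(cols = dplyr::everything()) |>
  ggplot2::ggplot(ggplot2::aes(value)) +
  ggplot2::geom_histogram(bins = 50) +
  ggplot2::facet_wrap(~ name, scales = "free")
```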
The variable PA_F looks weird and was not significant in the linear model. Therefore, we won’t use it for the models below.
7.4.2 Training
Fit two KNN models on ddf_train (excluding "PA_F"), one with \(k = 2\) and one with \(k = 30\), both without resampling. Use the RMSE as the loss function. Center and scale the data as part of the pre-processing and model formulation specification, using the function recipe().
## write your code here
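A sketch assuming ddf_train from above; df_knn, pp, mod_knn_k2, and mod_knn_k30 are names introduced here, and the recipe is passed directly as the first argument to caret::train().

```r
library(caret)
library(recipes)

df_knn <- ddf_train |>
  dplyr::select(-TIMESTAMP, -PA_F) |>
  tidyr::drop_na()

# model formulation plus centering and scaling of the predictors
pp <- recipe(GPP_NT_VUT_REF ~ ., data = df_knn) |>
  step_center(all_numeric(), -all_outcomes()) |>
  step_scale(all_numeric(), -all_outcomes())

mod_knn_k2 <- train(
  pp,
  data      = df_knn,
  method    = "knn",
  metric    = "RMSE",
  trControl = trainControl(method = "none"),  # no resampling
  tuneGrid  = data.frame(k = 2)
)

mod_knn_k30 <- train(
  pp,
  data      = df_knn,
  method    = "knn",
  metric    = "RMSE",
  trControl = trainControl(method = "none"),
  tuneGrid  = data.frame(k = 30)
)
```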
7.4.3 Prediction
With the two models fitted above, predict "GPP_NT_VUT_REF" for both the training and the testing sets, and evaluate the predictions as above (metrics and visualisation).
Which model do you expect to perform better on the training set and which to perform better on the testing set? Why? Do you find evidence for overfitting in any of the models?
## write your code here
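Reusing the eval_model() helper from the sketch above (it drops incomplete rows before predicting), assuming mod_knn_k2 and mod_knn_k30 from the previous sketch:

```r
eval_model(mod_knn_k2,  ddf_train)
eval_model(mod_knn_k2,  ddf_test)
eval_model(mod_knn_k30, ddf_train)
eval_model(mod_knn_k30, ddf_test)
```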
7.4.4 Sample hyperparameters
Train a KNN model with the hyperparameter \(k\) tuned using five-fold cross-validation on the training set. As the loss function, use the RMSE. Sample the following values for \(k\): 2, 5, 10, 15, 18, 20, 22, 24, 26, 30, 35, 40, 60, 100. Visualise the RMSE as a function of \(k\).
Hint:
- The cross-validation results can be visualised with plot(model_object) or ggplot(model_object).
## write your code here
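A sketch using five-fold cross-validation, assuming df_knn and the recipe pp from the sketch above; mod_knn_cv is a name introduced here.

```r
mod_knn_cv <- caret::train(
  pp,
  data      = df_knn,
  method    = "knn",
  metric    = "RMSE",
  trControl = caret::trainControl(method = "cv", number = 5),
  tuneGrid  = data.frame(
    k = c(2, 5, 10, 15, 18, 20, 22, 24, 26, 30, 35, 40, 60, 100)
  )
)

# RMSE of the cross-validation resamples as a function of k
ggplot2::ggplot(mod_knn_cv)
```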
7.5 Random forest
7.5.1 Training
Fit a random forest model on ddf_train with all predictors excluding "PA_F", using five-fold cross-validation. Use the RMSE as the loss function.
Hints:
- Use the package ranger, which implements the random forest algorithm.
- See here for information about hyperparameters available for tuning with caret.
- Set the argument savePredictions = "final" of the function trainControl().
## write your code here
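A sketch assuming ddf_train from above; df_rf, mod_rf, and the tuning grid values are assumptions made here (with method = "ranger", caret tunes mtry, splitrule, and min.node.size).

```r
df_rf <- ddf_train |>
  dplyr::select(-TIMESTAMP, -PA_F) |>
  tidyr::drop_na()

mod_rf <- caret::train(
  GPP_NT_VUT_REF ~ .,
  data      = df_rf,
  method    = "ranger",
  metric    = "RMSE",
  trControl = caret::trainControl(
    method          = "cv",
    number          = 5,
    savePredictions = "final"  # keep out-of-sample predictions of the final model
  ),
  tuneGrid = expand.grid(
    mtry          = c(2, 4, 6),   # example values, chosen here for illustration
    splitrule     = "variance",
    min.node.size = 5
  )
)
mod_rf
```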
7.5.2 Prediction
Evaluate the trained model on the training and on the test set, giving metrics and a visualisation as above.
How are differences in performance to be interpreted? Compare the performances of linear regression, KNN, and random forest, considering the evaluation on the test set.
## write your code here
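Reusing the eval_model() helper from above, assuming mod_rf from the previous sketch:

```r
eval_model(mod_rf, ddf_train)
eval_model(mod_rf, ddf_test)
```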
Show the model performance (metrics and visualisation) on the validation sets of all cross-validation folds combined.
Do you expect it to be more similar to the model performance on the training set or the testing set in the evaluation above? Why?
## write your code here
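Because savePredictions = "final" was set, the out-of-sample predictions of all cross-validation folds are stored in mod_rf$pred (columns pred and obs); a sketch of the combined evaluation, assuming mod_rf from above:

```r
cv_preds <- mod_rf$pred

# metrics over the pooled validation folds
yardstick::metrics(cv_preds, truth = obs, estimate = pred)

# modelled vs. observed for the pooled validation folds
ggplot2::ggplot(cv_preds, ggplot2::aes(x = pred, y = obs)) +
  ggplot2::geom_bin2d() +
  ggplot2::geom_abline(slope = 1, intercept = 0, linetype = "dotted") +
  ggplot2::labs(x = "Modelled GPP", y = "Observed GPP")
```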