Chapter 7 Exercises
Now that you are familiar with the basic steps for supervised machine learning, you can get your hands on the data yourself and implement code to address the modelling task outlined in Chapter ??.
7.1 Reading and cleaning
Read the CSV file "./data/FLX_CH-Dav_FLUXNET2015_FULLSET_DD_1997-2014_1-3.csv", select all variables with names ending in "_F", the variables "TIMESTAMP", "GPP_NT_VUT_REF", and "NEE_VUT_REF_QC", and drop all variables that contain "JSB" in their name. Then convert the variable "TIMESTAMP" to a date-time object with the function ymd() from the lubridate package, and interpret all values of -9999 as missing values. Then, set all values of "GPP_NT_VUT_REF" to missing if the corresponding quality control variable indicates that less than 90% are measured data points. Finally, drop the variable "NEE_VUT_REF_QC" - we won’t use it anymore.
## write your code here
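A minimal sketch of one possible solution, assuming the tidyverse and lubridate packages are available; the object name ddf is an assumption used in the sketches below.

```r
library(tidyverse)
library(lubridate)

ddf <- read_csv("./data/FLX_CH-Dav_FLUXNET2015_FULLSET_DD_1997-2014_1-3.csv") |>
  # keep variables ending in "_F", plus timestamp, target, and quality control variable
  select(ends_with("_F"), TIMESTAMP, GPP_NT_VUT_REF, NEE_VUT_REF_QC) |>
  # drop all variables containing "JSB" in their name
  select(-contains("JSB")) |>
  # convert the timestamp to a date object
  mutate(TIMESTAMP = ymd(TIMESTAMP)) |>
  # interpret -9999 as missing values
  mutate(across(where(is.numeric), ~ na_if(., -9999))) |>
  # set GPP to missing where less than 90% of the data are measured
  mutate(GPP_NT_VUT_REF = ifelse(NEE_VUT_REF_QC < 0.9, NA, GPP_NT_VUT_REF)) |>
  # drop the quality control variable - not needed anymore
  select(-NEE_VUT_REF_QC)
```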
7.2 Data splitting
Split the data into a training and a testing set that contain 70% and 30% of the total available data, respectively.
## write your code here
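One possible sketch using the rsample package (the split could equally be done with caret::createDataPartition()); it assumes ddf from the previous step and introduces the names ddf_train and ddf_test used below.

```r
library(rsample)

set.seed(1982)  # make the random split reproducible
split <- initial_split(ddf, prop = 0.7)
ddf_train <- training(split)
ddf_test  <- testing(split)
```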
7.3 Linear model
7.3.1 Training
Fit a linear regression model using the base-R function lm() and the training set. The target variable is "GPP_NT_VUT_REF", and the predictor variables are all available meteorological variables in the dataset. Answer the following questions:
- What is the \(R^2\) of predicted vs. observed "GPP_NT_VUT_REF"?
- Is the linear regression slope significantly different from zero for all predictors?
- Is a linear regression model with “poor” predictors removed better supported by the data than a model including all predictors?
## write your code here
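A sketch of one way to approach this, assuming ddf_train from the previous step; the names df_lm, linmod, and linmod_reduced, and the use of AIC-based stepwise selection for the last question, are choices made here.

```r
# fit on complete cases only, so that the stepwise selection below
# compares models on the same observations
df_lm <- ddf_train |>
  dplyr::select(-TIMESTAMP) |>
  tidyr::drop_na()

linmod <- lm(GPP_NT_VUT_REF ~ ., data = df_lm)
summary(linmod)  # R-squared and significance of the individual slopes

# one way to check whether removing "poor" predictors is better supported
# by the data: stepwise selection by AIC, then compare the two models
linmod_reduced <- step(linmod, trace = FALSE)
AIC(linmod, linmod_reduced)
```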
Use caret and the function train() for fitting the same linear regression model (with all predictors) on the same data. Does it yield identical results as using lm() directly? You will have to set the argument trControl accordingly to avoid resampling and instead fit the model on all the data in ddf_train. You can use summary() also on the object returned by the function train().
## write your code here
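A sketch using caret, assuming df_lm from the previous sketch; trainControl(method = "none") fits the model once on all of the supplied data without resampling.

```r
library(caret)

linmod_caret <- train(
  GPP_NT_VUT_REF ~ .,
  data      = df_lm,
  method    = "lm",
  trControl = trainControl(method = "none")  # no resampling, fit on all data
)
summary(linmod_caret)  # summary() also works on the train() object
```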
7.3.2 Prediction
With the model containing all predictors and fitted on ddf_train, make predictions using first ddf_train and then ddf_test. Compute the \(R^2\) and the root-mean-square error, and visualise modelled vs. observed values to evaluate both predictions.
Do you expect the linear regression model trained on ddf_train to predict substantially better on ddf_train than on ddf_test? Why (not)?
Hints:
- To calculate predictions, use the generic function predict() with the argument newdata = ....
- The \(R^2\) can be extracted from the model object as summary(model_object)$r.squared, or is (as the RMSE) given in the metrics data frame returned by metrics() from the yardstick library.
- For visualising the model performance, consider a scatterplot, or (better) a plot that reveals the density of overlapping points. (We’re plotting information from over 4000 data points here!)
## write your code here
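A sketch of one possible evaluation, assuming linmod, ddf_train, and ddf_test from above; eval_model() is a helper introduced here (it drops incomplete rows, hard-codes the target variable, and draws a density scatterplot).

```r
library(yardstick)
library(ggplot2)

eval_model <- function(mod, df) {
  # keep complete cases so that predictions are defined for every row
  df <- tidyr::drop_na(df)
  df$fitted <- predict(mod, newdata = df)

  # RMSE, R-squared, and MAE from yardstick::metrics()
  print(metrics(df, truth = GPP_NT_VUT_REF, estimate = fitted))

  # modelled vs. observed as a density plot (geom_bin2d handles overplotting)
  ggplot(df, aes(x = fitted, y = GPP_NT_VUT_REF)) +
    geom_bin2d() +
    geom_abline(slope = 1, intercept = 0, linetype = "dotted") +
    labs(x = "Modelled GPP", y = "Observed GPP")
}

eval_model(linmod, ddf_train)
eval_model(linmod, ddf_test)
```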
7.4 KNN
7.4.1 Check data
## write your code here
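The check is open-ended; one simple option, assuming ddf_train from above, is to look at summary statistics and at the distribution of every numeric variable.

```r
summary(ddf_train)

# histograms of all numeric variables to spot suspicious distributions
ddf_train |>
  dplyr::select(where(is.numeric)) |>
  tidyr::pivot_longer(cols = dplyr::everything()) |>
  ggplot2::ggplot(ggplot2::aes(value)) +
  ggplot2::geom_histogram(bins = 50) +
  ggplot2::facet_wrap(~ name, scales = "free")
```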
The variable PA_F looks weird and was not significant in the linear model. Therefore, we won’t use it for the models below.
7.4.2 Training
Fit two KNN models on ddf_train (excluding "PA_F"), one with \(k = 2\) and one with \(k = 30\), both without resampling. Use the RMSE as the loss function. Center and scale the data as part of the pre-processing and model formulation specification, using the function recipe().
## write your code here
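A sketch assuming ddf_train from above; df_knn, pp, mod_knn_k2, and mod_knn_k30 are names introduced here, and the recipe is passed directly as the first argument to caret::train().

```r
library(caret)
library(recipes)

df_knn <- ddf_train |>
  dplyr::select(-TIMESTAMP, -PA_F) |>
  tidyr::drop_na()

# model formulation plus centering and scaling of the predictors
pp <- recipe(GPP_NT_VUT_REF ~ ., data = df_knn) |>
  step_center(all_numeric(), -all_outcomes()) |>
  step_scale(all_numeric(), -all_outcomes())

mod_knn_k2 <- train(
  pp,
  data      = df_knn,
  method    = "knn",
  metric    = "RMSE",
  trControl = trainControl(method = "none"),  # no resampling
  tuneGrid  = data.frame(k = 2)
)

mod_knn_k30 <- train(
  pp,
  data      = df_knn,
  method    = "knn",
  metric    = "RMSE",
  trControl = trainControl(method = "none"),
  tuneGrid  = data.frame(k = 30)
)
```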
7.4.3 Prediction
With the two models fitted above, predict "GPP_NT_VUT_REF" for both the training and the testing sets, and evaluate the predictions as above (metrics and visualisation).
Which model do you expect to perform better on the training set and which to perform better on the testing set? Why? Do you find evidence for overfitting in any of the models?
## write your code here
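Reusing the eval_model() helper from the sketch above (it drops incomplete rows before predicting), assuming mod_knn_k2 and mod_knn_k30 from the previous sketch:

```r
eval_model(mod_knn_k2,  ddf_train)
eval_model(mod_knn_k2,  ddf_test)
eval_model(mod_knn_k30, ddf_train)
eval_model(mod_knn_k30, ddf_test)
```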
7.4.4 Sample hyperparameters
Train a KNN model with the hyperparameter \(k\) tuned using five-fold cross-validation on the training set. As the loss function, use the RMSE. Sample the following values for \(k\): 2, 5, 10, 15, 18, 20, 22, 24, 26, 30, 35, 40, 60, 100. Visualise the RMSE as a function of \(k\).
Hint:
- The cross-validation results can be visualised with plot(model_object) or ggplot(model_object).
## write your code here
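A sketch using five-fold cross-validation, assuming df_knn and the recipe pp from the sketch above; mod_knn_cv is a name introduced here.

```r
mod_knn_cv <- caret::train(
  pp,
  data      = df_knn,
  method    = "knn",
  metric    = "RMSE",
  trControl = caret::trainControl(method = "cv", number = 5),
  tuneGrid  = data.frame(
    k = c(2, 5, 10, 15, 18, 20, 22, 24, 26, 30, 35, 40, 60, 100)
  )
)

# RMSE of the cross-validation resamples as a function of k
ggplot2::ggplot(mod_knn_cv)
```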
7.5 Random forest
7.5.1 Training
Fit a random forest model on ddf_train with all predictors excluding "PA_F", using five-fold cross-validation. Use the RMSE as the loss function.
Hints:
- Use the package ranger, which implements the random forest algorithm.
- See here for information about hyperparameters available for tuning with caret.
- Set the argument savePredictions = "final" of the function trainControl().
## write your code here
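A sketch assuming ddf_train from above; df_rf, mod_rf, and the tuning grid values are assumptions made here (with method = "ranger", caret tunes mtry, splitrule, and min.node.size).

```r
df_rf <- ddf_train |>
  dplyr::select(-TIMESTAMP, -PA_F) |>
  tidyr::drop_na()

mod_rf <- caret::train(
  GPP_NT_VUT_REF ~ .,
  data      = df_rf,
  method    = "ranger",
  metric    = "RMSE",
  trControl = caret::trainControl(
    method          = "cv",
    number          = 5,
    savePredictions = "final"  # keep out-of-sample predictions of the final model
  ),
  tuneGrid = expand.grid(
    mtry          = c(2, 4, 6),   # example values, chosen here for illustration
    splitrule     = "variance",
    min.node.size = 5
  )
)
mod_rf
```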
7.5.2 Prediction
Evaluate the trained model on the training and on the test set, giving metrics and a visualisation as above.
How are differences in performance to be interpreted? Compare the performances of linear regression, KNN, and random forest, considering the evaluation on the test set.
## write your code here
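Reusing the eval_model() helper from above, assuming mod_rf from the previous sketch:

```r
eval_model(mod_rf, ddf_train)
eval_model(mod_rf, ddf_test)
```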
Show the model performance (metrics and visualisation) on the validation sets of all cross-validation folds combined.
Do you expect it to be more similar to the model performance on the training set or the testing set in the evaluation above? Why?
## write your code here
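Because savePredictions = "final" was set, the out-of-sample predictions of all cross-validation folds are stored in mod_rf$pred (columns pred and obs); a sketch of the combined evaluation, assuming mod_rf from above:

```r
cv_preds <- mod_rf$pred

# metrics over the pooled validation folds
yardstick::metrics(cv_preds, truth = obs, estimate = pred)

# modelled vs. observed for the pooled validation folds
ggplot2::ggplot(cv_preds, ggplot2::aes(x = pred, y = obs)) +
  ggplot2::geom_bin2d() +
  ggplot2::geom_abline(slope = 1, intercept = 0, linetype = "dotted") +
  ggplot2::labs(x = "Modelled GPP", y = "Observed GPP")
```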