Chapter 4 Pre-processing

Skewed data, outliers, and values covering multiple orders of magnitude can create difficulties for certain ML algorithms, e.g., K-nearest neighbours (KNN) or neural networks. Other algorithms, like tree-based methods (e.g., Random Forest), are more robust against such issues.

4.1 Dealing with missingness and bad data

Several ML algorithms require missing values to be removed. That is, if any of the cells in one row has a missing value, the entire row gets removed. Data may be missing for several reasons. Some yield random patterns of missing data, others do not. In the latter case, we can speak of informative missingness (Kuhn & Johnson, 2003), and the information it carries can be used for predictions. For categorical data, we may replace informatively missing values with "none" (instead of NA), while randomly missing data may be dropped altogether. Some ML algorithms (mainly tree-based methods, e.g., Random Forest) can handle missing values. However, when comparing the performance of alternative ML algorithms, they should be tested with the same data, so missing data should be removed beforehand.
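The two options described above can be sketched in base R. This is only an illustration on a small made-up data frame; the names df, temp, and soil are hypothetical and are not part of the data used in this chapter.

```r
# Hypothetical example data: one numeric and one categorical column with gaps
df <- data.frame(
  temp = c(12.1, NA, 9.8, 15.3),
  soil = c("clay", "sand", NA, "clay"),
  stringsAsFactors = FALSE
)

# Option 1: drop every row that contains at least one missing value
df_complete <- df[complete.cases(df), ]

# Option 2: treat missing categories as informative and recode NA as "none"
df_recoded <- df
df_recoded$soil[is.na(df_recoded$soil)] <- "none"
```

Note that the second option only touches the categorical column; the missing value in temp is still there and would need to be handled separately.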

Visualising missing data is essential for deciding whether to drop rows with missing data or, where dropping rows would remove too much data, to remove the affected predictors from the model instead. The cells with missing data in a data frame can be easily visualised, e.g., with vis_miss() from the visdat package.

library(visdat)
vis_miss(
  ddf,
  cluster = FALSE, 
  warn_large_data = FALSE
  )
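As a numeric complement to the plot, the share of missing values per column and the number of incomplete rows can be computed with base R. The sketch below uses a small hypothetical data frame (df with columns a, b, and c) rather than ddf.

```r
# Hypothetical data frame with missing values in two of three columns
df <- data.frame(
  a = c(1, NA, 3, 4),
  b = c(NA, NA, 3, 4),
  c = c(1, 2, 3, 4)
)

# Fraction of missing values in each column (helps decide on dropping predictors)
frac_missing_col <- colMeans(is.na(df))

# Number of rows that would be lost when dropping all incomplete rows
n_incomplete_rows <- sum(!complete.cases(df))
```

A column with a very high missing fraction is a candidate for removal as a predictor, whereas a small number of incomplete rows suggests that dropping rows is affordable.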

The question about what is “bad data” and whether or when it should be removed is often critical. Such decisions are important to keep track of and should be reported as transparently as possible in publications. In reality, where the data generation process may start in the field with actual human beings writing notes in a lab book, and where the human collecting the data is often not the same as the human writing the paper, it’s often difficult to keep track of such decisions. As a general principle, it is advisable to design data records such that decisions made during data processing remain transparent throughout all stages of the workflow, and that sufficient information is collected to enable later revisions of particularly critical decisions.

4.2 Standardization

Several algorithms explicitly require data to be standardized, that is, the values of all predictors must vary within a comparable range. The necessity of this step becomes obvious when considering neural networks: the activation functions of each node have to deal with standardized inputs. In other words, inputs have to vary over the same range, typically with a mean of zero and a standard deviation of one.

To get a quick overview of the distribution of all variables (columns) in our data frame, we can use the skimr package.

library(skimr)
knitr::kable(skim(ddf))

skim_type	skim_variable	n_missing	complete_rate	Date.min	Date.max	Date.median	Date.n_unique	numeric.mean	numeric.sd	numeric.p0	numeric.p25	numeric.p50	numeric.p75	numeric.p100	numeric.hist
Date	TIMESTAMP	0	1.0000000	1997-01-01	2014-12-31	2005-12-31	6574	NA	NA	NA	NA	NA	NA	NA	NA
numeric	GPP_NT_VUT_REF	395	0.9399148	NA	NA	NA	NA	3.218728	2.7569372	-4.22996	0.7730585	2.87334	5.44718	12.2567	▁▇▆▃▁
numeric	TA_F	0	1.0000000	NA	NA	NA	NA	3.517397	6.6562542	-21.92400	-1.5565000	3.44450	8.72200	20.6870	▁▂▇▇▂
numeric	SW_IN_F	0	1.0000000	NA	NA	NA	NA	150.785747	85.0156424	3.30300	78.2630000	136.67700	215.54125	365.8880	▆▇▆▅▂
numeric	LW_IN_F	0	1.0000000	NA	NA	NA	NA	269.771156	41.9073945	138.12500	239.3937500	272.62150	303.36150	364.9070	▁▃▇▇▂
numeric	VPD_F	0	1.0000000	NA	NA	NA	NA	2.865737	2.3936778	0.00100	0.9950000	2.23900	4.05775	16.5650	▇▃▁▁▁
numeric	PA_F	0	1.0000000	NA	NA	NA	NA	83.564688	0.7261651	80.37300	83.1600000	83.68300	84.07200	85.6330	▁▁▅▇▁
numeric	P_F	0	1.0000000	NA	NA	NA	NA	2.304499	5.7860345	0.00000	0.0000000	0.00000	1.60000	92.1000	▇▁▁▁▁
numeric	WS_F	0	1.0000000	NA	NA	NA	NA	1.991029	0.6604529	0.32800	1.5410000	1.92200	2.33900	6.5390	▃▇▁▁▁

We see, for example, that typical values of LW_IN_F are larger than values of VPD_F by a factor of about 100. KNN uses the distance to neighbouring points for predictions. In this case, any distance would be dominated by LW_IN_F, and distances in the “direction” of VPD_F, even when relatively large, would not be influential, neither for a Euclidean nor a Manhattan distance (see 2). In neural networks, activation functions take values in a given range (0-1). Thus, for both algorithms, data has to be standardized prior to model training.
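To make the dominance of LW_IN_F concrete, the following sketch compares the contribution of each variable to the squared Euclidean distance between two points, before and after standardization. The two points p1 and p2 are made up for illustration; the means and standard deviations are rounded from the skim() table above.

```r
# Two hypothetical observations within the ranges shown in the table
p1 <- c(LW_IN_F = 270, VPD_F = 1)
p2 <- c(LW_IN_F = 300, VPD_F = 9)

# Euclidean distance on the raw values
d_raw <- sqrt(sum((p1 - p2)^2))

# Share of each variable in the squared raw distance
share_raw <- (p1 - p2)^2 / sum((p1 - p2)^2)

# Standardize with (rounded) means and standard deviations from skim()
mu    <- c(LW_IN_F = 269.77, VPD_F = 2.87)
sigma <- c(LW_IN_F = 41.91,  VPD_F = 2.39)
z1 <- (p1 - mu) / sigma
z2 <- (p2 - mu) / sigma

# Share of each variable in the squared distance after standardization
share_std <- (z1 - z2)^2 / sum((z1 - z2)^2)
```

On the raw scale, LW_IN_F accounts for most of the squared distance, although the two points differ by less than one standard deviation in LW_IN_F and by more than three standard deviations in VPD_F. After standardization, the large difference in VPD_F dominates instead.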

Standardization is done, for example, by subtracting from each variable (that is, from all values in one column) its mean, and then dividing by the standard deviation of that variable. This way, the resulting standardized values are centered around 0 and scaled such that a value of 1 means that the data point is one standard deviation above the mean of the respective variable (column). When this is applied to all predictors individually, the absolute values of their variations can be directly compared, and only then can they be meaningfully used for determining distances.
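Centering and scaling can be written out in one line of base R. The vector x below is a made-up example; the comparison with scale() is included only as a cross-check.

```r
# Hypothetical values of a single predictor
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

# Center (subtract the mean), then scale (divide by the standard deviation)
x_std <- (x - mean(x)) / sd(x)

# Base R's scale() centers and scales in the same way by default
x_scale <- as.numeric(scale(x))
```

The standardized vector has a mean of zero and a standard deviation of one, up to floating-point precision.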

Standardization can be done not only by centering and scaling (as described above), but also by scaling to within range, where values are scaled such that the minimum value within each variable (column) is 0 and the maximum is 1.
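Scaling to within range can be sketched in the same way. Again, x is a hypothetical vector used for illustration only.

```r
# Hypothetical values of a single predictor
x <- c(3, 8, 5, 13)

# Map the minimum to 0 and the maximum to 1
x_range <- (x - min(x)) / (max(x) - min(x))
```

In caret, the corresponding option of preProcess() is method = "range".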

In order to avoid data leakage, centering and scaling have to be done separately for each split into training and validation data (more on that later). In other words, don’t center and scale the entire data frame with the mean and standard deviation derived from the entire data frame. Instead, center and scale with the mean and standard deviation derived from the training portion of the data, and apply the same transformation to the validation portion when evaluating.
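Done by hand, leakage-free standardization looks roughly as follows. This is a minimal sketch with hypothetical data frames train and valid, each with a single predictor x.

```r
# Hypothetical training and validation data
train <- data.frame(x = c(1, 2, 3, 4, 5))
valid <- data.frame(x = c(6, 8))

# Derive the standardization parameters from the training data only
mu    <- mean(train$x)
sigma <- sd(train$x)

# Apply the same parameters to both portions
train$x_std <- (train$x - mu) / sigma
valid$x_std <- (valid$x - mu) / sigma
```

The standardized validation values are not centered around zero here. That is expected: the validation data are expressed relative to the distribution of the training data, and no information from the validation data enters the transformation.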

The R package caret takes care of this. It provides a unified interface for using different ML algorithms implemented in separate packages. The pre-processing steps applied within each resampling fold can be specified using the function preProcess(). More on resampling in Chapter 6.

library(caret)
pp <- preProcess(ddf_train, method = c("center", "scale"))

As seen above for the feature engineering example, this does not return a standardized version of the data frame ddf_train. Rather, it returns the information that allows us to apply the same standardization to other data sets as well. In other words, we use the distribution of values in the data set to which we applied the function to determine the centering and scaling (here: mean and standard deviation).
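The object returned by preProcess() can then be applied to any data frame with predict(). The sketch below uses two small hypothetical data frames (train and valid) instead of ddf_train so that it runs on its own.

```r
library(caret)

# Hypothetical training and validation data with two predictors
train <- data.frame(a = c(1, 2, 3, 4, 5), b = c(10, 20, 30, 40, 50))
valid <- data.frame(a = c(6, 7),          b = c(60, 70))

# Learn centering and scaling parameters from the training data only
pp <- preProcess(train, method = c("center", "scale"))

# Apply the learned transformation to both data sets
train_std <- predict(pp, newdata = train)
valid_std <- predict(pp, newdata = valid)
```

Here, valid_std is standardized with the mean and standard deviation of train, which is the behaviour needed to avoid data leakage.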

4.3 More pre-processing

Depending on the algorithm and the data, additional pre-processing steps may be required. You can find more information about this in the great and freely available online tutorial Hands-On Machine Learning with R.

One such additional pre-processing step is imputation, where missing values are imputed (gap-filled), for example with the mean of the respective variable. Imputation is also prone to cause data leakage and must therefore be implemented as part of the resampling and training workflow. The recipes package offers a great way to deal with imputation (and also all other pre-processing steps). Here is a link to learn more about it.
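As a minimal sketch of mean imputation with recipes, assuming two small hypothetical data frames (train and valid, with predictor x and target y), the imputation value is estimated on the training data with prep() and then applied to other data with bake():

```r
library(recipes)

# Hypothetical training and validation data with gaps in the predictor x
train <- data.frame(x = c(1, 2, NA, 3), y = c(10, 20, 30, 40))
valid <- data.frame(x = c(NA, 5),       y = c(50, 60))

# Define the recipe: impute missing values of x with the mean of x
rec <- recipe(y ~ x, data = train)
rec <- step_impute_mean(rec, x)

# Estimate the imputation value (mean of x) from the training data only
rec_prep <- prep(rec, training = train)

# Apply the trained recipe to both data sets
train_imputed <- bake(rec_prep, new_data = train)
valid_imputed <- bake(rec_prep, new_data = valid)
```

The gap in valid is filled with the mean of x in train, not with the mean of x in valid, so no information from the validation data leaks into the pre-processing.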