# Prepare data for the analysis

In this section you will learn how to prepare the data to train models using SDMtune. We will use the virtualSp dataset included in the package and environmental predictors from the WorldClim dataset.

First let’s load some packages that we will use to visualize the data:

library(ggplot2)    # To plot locations
library(maps)       # To access useful maps
library(rasterVis)  # To plot raster objects

## Acquire environmental variables

We use the climate data of WorldClim version 1.4 (Hijmans et al. 2005) and the terrestrial ecoregions from WWF (Olson et al. 2001) included in the dismo package:

files <- list.files(path = file.path(system.file(package = "dismo"), "ex"),
pattern = "grd", full.names = TRUE)

We convert these files in a raster stack object that will be used later in the analysis:

predictors <- raster::stack(files)

There are nine environmental variables, eight continuous and one categorical:

names(predictors)
• Continuous variables
• bio1 Annual Mean Temperature
• bio5 Max Temperature of Warmest Month
• bio6 Min Temperature of Coldest Month
• bio7 Temperature Annual Range (bio5-bio6)
• bio8 Mean Temperature of Wettest Quarter
• bio12 Annual Precipitation
• bio16 Precipitation of Wettest Quarter
• bio17 Precipitation of Driest Quarter
• Categorical variables
• biome Terrestrial Ecoregions of the World

We can plot bio1 using the gplot function from the rasterVis package:

gplot(predictors$bio1) + geom_tile(aes(fill = value)) + coord_equal() + scale_fill_gradientn(colours = c("#2c7bb6", "#abd9e9", "#ffffbf", "#fdae61", "#d7191c"), na.value = "transparent", name = "°C x 10") + labs(title = "Annual Mean Temperature", x = "longitude", y = "latitude") + scale_x_continuous(expand = c(0, 0)) + scale_y_continuous(expand = c(0, 0)) + theme_minimal() + theme(plot.title = element_text(hjust = 0.5), axis.ticks.x = element_blank(), axis.ticks.y = element_blank()) ## Prepare presence and background locations Let’s load the SDMtune package: library(SDMtune) For demonstrating how to use SDMtune we use the random generated virtualSp dataset included in the package. help(virtualSp) p_coords <- virtualSp$presence
bg_coords <- virtualSp\$background

Plot the study area together with the presence locations:

ggplot(map_data("world"), aes(long, lat)) +
geom_polygon(aes(group = group), fill = "grey95", color = "gray40", size = 0.2) +
geom_jitter(data = p_coords, aes(x = x, y = y), color = "red",
alpha = 0.4, size = 1) +
labs(x = "longitude", y = "latitude") +
theme_minimal() +
theme(legend.position = "none") +
coord_fixed() +
scale_x_continuous(limits = c(-125, -32)) +
scale_y_continuous(limits = c(-56, 40))

To plot the background locations run the following code:

ggplot(map_data("world"), aes(long, lat)) +
geom_polygon(aes(group = group), fill = "grey95", color = "gray40", size = 0.2) +
geom_jitter(data = as.data.frame(bg_coords), aes(x = x, y = y),
color = "blue", alpha = 0.4, size = 0.5) +
labs(x = "longitude", y = "latitude") +
theme_minimal() +
theme(legend.position = "none") +
coord_fixed() +
scale_x_continuous(limits = c(-125, -32)) +
scale_y_continuous(limits = c(-56, 40))

## Create an SWD object

Before training a model we have to prepare the data in the correct format. The prepareSWD function creates an SWD object that stores the species’ name, the coordinates of the species at presence and absence/background locations and the value of the environmental variables at the locations. The argument categorical indicates which environmental variables are categorical. In our example biome is categorical (we can pass a vector if we have more than one categorical environmental variable). The function extracts the value of the environmental variables for each location and excludes those locations that have NA value for at least one environmental variable.

data <- prepareSWD(species = "Virtual species", p = p_coords, a = bg_coords,
env = predictors, categorical = "biome")

## Explore the SWD object

Let’s have a look at the SWD object that we have just created:

data

When we print an SWD object we get a bunch of information:

• the name of the class;
• the name of the species;
• the number of presence locations;
• the number of absence/background locations;
• the environmental variables available in the dataset:
• the name of the continuous environmental variables, if any;
• the name of the categorical environmental variables, if any.

The object contains four slots: @species, @coords @data and @pa. @pa contains a vector with 1 for presence and 0 for absence/background locations. To visualize the data we run:

head(data@data)

We can visualize the coordinates with:

head(data@coords)

or the name of the species with:

data@species

## Save an SWD object

We can save the SWD object in a .csv file using the function swd2csv (the function saves the file in the working directory). There are two possibilities:

• save the object in a single file with the column pa indicating if the location is a presence (1) or an absence/background (0) site
swd2csv(data, file_name = "data.csv")
• save the object in two separate files: one for the presence and the other for the absence/background locations
swd2csv(data, file_name = c("presence.csv", "background.csv"))

# Train a model

SDMtune supports four methods for model training:

• Artificial Neural Networks ANN, using the nnet package (Venables and Ripley 2002);
• Boosted Regression Trees BRT, using the gbm package (Greenwell et al. 2019);
• Maximum Entropy with two implementations:
• Maxent using the dismo package (Hijmans et al. 2017);
• Maxnet using the maxnet package (Phillips 2017);
• Random Forest RF, using the randomForest package (Liaw and Wiener 2002).

The code necessary to train a model is the same for all the implementations. We will show how to train a Maxent model, you can adapt the code for the other methods or check the presence absence models vignette.

## Train a model with default settings

We use the function train to train a Maxent model. We need to provide two arguments:

• method: “Maxent” in our case;
• data: the SWD object with the presence and background locations.
model <- train(method = "Maxent", data = data)

The function trains the model using default settings that are:

• linear, quadratic, product and hinge feature class combinations;
• regularization multiplier equal to 1;
• 500 algorithm iterations.

We will see later how to change the default settings, for the moment let’s have a look at the model object.

## Explore an SDMmodel object

The output of the function train is an object of class SDMmodel. Let’s print the model object we’ve just created:

model

When we print an SDMmodel object we get the following information:

• the name of the class;
• the method used to train the model;
• the name of the species;
• the number of presence locations;
• the number of absence/background locations;
• the model configurations:
• fc: the feature class combinations;
• reg: the regularization multiplier;
• iter: the number of iterations;
• the environmental variables used to train the model:
• the name of the continuous environmental variables, if any;
• the name of the categorical environmental variables, if any.

An SDMmodel object has two slots:

slotNames(model)
• data: an SWD object with the presence absence/background locations used to train the model;
• model: a Maxent object, in our case, with all the model configurations.

The slot model contains the configurations of the model plus other information used to make predictions.

slotNames(model@model)

The most important are: fc, reg and iter that contain the values of the model configuration.

## Train a model changing the default settings

The function train() accepts optional arguments that can be used to change the default model settings. In our previous example we could have trained the same model using:

model <- train(method = "Maxent", data = data, fc = "lqph", reg = 1, iter = 500)

In the following example we train a model using linear and hinge as feature class combination, 0.5 as regularization multiplier and 700 iterations:

model <- train(method = "Maxent", data = data, fc = "lh", reg = 0.5, iter = 700)

By default Maxent models are trained using the arguments “removeduplicates=false” and “addsamplestobackground=false”. The user should have the full control of the data used to train the model, so is expected that duplicated locations are already removed and that the presence locations are already included in the background locations, when desired. You can use the function thinData to remove duplicated locations and the function addSamplesToBg to add the presence locations to the background locations.

# Make prediction

New locations are predicted with the function predict. The function takes three main arguments:

• a trained model given as SDMmodel object;
• a new dataset, used to make prediction (can be a data.frame, an SWD object or a raster stack object);
• the output type, that for Maxent models can be: raw, logistic or cloglog.

Next we get the prediction for our training locations using the cloglog output type:

pred <- predict(model, data = data, type = "cloglog")

The output in this case is a vector containing all the predicted values for the training locations:

head(pred)

We can get the prediction only for the presence location with:

p <- data@data[data@pa == 1, ]
pred <- predict(model, data = p, type = "cloglog")
tail(pred)

For models trained with the Maxent method, the function performs the prediction in R without calling the MaxEnt Java software. This results in a faster computation for large datasets and might result in a slightly different output compared to the Java software.

## Create a distribution map

We can use the same function to create a distribution map starting from the predictors raster stack’ object:

map <- predict(model, data = predictors, type = "cloglog")

In this case the output is a raster object:

map

The map can be saved in a file directly when running the prediction, we just have to pass additional arguments to the predict function. In the next example we save the map in a file called “my_file” in the GeoTIFF format:

map <- predict(model, data = predictors, type = "cloglog", file = "my_map",
format = "GTiff")

The function predict has other arguments useful when predicting large datasets:

• progress: can be set to "text" to visualize a progress bar;
• extent: can be passed to reduce the prediction to the given extent.

In the next example we restrict the prediction to Chile and plot the prediction:

# First create the extent that surrounds Cile
e = raster::extent(c(-77, -60, -56, -15))
# Now use the extent to make the prediction
map_e <- predict(model, data = predictors, type = "cloglog", extent = e)

## Plot a distribution map

To plot the distribution map we can use the function plotPred:

plotPred(map)

The function plotPred plots a map with a color ramp similar to the one used by the MaxEnt Java software. We can pass additional arguments to customize the map. In the next example we provide a custom color ramp and we add a title to the legend:

plotPred(map, lt = "Habitat\nsuitability",
colorramp = c("#2c7bb6", "#abd9e9", "#ffffbf", "#fdae61", "#d7191c"))

## Plot a presence/absence map

To plot a presence/absence map we need a threshold value that splits the prediction in presence and absence values. The function thresholds returns some commonly used threshold values starting from an SDMmodel object. In the next example we print the threshold values using the type "cloglog":

ths <- thresholds(model, type = "cloglog")
ths

For example we want to create a presence/absence map using the threshold that maximize the training sensitivity plus specificity. We use the function plotPA passing the threshold value as argument:

plotPA(map, th = ths[3, 2])

We can also save the map in a file with the following code:

plotPA(map, th = ths[3, 2], filename = "my_pa_map", format = "GTiff")

Both functions plotPred and plotPA have the argument hrto plot the the map with high resolution, useful when the map is used in a scientific publication.

# Evaluate a model

SDMtune implements three evaluation metrics:

• AUC: Area Under the ROC curve (Fielding and Bell 1997)
• TSS: True Skill Statistic (Allouche, Tsoar, and Kadmon 2006)
• AICc: Akaike Information Criterion corrected for small sample size (Warren and Seifert 2011)

We will compute the value of the metrics on the training dataset.

## AUC

The AUC can be calculated using the function auc:

auc(model)

We can also plot the ROC curve using the function plotROC:

plotROC(model)

## TSS

The TSS is computed with the function tss:

tss(model)

## AICc

For the AICc we use the function aicc. In this case we need to pass to the env argument the predictors raster stack object:

aicc(model, env = predictors)

## Training and testing

It’s always a good practice to split the species locations into two parts and use one part to train the model and the remaining part to evaluate it. We can use the trainValTest function for this purpose. Let’s say we want to use 80% of the species locations to train our model and 20% as testing dataset to evaluate it:

library(zeallot)  # For unpacking assignment
c(train, test) %<-% trainValTest(data, test = 0.2, only_presence = TRUE,
seed = 25)

Now we train the model using the train dataset:

model <- train("Maxent", data = train)

The only_presence argument is used to split only the presence and not the background locations. We can now evaluate the model using the testing dataset that has not been used to train the model:

auc(model)
auc(model, test = test)

We can plot the ROC curve for both, training and testing datasets, with:

plotROC(model, test = test)

This approach is valid when we have a large dataset. In our case, with only few observations, the evaluation depends strongly on how we split our presence locations. Let’s run a small experiment in which we perform different train/test splits and we compute the AUC:

# Create an empty data.frame
output <- data.frame(matrix(NA, nrow = 10, ncol = 3))
colnames(output) <- c("seed", "trainAUC", "testAUC")
# Create 10 different random seeds
set.seed(25)
seeds <- sample.int(1000, 10)

# Loop through the seeds
for (i in 1:length(seeds)) {
# Make the train/test split
c(train, test) %<-% trainValTest(data, test = 0.2, seed = seeds[i],
only_presence = TRUE)
# train the model
m <- train("Maxent", data = train)
# Populate the output data.frame
output[i, 1] <- seeds[i]
output[i, 2] <- auc(m)
output[i, 3] <- auc(m, test = test)
}

# Print the output
output
# compute the range of the testing AUC
range(output[, 3])

The testing AUC varies for the different train/test partitions. When we have to deal with a small dataset a better approach is the cross validation.

## Cross validation

To perform a cross validation in SDMtune we have to pass the fold argument to the train function. First we have to create the folds. There are several way to create them, here we explain how to make a random partition of 4 folds using the function randomFolds:

folds <- randomFolds(data, k = 4, only_presence = TRUE, seed = 25)

The output of the function is a list containing two matrices, the first for the training and the second for the testing locations. Each column of one matrix represents a fold with TRUE for the locations included in and FALSE excluded from the partition.

Let’s perform a 4 fold cross validation using the Maxent method (note that we use the full dataset):

cv_model <- train("Maxent", data = data, folds = folds)
cv_model

The output in this case is an SDMmodelCV object. It contains the four trained models in the models slot and the fold partitions in the folds slot. We can compute the AUC of a SDMmodelCV object using:

auc(cv_model)
auc(cv_model, test = TRUE)

this returns the AUC value averaged across the four different models.

## Spatial Cross Validation

The train() function accepts folds created with two other packages:

• ENMeval (Muscarella et al. 2014)
• blockCV (Valavi et al. 2019)

The function will convert internally the created folds into the correct format for SDMtune. These packages have specific function to create folds partitions that are spatially or environmentally independent.

Block partition using the ENMeval package:

library(ENMeval)
block_folds <- get.block(occ = data@coords[data@pa == 1, ],
bg.coords = data@coords[data@pa == 0, ])
model <- train(method = "Maxent", data = data, fc = "l", reg = 0.8,
folds = block_folds)

Checkerboard1 partition using the ENMeval package:

cb_folds <- get.checkerboard1(occ = data@coords[data@pa == 1, ],
env = predictors,
bg.coords = data@coords[data@pa == 0, ],
aggregation.factor = 4)
model <- train(method = "Maxent", data = data, fc = "l", reg = 0.8,
folds = cb_folds)

Environmental block using the package blockCV:

library(blockCV)
library(raster)
# Create spatial points data frame
sp_df <- SpatialPointsDataFrame(data@coords,
data = as.data.frame(data@pa),
proj4string = crs(predictors))
e_folds <- envBlock(rasterLayer = predictors, speciesData = sp_df,
species = "[email protected]", k = 4, standardization = "standard",
rasterBlock = FALSE, numLimit = 100)
model <- train(method = "Maxent", data = data, fc = "l", reg = 0.8,
folds = e_folds)`

# References

Allouche, Omri, Asaf Tsoar, and Ronen Kadmon. 2006. Assessing the accuracy of species distribution models: prevalence, kappa and the true skill statistic (TSS).” Journal of Applied Ecology 43 (6): 1223–32. https://doi.org/10.1111/j.1365-2664.2006.01214.x.
Fielding, Alan H., and John F. Bell. 1997. A review of methods for the assessment of prediction errors in conservation presence/absence models.” Environmental Conservation 24 (1): 38–49.
Greenwell, Brandon, Bradley Boehmke, Jay Cunningham, and GBM Developers. 2019. gbm: Generalized Boosted Regression Models. R package version 2.1.5.” https://CRAN.R-project.org/package=gbm.
Hijmans, Robert J., Susan E. Cameron, Juan L. Parra, Peter G. Jones, and Andy Jarvis. 2005. Very high resolution interpolated climate surfaces for global land areas.” International Journal of Climatology 25 (15): 1965–78. https://doi.org/10.1002/joc.1276.
Hijmans, Robert J., Steven Phillips, John Leathwick, and Jane Elith. 2017. dismo: Species Distribution Modeling. R package version 1.1-4.” https://cran.r-project.org/package=dismo.
Liaw, Andy, and Matthew Wiener. 2002. Classification and Regression by randomForest.” R News 2 (3): 18–22.
Muscarella, Robert, Peter J. Galante, Mariano Soley-Guardia, Robert A. Boria, Jamie M. Kass, María Uriarte, and Robert P. Anderson. 2014. ENMeval: An R package for conducting spatially independent evaluations and estimating optimal model complexity for Maxent ecological niche models.” Methods in Ecology and Evolution 5 (11): 1198–1205. https://doi.org/10.1111/2041-210X.12261.
Olson, David M., Eric Dinerstein, Eric D. Wikramanayake, Neil D. Burgess, George V. N. Powell, Emma C. Underwood, Jennifer A. D’amico, et al. 2001. Terrestrial Ecoregions of the World: A New Map of Life on Earth.” BioScience 51 (11): 933. https://doi.org/10.1641/0006-3568(2001)051[0933:TEOTWA]2.0.CO;2.
Phillips, Steven. 2017. maxnet: Fitting ’Maxent’ Species Distribution Models with ’glmnet’. R package version 0.1.2. https://cran.r-project.org/package=maxnet.
Valavi, Roozbeh, Jane Elith, José J. Lahoz-Monfort, and Gurutzeta Guillera-Arroita. 2019. block CV: An r package for generating spatially or environmentally separated folds for k -fold cross-validation of species distribution models.” Methods in Ecology and Evolution 10 (2): 225–32. https://doi.org/10.1111/2041-210X.13107.
Venables, W N, and B. D. Ripley. 2002. Modern Applied Statistics with S. Fourth Edi. New York, NY: Springer.
Warren, Dan L., and Stephanie N. Seifert. 2011. Ecological niche modeling in Maxent: the importance of model complexity and the performance of model selection criteria.” Ecological Applications 21 (2): 335–42. https://doi.org/10.1890/10-1171.1.