library(occuR)
#> Loading required package: mgcv
#> Loading required package: nlme
#> This is mgcv 1.8-35. For overview type 'help("mgcv-package")'.
#> Loading required package: data.table
#> Loading required package: TMB
#> Loading required package: RcppEigen
#> Loading required package: Matrix
library(ggplot2)
The occuR
package is used to fit multi-occasion occupancy models where occupancy and detection probability can depend smoothly on covariates (using mgcv
as a back-end). Models are fit using TMB
(Template model builder).
In this vignette, the model is briefly explained, look at some example data, and how to perform an example analysis. The example data and fitted models can be accessed using data(example_analysis)
.
The multi-occasion occupancy survey is considered to have the following structure: space is divided into sites which are surveyed over one or more occasions by being visited one or more times on one or more occasions. What do these words mean?
Site : A site is a single spatial unit that is considered to either be occupied by the species or not.
Occasion : An occasion is a period of time over which occupancy is defined, i.e., a site is termed occupied if the target species uses that site at least once during the occasion.
Visit : A visit denotes a period of time where a site has been surveyed and it is recorded whether or not the target species was detected within the site during that time.
What period of time qualifies as an occasion and what qualifies as a visit is up to the discretion of the analyst: choose them in order for inference to be made on a temporal scale that is meaningful for the target species and the decision-making on which the inferece would have impact.
The model has two important probabilities:
\(\psi_{i, j}\) is the probability a site \(i\) is occupied in occasion \(j\);
\(p_{i,j,k}\) is the probability the target species is detected in site \(i\) in occasion \(j\) on visit \(k\) given the site is occupied in that occasion.
Notice, that occupancy probability can change by site or occasion (or by covariates that change with site or occasion or both); detection probability can change with site, occasion, and visit (and with covariates that vary with these). In particular, occupancy and detection probability can be a smooth function of covariates within the generalized additive modelling framework, see the mgcv
package.
This approach to multi-occasion occupancy modelling differs from the hidden Markov model approach where site occupancy is a Markov process over time; here, site occupancy is independent over time conditional on the occupancy probability — correlation of site occupancy over space and time is induced by allowing occupancy probability to be a smooth function of space and time.
Data should be formatted into two data tables. Data tables are R
objects created using the data.table
R package (which is optimised for large data sets). The two objects will be called visit_data
and site_data
.
Here is an example of the visit_data
data table:
visit_data
#> site occasion visit temp hab obs
#> 1: 1 1 1 15.98398199266 arable 0
#> 2: 1 1 2 15.00442978144 arable 0
#> 3: 1 1 3 17.54389157200 arable 0
#> 4: 1 1 4 11.97305751881 arable 0
#> 5: 1 1 5 6.46933726589 arable 0
#> ---
#> 4996: 100 5 6 12.72368785100 woodland 1
#> 4997: 100 5 7 9.14544302447 woodland 1
#> 4998: 100 5 8 17.98663985781 woodland 0
#> 4999: 100 5 9 14.77650803376 woodland 0
#> 5000: 100 5 10 23.67549798965 woodland 0
The columns which are essential are “site” (which site the visit took place on), “occasion” (in which occasion the visit took place), “visit” (which visit this refers to), and “obs” (whether or not the target species was detected).
The remaining columns are covariates defined for each visit. The covariate temp
is the air temperature on each visit since for this example there is an hypothesis this temperature affects the availability of the target species (it comes out only a specific temperatures) — notice this: detection probability is not only a model of how well recorders can detect species but also how available the species are for detection. Moreover, detection probability can model the occurrence of a species within occasion (for example if month or day of year were used as a covariate). The other covariate is hab
, the habitat type for that site, this does change over visits but must be included in visit_data
if I want to model how detection probability changes with habitat type. So, in general, all covariates that affect \(p\) must be in a column of visit_data
.
Let’s look at site_data
:
site_data
#> site occasion hab x y
#> 1: 1 1 arable 1.00910314396 0.1997705620675
#> 2: 1 2 woodland 1.00910314396 0.1997705620675
#> 3: 1 3 woodland 1.00910314396 0.1997705620675
#> 4: 1 4 woodland 1.00910314396 0.1997705620675
#> 5: 1 5 urban 1.00910314396 0.1997705620675
#> ---
#> 496: 100 1 woodland -2.09301820892 0.0829478309869
#> 497: 100 2 urban -2.09301820892 0.0829478309869
#> 498: 100 3 urban -2.09301820892 0.0829478309869
#> 499: 100 4 urban -2.09301820892 0.0829478309869
#> 500: 100 5 arable -2.09301820892 0.0829478309869
The essential columns in site_data
are “site”, “occasion” (occasions are numbered 1 upwards, but not every site must be included for every occasion). The remaining columns are covariates: hab
the habitat type and geographic coordinates x
and y
. Any covariates I want to use when modelling \(psi\) must be included as a column in site_data
.
Overall, it is important to remember that \(p\) can change for each visit and the information in each row of visit_data
allows us to predict \(p\) for each visit; \(psi\) can only change for each site on each occasion and so each row of site_data
contains the information to predict \(psi\).
The basic model has a constant for \(psi\) and \(p\).
m0 <- fit_occu(list(psi ~ 1, p ~ 1), visit_data, site_data)
The function fit_occu
is used to fit the model. It accepts a list of two formulae: one for \(psi\) and one for \(p\); it also requires the visit data and site data to be given (the order matters). Here I specify just an intercept model. When fitting, unless the print
argument is used in fit_occu
, some output is printed to the screen. This output is produced by the TMB
package and may be useful if model fitting fails.
m0
#> estimate sd lcl ucl
#> psi: -0.03992 0.08946 -0.2153 0.1354
#> p: 0.55410 0.04198 0.4718 0.6363
#> Total edf: 2 smooth_psi: 0 smooth_p: 0
The model output returns the estimate, standard deviation (of the estimator), and a \(95\%\) confidence interval (the confidence level can be controlled with the conf.level
argument for the summary
function).
The link function used for both probabilities is the logit function, so to translate to the response scale:
plogis(-0.03992)
#> [1] 0.490021325138
plogis(0.55410)
#> [1] 0.635086298169
Fixed effects (by which I mean no nonparametric smooths) you can include as with simple linear regression in R
with the lm
function.
For example, suppose I want to fit a model where \(psi\) depends on habitat:
m_psihab <- fit_occu(list(psi ~ hab, p ~ 1), visit_data, site_data)
m_psihab
#> estimate sd lcl ucl
#> psi: (Intercept) 0.2931 0.19280 -0.0847 0.6709
#> psi: haburban 0.1125 0.23600 -0.3500 0.5749
#> psi: habwoodland -1.2140 0.25870 -1.7210 -0.7066
#> p: 0.5541 0.04198 0.4718 0.6363
#> Total edf: 4 smooth_psi: 0 smooth_p: 0
We can compare this model to the basis model by AIC:
AIC(m0, m_psihab)
#> df AIC
#> m0 2 3912.22994411
#> m_psihab 4 3873.46141788
There is evidence to retain the habitat variable on \(psi\).
We can do the same with \(p\) and temperature:
m_temp <- fit_occu(list(psi ~ hab, p ~ temp), visit_data, site_data)
m_temp
#> estimate sd lcl ucl
#> psi: (Intercept) 0.29310 0.192800 -0.08470 0.67090
#> psi: haburban 0.11250 0.236000 -0.35000 0.57490
#> psi: habwoodland -1.21400 0.258700 -1.72100 -0.70660
#> p: (Intercept) 1.27300 0.115300 1.04700 1.49900
#> p: temp -0.05891 0.008649 -0.07586 -0.04196
#> Total edf: 5 smooth_psi: 0 smooth_p: 0
AIC(m_psihab, m_temp)
#> df AIC
#> m_psihab 4 3873.46141788
#> m_temp 5 3827.70077372
Again, some evidence of temperature effect. Yet, this is just a linear effect. It is possible temperature has a non-linear effect.
Let’s fit a model with a smooth effect for temperature on \(p\):
m_temps <- fit_occu(list(psi ~ hab, p ~ s(temp, bs = "cs")), visit_data, site_data)
m_temps
#> estimate sd lcl ucl
#> psi: (Intercept) 0.2931 0.19280 -0.08471 0.6709
#> psi: haburban 0.1125 0.23600 -0.35000 0.5749
#> psi: habwoodland -1.2140 0.25870 -1.72100 -0.7066
#> p: (Intercept) 0.5834 0.04393 0.49730 0.6695
#> Total edf: 12.9 smooth_psi: 0 smooth_p: 8.9
AIC(m_temps, m_temp)
#> df AIC
#> m_temps 12.8700238158 3761.34986007
#> m_temp 5.0000000000 3827.70077372
First, notice that (for now) only splines with basis of “cs” or “ts” are well defined for this package and you must specify one of these for every smooth.
Now, notice in the output for the model that it reports the effective degrees of freedom for the model as a whole (total edf) and for each parameter separately. The edf is not a whole number because smooths have parameters that are correlated and so have an effective degrees of freedom less than the total number of parameters in the model. This is well explained for generalised additive models as in mgcv
. The AIC is also corrected for the effective degrees of freedom of the smooth. We see that there is evidence that temperature has a non-linear effect.
One thing to note is the reported effective degrees of freedom for each parameter \(p\)/\(psi\) separately. When fitting smooths in mgcv
there is a argument k
that species the number of knots before penalization. See ?choose.k
in mgcv
for full information. In this case, k
is \(10\) meaning that the maximum degrees of freedom for p
is \(9\) (it is always \(k\) - 1 because smooths are constrained to sum to zero and so lose one degree of freedom automatically). We see for this model the edf for \(p\) is 8.9 which means it has reached the maximum allowable. This indicates we should increase k
to allow \(p\) to explore greater flexibility.
m_temps <- fit_occu(list(psi ~ hab, p ~ s(temp, bs = "cs", k = 20)), visit_data, site_data)
Once, the edf is clearly less than k
, we can continue with the analysis. For this example, we will continue with \(k = 10\).
We can also include smooths over time on either paramater:
m_pt <- fit_occu(list(psi ~ hab, p ~ s(temp, bs = "cs") + s(occasion, bs = "cs", k = 5)), visit_data, site_data)
m_psit <- fit_occu(list(psi ~ hab + s(occasion, bs = "cs", k = 5), p ~ s(temp, bs = "cs") + s(occasion, bs = "cs", k = 5)), visit_data, site_data)
AIC(m_temps, m_pt, m_psit)
#> df AIC
#> m_temps 12.8700238158 3761.34986007
#> m_pt 16.8593686836 3705.17669282
#> m_psit 19.1068704754 3711.76096822
We can see that AIC supports a model where \(p\) is occasion-varying but not \(psi\) — that is, there is no evidence that \(psi\) changes over time; no trend in occupancy.
Also, note that since ther are only five occasions, \(k = 5\) is a hard upper limit we must set.
We can also consider a smooth where \(psi\) changes by location \((x,y)\) and by occasion. One way to do this is to specify a two-dimensional smooth over space and a one-dimensional smooth over occasion that interaction in a tensor product spline. For this package, only “t2” tensor products are well-defined.
m_psi_xyt <- fit_occu(list(psi ~ t2(x, y, occasion, bs = c("ts", "cs"), d = c(2, 1)) + hab,
p ~ s(temp, bs = "cs") + s(occasion, bs = "cs", k = 5)), visit_data, site_data)
For this smooth, we specify d = c(2, 1)
to mean the first spline is a two-dimensional spline for space, corresponding to \((x,y)\), and a one-dimensional spline for occasion.
Suppose we decide to use model m_psi_xyt
to predict the estimated relationships. We shall rename the model m
for brevity. Below I show how to use the predict
function to plot these relationships. It works just as with the predict
function for lm
models or gam
models with the difference that you must supply both visit_data
and site_data
, providing newdata for these according to what you with the predict for.
# temperature effect
tempgr <- seq(-5, 25, 0.1)
pred_temp <- predict(m, data.table(occasion = 1, temp = tempgr), site_data, nboot = 1000)
ci <- apply(pred_temp$pboot, 2, quantile, prob = c(0.025, 0.975))
plot(tempgr, pred_temp$p, type = "l", lwd = 1.5, ylim = c(min(ci[1,]), max(ci[2,])), xlab = "Temperature", ylab = "Detection Probability")
lines(tempgr, ci[1,], lty = "dotted")
lines(tempgr, ci[2,], lty = "dotted")
# occasion effect
pred_occ <- predict(m, data.table(occasion = 1:nocc, temp = 18), site_data, nboot = 1000)
ci <- apply(pred_occ$pboot, 2, quantile, prob = c(0.025, 0.975))
plot(1:nocc, pred_occ$p, type = "b", pch = 19, lwd = 1.5, ylim = c(min(ci[1,]), max(ci[2,])), xlab = "Temperature", ylab = "Detection Probability")
lines(1:nocc, ci[1,], lty = "dotted")
lines(1:nocc, ci[2,], lty = "dotted")
# spatial effect
xgr <- seq(-2, 2.5, 0.1)
ygr <- seq(-2, 2.5, 0.1)
gr <- expand.grid(xgr, ygr)
pred_xy <- predict(m, visit_data, data.table(occasion = 1, x = gr[,1], y = gr[,2], hab = "arable"), nboot = 1000)
ggplot() +
geom_tile(aes(x = gr[,1], y = gr[,2], fill = pred_xy$psi)) +
theme_bw() +
scale_x_continuous("x") +
scale_y_continuous("y") +
scale_fill_viridis_c("Occupancy")
# spatio-temporal effect
xgr <- rep(gr[,1], nocc)
ygr <- rep(gr[,2], nocc)
tgr <- rep(1:nocc, each = nrow(gr))
pred_xyt <- predict(m, visit_data, data.table(occasion = tgr, x = xgr, y = ygr, hab = "arable"), nboot = 1000)
ggplot(data.frame(x = xgr, y = ygr, t = tgr, psi = pred_xyt$psi)) +
geom_tile(aes(x = x, y = y, group = t, fill = psi)) +
theme_bw() +
facet_wrap(~t) +
scale_x_continuous("x") +
scale_y_continuous("y") +
scale_fill_viridis_c("Occupancy")