library(occuR)
#> Loading required package: mgcv
#> Loading required package: nlme
#> This is mgcv 1.8-35. For overview type 'help("mgcv-package")'.
#> Loading required package: data.table
#> Loading required package: TMB
#> Loading required package: RcppEigen
#> Loading required package: Matrix
library(ggplot2)

The occuR package is used to fit multi-occasion occupancy models where occupancy and detection probability can depend smoothly on covariates (using mgcv as a back-end). Models are fit using TMB (Template model builder).

In this vignette, the model is briefly explained, look at some example data, and how to perform an example analysis. The example data and fitted models can be accessed using data(example_analysis).

Model

The multi-occasion occupancy survey is considered to have the following structure: space is divided into sites which are surveyed over one or more occasions by being visited one or more times on one or more occasions. What do these words mean?

Site : A site is a single spatial unit that is considered to either be occupied by the species or not.
Occasion : An occasion is a period of time over which occupancy is defined, i.e., a site is termed occupied if the target species uses that site at least once during the occasion.
Visit : A visit denotes a period of time where a site has been surveyed and it is recorded whether or not the target species was detected within the site during that time.

What period of time qualifies as an occasion and what qualifies as a visit is up to the discretion of the analyst: choose them in order for inference to be made on a temporal scale that is meaningful for the target species and the decision-making on which the inferece would have impact.

The model has two important probabilities:

\(\psi_{i, j}\) is the probability a site \(i\) is occupied in occasion \(j\);
\(p_{i,j,k}\) is the probability the target species is detected in site \(i\) in occasion \(j\) on visit \(k\) given the site is occupied in that occasion.

Notice, that occupancy probability can change by site or occasion (or by covariates that change with site or occasion or both); detection probability can change with site, occasion, and visit (and with covariates that vary with these). In particular, occupancy and detection probability can be a smooth function of covariates within the generalized additive modelling framework, see the mgcv package.

This approach to multi-occasion occupancy modelling differs from the hidden Markov model approach where site occupancy is a Markov process over time; here, site occupancy is independent over time conditional on the occupancy probability — correlation of site occupancy over space and time is induced by allowing occupancy probability to be a smooth function of space and time.

Example Data

Data should be formatted into two data tables. Data tables are R objects created using the data.table R package (which is optimised for large data sets). The two objects will be called visit_data and site_data.

Here is an example of the visit_data data table:

visit_data 
#>       site occasion visit           temp      hab obs
#>    1:    1        1     1 15.98398199266   arable   0
#>    2:    1        1     2 15.00442978144   arable   0
#>    3:    1        1     3 17.54389157200   arable   0
#>    4:    1        1     4 11.97305751881   arable   0
#>    5:    1        1     5  6.46933726589   arable   0
#>   ---                                                
#> 4996:  100        5     6 12.72368785100 woodland   1
#> 4997:  100        5     7  9.14544302447 woodland   1
#> 4998:  100        5     8 17.98663985781 woodland   0
#> 4999:  100        5     9 14.77650803376 woodland   0
#> 5000:  100        5    10 23.67549798965 woodland   0

The columns which are essential are “site” (which site the visit took place on), “occasion” (in which occasion the visit took place), “visit” (which visit this refers to), and “obs” (whether or not the target species was detected).

The remaining columns are covariates defined for each visit. The covariate temp is the air temperature on each visit since for this example there is an hypothesis this temperature affects the availability of the target species (it comes out only a specific temperatures) — notice this: detection probability is not only a model of how well recorders can detect species but also how available the species are for detection. Moreover, detection probability can model the occurrence of a species within occasion (for example if month or day of year were used as a covariate). The other covariate is hab, the habitat type for that site, this does change over visits but must be included in visit_data if I want to model how detection probability changes with habitat type. So, in general, all covariates that affect \(p\) must be in a column of visit_data.

Let’s look at site_data:

site_data
#>      site occasion      hab              x               y
#>   1:    1        1   arable  1.00910314396 0.1997705620675
#>   2:    1        2 woodland  1.00910314396 0.1997705620675
#>   3:    1        3 woodland  1.00910314396 0.1997705620675
#>   4:    1        4 woodland  1.00910314396 0.1997705620675
#>   5:    1        5    urban  1.00910314396 0.1997705620675
#>  ---                                                      
#> 496:  100        1 woodland -2.09301820892 0.0829478309869
#> 497:  100        2    urban -2.09301820892 0.0829478309869
#> 498:  100        3    urban -2.09301820892 0.0829478309869
#> 499:  100        4    urban -2.09301820892 0.0829478309869
#> 500:  100        5   arable -2.09301820892 0.0829478309869

The essential columns in site_data are “site”, “occasion” (occasions are numbered 1 upwards, but not every site must be included for every occasion). The remaining columns are covariates: hab the habitat type and geographic coordinates x and y. Any covariates I want to use when modelling \(psi\) must be included as a column in site_data.

Overall, it is important to remember that \(p\) can change for each visit and the information in each row of visit_data allows us to predict \(p\) for each visit; \(psi\) can only change for each site on each occasion and so each row of site_data contains the information to predict \(psi\).

Fitting the Basic Model

The basic model has a constant for \(psi\) and \(p\).

m0 <- fit_occu(list(psi ~ 1, p ~ 1), visit_data, site_data)

The function fit_occu is used to fit the model. It accepts a list of two formulae: one for \(psi\) and one for \(p\); it also requires the visit data and site data to be given (the order matters). Here I specify just an intercept model. When fitting, unless the print argument is used in fit_occu, some output is printed to the screen. This output is produced by the TMB package and may be useful if model fitting fails.

m0
#>       estimate      sd     lcl    ucl
#> psi:  -0.03992 0.08946 -0.2153 0.1354
#> p:     0.55410 0.04198  0.4718 0.6363
#> Total edf: 2  smooth_psi: 0   smooth_p: 0

The model output returns the estimate, standard deviation (of the estimator), and a \(95\%\) confidence interval (the confidence level can be controlled with the conf.level argument for the summary function).

The link function used for both probabilities is the logit function, so to translate to the response scale:

plogis(-0.03992) 
#> [1] 0.490021325138
plogis(0.55410)
#> [1] 0.635086298169

Fixed Effect Models

Fixed effects (by which I mean no nonparametric smooths) you can include as with simple linear regression in R with the lm function.

For example, suppose I want to fit a model where \(psi\) depends on habitat:

m_psihab <- fit_occu(list(psi ~ hab, p ~ 1), visit_data, site_data)

m_psihab
#>                  estimate      sd     lcl     ucl
#> psi: (Intercept)   0.2931 0.19280 -0.0847  0.6709
#> psi: haburban      0.1125 0.23600 -0.3500  0.5749
#> psi: habwoodland  -1.2140 0.25870 -1.7210 -0.7066
#> p:                 0.5541 0.04198  0.4718  0.6363
#> Total edf: 4  smooth_psi: 0   smooth_p: 0

We can compare this model to the basis model by AIC:

AIC(m0, m_psihab)
#>          df           AIC
#> m0        2 3912.22994411
#> m_psihab  4 3873.46141788

There is evidence to retain the habitat variable on \(psi\).

We can do the same with \(p\) and temperature:

m_temp <- fit_occu(list(psi ~ hab, p ~ temp), visit_data, site_data)

m_temp
#>                  estimate       sd      lcl      ucl
#> psi: (Intercept)  0.29310 0.192800 -0.08470  0.67090
#> psi: haburban     0.11250 0.236000 -0.35000  0.57490
#> psi: habwoodland -1.21400 0.258700 -1.72100 -0.70660
#> p: (Intercept)    1.27300 0.115300  1.04700  1.49900
#> p: temp          -0.05891 0.008649 -0.07586 -0.04196
#> Total edf: 5  smooth_psi: 0   smooth_p: 0
AIC(m_psihab, m_temp)
#>          df           AIC
#> m_psihab  4 3873.46141788
#> m_temp    5 3827.70077372

Again, some evidence of temperature effect. Yet, this is just a linear effect. It is possible temperature has a non-linear effect.

Smooth effects

Let’s fit a model with a smooth effect for temperature on \(p\):

m_temps <- fit_occu(list(psi ~ hab, p ~ s(temp, bs = "cs")), visit_data, site_data)

m_temps
#>                  estimate      sd      lcl     ucl
#> psi: (Intercept)   0.2931 0.19280 -0.08471  0.6709
#> psi: haburban      0.1125 0.23600 -0.35000  0.5749
#> psi: habwoodland  -1.2140 0.25870 -1.72100 -0.7066
#> p: (Intercept)     0.5834 0.04393  0.49730  0.6695
#> Total edf: 12.9   smooth_psi: 0   smooth_p: 8.9
AIC(m_temps, m_temp)
#>                    df           AIC
#> m_temps 12.8700238158 3761.34986007
#> m_temp   5.0000000000 3827.70077372

First, notice that (for now) only splines with basis of “cs” or “ts” are well defined for this package and you must specify one of these for every smooth.

Now, notice in the output for the model that it reports the effective degrees of freedom for the model as a whole (total edf) and for each parameter separately. The edf is not a whole number because smooths have parameters that are correlated and so have an effective degrees of freedom less than the total number of parameters in the model. This is well explained for generalised additive models as in mgcv. The AIC is also corrected for the effective degrees of freedom of the smooth. We see that there is evidence that temperature has a non-linear effect.

One thing to note is the reported effective degrees of freedom for each parameter \(p\)/\(psi\) separately. When fitting smooths in mgcv there is a argument k that species the number of knots before penalization. See ?choose.k in mgcv for full information. In this case, k is \(10\) meaning that the maximum degrees of freedom for p is \(9\) (it is always \(k\) - 1 because smooths are constrained to sum to zero and so lose one degree of freedom automatically). We see for this model the edf for \(p\) is 8.9 which means it has reached the maximum allowable. This indicates we should increase k to allow \(p\) to explore greater flexibility.

m_temps <- fit_occu(list(psi ~ hab, p ~ s(temp, bs = "cs", k = 20)), visit_data, site_data)

Once, the edf is clearly less than k, we can continue with the analysis. For this example, we will continue with \(k = 10\).

Temporal effects

We can also include smooths over time on either paramater:

m_pt <- fit_occu(list(psi ~ hab, p ~ s(temp, bs = "cs") + s(occasion, bs = "cs", k = 5)), visit_data, site_data)
m_psit <- fit_occu(list(psi ~ hab + s(occasion, bs = "cs", k = 5), p ~ s(temp, bs = "cs") + s(occasion, bs = "cs", k = 5)), visit_data, site_data)

AIC(m_temps, m_pt, m_psit)
#>                    df           AIC
#> m_temps 12.8700238158 3761.34986007
#> m_pt    16.8593686836 3705.17669282
#> m_psit  19.1068704754 3711.76096822

We can see that AIC supports a model where \(p\) is occasion-varying but not \(psi\) — that is, there is no evidence that \(psi\) changes over time; no trend in occupancy.

Also, note that since ther are only five occasions, \(k = 5\) is a hard upper limit we must set.

Spatio-temporal effects

We can also consider a smooth where \(psi\) changes by location \((x,y)\) and by occasion. One way to do this is to specify a two-dimensional smooth over space and a one-dimensional smooth over occasion that interaction in a tensor product spline. For this package, only “t2” tensor products are well-defined.

m_psi_xyt <- fit_occu(list(psi ~ t2(x, y, occasion, bs = c("ts", "cs"), d = c(2, 1)) + hab, 
                           p ~ s(temp, bs = "cs") + s(occasion, bs = "cs", k = 5)), visit_data, site_data)

For this smooth, we specify d = c(2, 1) to mean the first spline is a two-dimensional spline for space, corresponding to \((x,y)\), and a one-dimensional spline for occasion.

Predictions

Suppose we decide to use model m_psi_xyt to predict the estimated relationships. We shall rename the model m for brevity. Below I show how to use the predict function to plot these relationships. It works just as with the predict function for lm models or gam models with the difference that you must supply both visit_data and site_data, providing newdata for these according to what you with the predict for.

# temperature effect 
tempgr <- seq(-5, 25, 0.1)
pred_temp <- predict(m, data.table(occasion = 1, temp = tempgr), site_data, nboot = 1000)
ci <- apply(pred_temp$pboot, 2, quantile, prob = c(0.025, 0.975))
plot(tempgr, pred_temp$p, type = "l", lwd = 1.5, ylim = c(min(ci[1,]), max(ci[2,])), xlab = "Temperature", ylab = "Detection Probability")
lines(tempgr, ci[1,], lty = "dotted")
lines(tempgr, ci[2,], lty = "dotted")

# occasion effect 
pred_occ <- predict(m, data.table(occasion = 1:nocc, temp = 18), site_data, nboot = 1000)
ci <- apply(pred_occ$pboot, 2, quantile, prob = c(0.025, 0.975))
plot(1:nocc, pred_occ$p, type = "b", pch = 19, lwd = 1.5, ylim = c(min(ci[1,]), max(ci[2,])), xlab = "Temperature", ylab = "Detection Probability")
lines(1:nocc, ci[1,], lty = "dotted")
lines(1:nocc, ci[2,], lty = "dotted")

# spatial effect 
xgr <- seq(-2, 2.5, 0.1)
ygr <- seq(-2, 2.5, 0.1)
gr <- expand.grid(xgr, ygr)
pred_xy <- predict(m, visit_data, data.table(occasion = 1, x = gr[,1], y = gr[,2], hab = "arable"), nboot = 1000)

ggplot() + 
  geom_tile(aes(x = gr[,1], y = gr[,2], fill = pred_xy$psi)) + 
  theme_bw() + 
  scale_x_continuous("x") + 
  scale_y_continuous("y") + 
  scale_fill_viridis_c("Occupancy")

# spatio-temporal effect 
xgr <- rep(gr[,1], nocc)
ygr <- rep(gr[,2], nocc)
tgr <- rep(1:nocc, each = nrow(gr))
pred_xyt <- predict(m, visit_data, data.table(occasion = tgr, x = xgr, y = ygr, hab = "arable"), nboot = 1000)

ggplot(data.frame(x = xgr, y = ygr, t = tgr, psi = pred_xyt$psi)) + 
  geom_tile(aes(x = x, y = y, group = t, fill = psi)) + 
  theme_bw() + 
  facet_wrap(~t) + 
  scale_x_continuous("x") + 
  scale_y_continuous("y") + 
  scale_fill_viridis_c("Occupancy")