Assigned Reading:
For our model selection lab, we will be using data from the lterdatasampler R package. Specifically, we will be using the pie_crab dataset, which includes records of “Fiddler crab body size in salt marshes from Florida to Massachusetts, USA at PIE and VCR LTER and NOAA NERR sites during summer 2016.” Per the dataset description:
“We collected ~30 male, adult Minuca pugnax from thirteen marshes from Florida to Massachusetts and measured their carapace width with calipers. Water and air temperature data were collected from monitoring programs (i.e., LTER, NERR sites), nearby weather stations, and ocean buoys for 2016.”
Our goal is to test multiple hypotheses concerning drivers of fiddler crab body size. We will do so using the model selection/multi-model inference framework. Let’s get started!
# List of packages necessary to run this script:
require(librarian, quietly = TRUE)
shelf(tidyverse, cowplot, performance,
      AICcmodavg,      # For model selection, model averaging
      lterdatasampler, # For LTER data
      lib = tempdir(),
      quiet = TRUE)
# NOTE: there are other packages for model selection/model averaging work
# (e.g., MuMIn) that have strengths and weaknesses relative to AICcmodavg. Feel
# free to check them out!
# Read in a data file
data("pie_crab")
As always, take some time to explore the data. No need to do every data exploration step we learned from our data exploration lab, but I suggest at least familiarizing yourself with the response variable (i.e., pie_crab$size) and the covariates. Also, read through the variable descriptions on the pie_crab dataset page.
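If you want a quick starting point, a minimal sketch of that exploration might look like the chunk below (these are just a few options; explore however you like):
# Structure of the data and a summary of the response variable
str(pie_crab)
summary(pie_crab$size)

# Distribution of carapace width (ggplot2 is loaded via the tidyverse)
ggplot(pie_crab, aes(x = size)) +
  geom_histogram(bins = 30)

# Body size across the latitudinal gradient
ggplot(pie_crab, aes(x = latitude, y = size)) +
  geom_point(alpha = 0.5)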
As we learned in lecture, the model selection / multi-model inference framework is built on a “strong inference” philosophy. That is, we should be testing multiple hypotheses simultaneously. Given that our goal is to determine drivers of fiddler crab body size, we need to generate hypotheses that explain body size using the collected covariates. For instance, body size might change (perhaps nonlinearly) with latitude, track the variability of water or air temperature, or differ among sites. The code below translates a few such ideas into candidate models:
NOTE: It’s worth saying that it’s best to generate these hypotheses (at least loosely) before one collects any data!
# Square the latitude column to create a quadratic term.
pie_crab$latitude_2 <- pie_crab$latitude^2
# Create a named list of models
mods <-
  list(global = "size ~ latitude + latitude_2 + water_temp_sd + air_temp_sd + name",
       null   = "size ~ 1")

# Fit the models
fits <-
  lapply(mods,
         glm, family = gaussian(link = "log"), data = pie_crab)

# # Get model fit summaries
# lapply(fits, summary)
Carefully look at the code chunk above. You will see I’m doing two things:

- Squaring latitude to create a latitude_2 term, so the global model can capture a quadratic (nonlinear) relationship between body size and latitude.
- Storing the model formulas in a named list and fitting them all in a single lapply() call, using a Gaussian GLM with a log link.

I also commented out the lapply(fits, summary) call, but you’re welcome to run it if you wish!
Before we move on, we need to do a very important step: refine the “global” model by checking for collinearity. The global model is the most complicated model in our candidate model set. If we reduce collinearity below our desired threshold (e.g., all VIFs < 5) in the global model, then all simpler models should also have acceptable levels of collinearity.
NOTE: There are many procedures for doing model selection, and creating a global model to start out is just one of them. For example, you could just write out hypotheses and create models without a global model, or you could create a balanced model set (e.g., same number of parameters in each model) for model averaging purposes. The advantage of creating a global model and a null model is that you create “upper and lower bounds” of complexity to compare simpler models to. However, your ultimate choice should be determined a priori and be based on your research questions and goals.
That said, let’s check for collinearity in our global model. On your own, systematically refine the global model until all VIFs are less than 5.
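If you need a starting point, the performance package (loaded above) provides check_collinearity(), which reports VIFs for a fitted model:
# VIFs for the global model; values above ~5 suggest problematic collinearity
check_collinearity(fits$global)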
Pausing while you refine… Don’t look ahead and spoil the fun!
…
Now that you’ve got a refined model, overwrite the original global in our model set list and rerun the models:
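As a sketch, that might look like the code below. NOTE: the refined formula here is a hypothetical placeholder; substitute whatever covariates survived your own VIF refinement. (If latitude and latitude_2 were your offenders, centering latitude before squaring is one common fix.)
# Overwrite the original global model with the refined version.
# NOTE: this formula is a hypothetical example -- use your own refinement!
mods[["global"]] <- "size ~ latitude + water_temp_sd + air_temp_sd"

# Refit all models
fits <-
  lapply(mods,
         glm, family = gaussian(link = "log"), data = pie_crab)

# Confirm that all VIFs are now below 5
check_collinearity(fits$global)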
On your own, write out 2-3 additional hypotheses using the covariates in the ‘pie_crab’ dataset. Yes, literally write or type them out in words somewhere. For the sake of this example, use only the covariates in the refined global model, and don’t add any additional polynomials or interactions.
Pausing while you write…
Okay, now that you have your written hypotheses, convert those to models in R syntax and fit them! Add your models to the code below.
# Create new models for your hypotheses. Notice how you can simply create a new
# named element in your existing list. For instance, I bet you've created a
# simple 'latitude' model. Add the other models you've created on your own.
mods[["latitude"]] <- "size ~ latitude"
# And then fit the models again
fits <-
  lapply(mods,
         glm, family = gaussian(link = "log"), data = pie_crab)
# # Get model fit summaries
# lapply(fits, summary)
Now comes the part we’ve been waiting for! Create an AICc table to rank the models in your candidate pool. Run the code below, and then on your own, interpret the AICc table relative to your hypotheses IN WORDS. Pay careful attention to the delta AICc and cumulative weight columns.
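The table comes from aictab() in the AICcmodavg package (a sketch, assuming the fits list from above):
# Rank all candidate models by AICc
# (second.ord = TRUE, the default, uses AICc rather than AIC)
aictab(cand.set = fits, modnames = names(fits))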
And finally, if our a priori goal was model averaging, we can now do so. Below, we’re using conditional model averaging for all covariates. On your own, interpret the model averaged coefficients and standard errors.
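In AICcmodavg, conditional model averaging is done one parameter at a time with modavg(), which averages only over the models that contain that parameter. A sketch, assuming the continuous covariates from the hypothetical refined model above (adjust parms to match your own candidate set):
# Conditional model averaging for each covariate
parms <- c("latitude", "water_temp_sd", "air_temp_sd")
lapply(parms,
       function(p) modavg(cand.set = fits, parm = p, modnames = names(fits)))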
1. How different are the model averaged coefficients (and standard errors) from the coefficients in the top-ranked model (i.e., the model with the smallest AICc value)? In your other models?
2. Go back and create a different global model. For instance, add the ‘water_temp’ and ‘air_temp’ covariates. Now, go back through all our steps. Does this change our model rankings in the AICc table? Does this change the model averaged coefficient estimates? Does it change our conclusions?!
3. Given the results from question #2, how do you feel about this whole model selection/multi-model inference framework? What do you like? What are your misgivings?