Exploratory analysis of sampling effect on trend estimates in South Bay

Setup and background

This document explores how variation in sampling frequency and timing may influence conclusions for trend analyses. We use seasonal metrics of chlorophyll concentration as estimated from mixed-effects meta-analysis analysis of GAM results as described in Beck et al. 2022. Observational chlorophyll data are used for station 30 in South Bay, as shown below in the chldat data object. All source materials are here.

The primary question addressed is:

How does variation in sample effort affect conclusions on trends, given that effort varied by month and season for the duration of the time series?

First, some setup material for the analysis:

library(tidyverse)
library(wqtrends)
library(patchwork)
library(lubridate)
library(colorspace)
library(here)
library(knitr)

opts_chunk$set(warning = FALSE, message = FALSE, echo = T)

load(file = here('data/datprc.RData'))

# chl data station 30
chldat <- datprc %>% 
  filter(station == 30) %>% 
  filter(param == 'chl')

# year groups
delim <- 3
delims <- unique(year(chldat$date)) %>% 
  as.numeric %>% 
  range %>% 
  {seq(.[1], .[2], length.out = delim + 1)} %>% 
  round(0)

# add temporal groupings to chldat 
chldat <- chldat %>% 
  mutate(
    doy = yday(date - 31),
    yr = ifelse(mo == 'Jan', yr - 1, yr),
    yrgrp = cut(yr, breaks = delims, include.lowest = T, right = F, dig.lab = 4), 
    qrt = quarter(date, fiscal_start = 2),
    qrt = factor(qrt, levels = c('1', '2', '3', '4'), labels = c('FMA', 'MJJ', 'ASO', 'NDJ'))
  ) %>% 
  filter(yr > 1977)

This is the dataset that is evaluated. Note the addition of different columns for temporal and seasonal groupings. The yr and doy variables are offset by one month, where year starts on February to more closely follow the water year.

head(chldat)

##         date station param value doy cont_year   yr  mo       yrgrp qrt
## 1 1978-02-16      30   chl  2.20  16  1978.126 1978 Feb [1978,1992) FMA
## 2 1978-03-14      30   chl  4.65  42  1978.197 1978 Mar [1978,1992) FMA
## 3 1978-03-20      30   chl  9.70  48  1978.214 1978 Mar [1978,1992) FMA
## 4 1978-04-05      30   chl  2.90  64  1978.258 1978 Apr [1978,1992) FMA
## 5 1978-05-24      30   chl  3.90 113  1978.392 1978 May [1978,1992) MJJ
## 6 1978-05-25      30   chl  2.10 114  1978.395 1978 May [1978,1992) MJJ

Down-sampling methods

Here, it is shown how effort varied by month and seasonal groupings at station 30 for the approximate forty year period of record.

smpeff_plo <- function(datin){

  toplo <- datin 
  
  p1 <- ggplot(toplo, aes(x = mo, group = yrgrp, fill = yrgrp)) + 
    geom_bar(position = position_dodge(), color = 'black', alpha = 0.7) + 
    theme_minimal() + 
    labs(
      fill = 'Year group', 
      x = NULL ,
      y = 'Observations'
    )
  
  p2 <- ggplot(toplo, aes(x = qrt, group = yrgrp, fill = yrgrp)) + 
    geom_bar(position = position_dodge(), color = 'black', alpha = 0.7) + 
    theme_minimal() + 
    labs(
      fill = 'Year group', 
      x = NULL ,
      y = 'Observations'
    )
  
  p <- p1 + p2 + plot_layout(ncol = 1, guides = 'collect')
  
  return(p)
  
}

smpeff_plo(chldat)

It is clearly shown that sampling in the spring occurred more frequently earlier in the period of record, whereas more sampling in the fall occurred later in the period of record. Regardless of time period, more sampling occurs in the spring.

To evaluate an effect on trend conclusions, the observed time series can be modeled with a GAM and seasonal conclusions can be evaluated for different time periods. This can then be compared with different “down-sampled” versions of the original time series to test different scenarios. The two tested scenarios include:

Equal sampling between months
Equal sampling between year groups

The down-sampling for both scenarios is demonstrated below.

Scenario 1: equal sampling between months

The minimum number of samples in each month, year group combination is identified across all groups. The down-sampled dataset takes a random sample for each month, year group combination that is equal in size to the minimum criteria. A function is made to use later. Note the use of a random seed, which becomes important later.

set.seed(123)
downsamp_fun1 <- function(datin){
  
  tosmp <- datin %>% 
    group_by(mo, yrgrp) %>% 
    summarise(
      cnt = n(), 
      .groups = 'drop'
    ) %>% 
    group_by(mo) %>% 
    filter(cnt == min(cnt)) %>% 
    ungroup() %>% 
    select(-yrgrp)
  
  minmo <- min(tosmp$cnt)
  
  out <- datin %>%
    group_by(yrgrp, mo) %>% 
    sample_n(size = minmo) %>% 
    ungroup()
  
  return(out)  
  
}

chlsmp1 <- downsamp_fun1(chldat)

Now, evaluating the counts in each temporal category as before shows that the down-sampled time series has a more equal distribution of samples for each month.

smpeff_plo(chlsmp1)

Scenario 2: equal sampling across year groups

This down-sampling scenario is similar as above, except the approximate sampling bias towards spring is more balanced between the year groups. The minimum number of samples for each month across year groups is used as the down-sampling criteria.

downsamp_fun2 <- function(datin){

  tosmp <- datin %>% 
    group_by(mo, yrgrp) %>% 
    summarise(
      cnt = n(), 
      .groups = 'drop'
    ) %>% 
    group_by(mo) %>% 
    filter(cnt == min(cnt)) %>% 
    ungroup() %>% 
    select(-yrgrp)
  
  out <- datin %>%
    left_join(tosmp, by = 'mo') %>% 
    group_by(yrgrp, mo) %>% 
    sample_n(unique(cnt))  
  
  return(out)
  
}

chlsmp2 <- downsamp_fun2(chldat)

smpeff_plo(chlsmp2)

Trends on original and down-sampled

Trends are now assessed for the original dataset and the two down-sampled datasets. The models are created and plotted first.

modorig <- anlz_gam(chldat, trans = 'log10')
modsmp1 <- anlz_gam(chlsmp1, trans = 'log10')
modsmp2 <- anlz_gam(chlsmp2, trans = 'log10')

Now plots:

show_prdseries(modorig, ylab = 'log-Chl-a') + labs(subtitle = '(a) Original data')

show_prdseries(modsmp1, ylab = 'log-Chl-a') + labs(subtitle = '(b) Down-sampled, equal month')

show_prdseries(modsmp2, ylab = 'log-Chl-a') + labs(subtitle = '(c) Down-sampled, equal year group')

We can test the trends in each quarter and year period for each model as follows.

cmptrnd_fun <- function(..., delims, doystr = c(1, 91, 182, 274), doyend = c(90, 181, 273, 364), 
                        seas = c('FMA', 'MJJ', 'ASO', 'NDJ')){

  delim <- length(delims) - 1
  yrstr <- delims[1:delim]
  yrend <- c(delims[2:delim] - 1, delims[delim + 1])
  
  out <- c('modorig', 'modsmp1', 'modsmp2') %>% 
    crossing(
      fl = ., 
      tibble(
        doystr = doystr,
        doyend = doyend
      ),
      tibble(
        yrstr = yrstr,
        yrend = yrend
      )
    ) %>% 
    group_by(fl) %>% 
    nest() %>% 
    mutate(
      mod = purrr::map(fl, get)
    ) %>% 
    unnest(c('data')) %>% 
    group_by(doystr, doyend, mod, fl) %>% 
    nest() %>% 
    mutate(
      avgseas = purrr::pmap(list(mod = mod, doystr = doystr, doyend = doyend), anlz_avgseason)
    ) %>% 
    unnest('data') %>% 
    mutate(
      yrstr2 = purrr::pmap(list(yrstr, avgseas), function(yrstr, avgseas){
        max(yrstr, min(avgseas$yr))
      }),
      yrend2 = purrr::pmap(list(yrend, avgseas), function(yrend, avgseas){
        min(yrend, max(avgseas$yr))
      }),
      metatrnd = purrr::pmap(list(metseason = avgseas, yrstr = yrstr2, yrend = yrend2), anlz_mixmeta)
    ) %>% 
    ungroup %>% 
    select(-avgseas) %>% 
    mutate(
      coefs = purrr::map(metatrnd, function(x){
        summary(x)$coefficients
      }),
      yrcoef = purrr::map(coefs, function(x){
        x[2, 1]
      }),
      yrcoefupr = purrr::pmap(list(coefs, yrcoef), function(coefs, yrcoef){
        se <- coefs[2, 2]
        yrcoef + 1.96 * se
      }),
      yrcoeflwr = purrr::pmap(list(coefs, yrcoef), function(coefs, yrcoef){
        se <- coefs[2, 2]
        yrcoef - 1.96 * se
      }),
      pval = purrr::map(metatrnd, function(x){
        coefficients(summary(x)) %>% data.frame %>% .[2, 4]
      })
    ) %>% 
    select(dataset = fl, seas = doystr, yrs = yrstr, yrcoef, yrcoefupr, yrcoeflwr, pval) %>% 
    mutate(
      seas = factor(seas, labels = !!seas),
      yrs = factor(yrs, levels = delims[1:delim], labels = levels(chldat$yrgrp)),
      pval = ifelse(pval <= 0.05, 'p < 0.05', 'ns')
    ) %>% 
    unnest(c('yrcoef', 'yrcoefupr', 'yrcoeflwr')) %>% 
    mutate(dataset = factor(dataset, levels = c('modorig', 'modsmp1', 'modsmp2'), labels = c('original', 'equal month', 'equal year group')))

  return(out)

}

cmptrnd <- cmptrnd_fun(modorig, modsmp1, modsmp2, delims = delims)
head(cmptrnd)

## # A tibble: 6 x 7
##   dataset  seas  yrs           yrcoef yrcoefupr yrcoeflwr pval    
##   <fct>    <fct> <fct>          <dbl>     <dbl>     <dbl> <chr>   
## 1 original FMA   [1978,1992)  0.0177    0.0466   -0.0112  ns      
## 2 original FMA   [1992,2005)  0.0379    0.0660    0.00975 p < 0.05
## 3 original FMA   [2005,2019] -0.0239   -0.00423  -0.0436  p < 0.05
## 4 original MJJ   [1978,1992)  0.00160   0.0299   -0.0267  ns      
## 5 original MJJ   [1992,2005)  0.0178    0.0320    0.00351 p < 0.05
## 6 original MJJ   [2005,2019] -0.0203   -0.0103   -0.0302  p < 0.05

And plotted as such:

cmptrnd_plo <- function(datin){
  
  toplo <- datin %>% 
    mutate(
      seas = factor(seas,levels = rev(levels(seas)))
    )
  
  p <- ggplot(toplo, aes(y = seas, x = yrcoef, linetype = pval, shape = dataset, colour = dataset)) + 
    facet_grid(~ yrs, scales = 'free_x') + 
    geom_point(size = 2, position = position_dodge(width = 0.6), alpha = 0.8) + 
    geom_errorbarh(aes(xmin = yrcoeflwr, xmax = yrcoefupr), position = position_dodge(width = 0.6), height = 0) +
    scale_linetype_manual('Trend significance', values = c('dashed', 'solid')) + 
    scale_colour_discrete_qualitative(name = 'Dataset', palette = 'Dark3') +
    scale_shape(name = 'Dataset') + 
    geom_vline(xintercept = 0, linetype = 'dashed') + 
    # scale_x_continuous(limits = c(-1.14, 1.14)) +
    guides(
      color = guide_legend(override.aes = list(linetype = 0))
    ) +
    theme_minimal() + 
    theme(
      strip.background = element_blank(),
      panel.background = element_rect(fill = 'gray98', colour = NA), 
      panel.border = element_blank(), 
      legend.position = 'top'
    ) +
    labs(
      y = 'Season', 
      x = expression(paste(log[10], " chl-a change (", italic(mu), "g ", L^-1, yr^-1, ")"))
    )
  
  return(p)
  
}

cmptrnd_plo(cmptrnd)

The results show that trend estimates are similar, but can vary quite a bit depending on sample frequency.

Trends (again) on original and down-sampled

Because the random sample varies for each down-sampled subset, the analysis is repeated. If the random sample has no effect, the results in the below plots should look similar to the previous plots.

The dataset is down-sampled using a different random seed and the models are created again.

set.seed(456)
chlsmp1 <- downsamp_fun1(chldat)
chlsmp2 <- downsamp_fun2(chldat)
modsmp1 <- anlz_gam(chlsmp1, trans = 'log10')
modsmp2 <- anlz_gam(chlsmp2, trans = 'log10')

Model predictions are shown again:

show_prdseries(modorig, ylab = 'log-Chl-a') + labs(subtitle = '(a) Original data')

show_prdseries(modsmp1, ylab = 'log-Chl-a') + labs(subtitle = '(b) Down-sampled, equal month')

show_prdseries(modsmp2, ylab = 'log-Chl-a') + labs(subtitle = '(c) Down-sampled, equal year group')

The quarterly trends are compared.

cmptrnd <- cmptrnd_fun(modorig, modsmp1, modsmp2, delims = delims)
cmptrnd_plo(cmptrnd)

The results are similar as above, with a few differences.

Evaluate effect on maximum peak estimates

The above was repeated for the February to April period using five random down-sampled estimates and extracting the estimate of the maximum chlorophyll peak. This tests the hypothesis that less frequent sampling earlier in the period of record may not capture high chlorophyll events and that should be reflected in lower GAM estimates in the down-sampled data. The down-sampling assumes equal month.

# random sample label
rnds <- paste('Equal month', 1:5)

# get max seasonal estimates from random down-sampled data
ests <- tibble(chlsmp = rnds) %>% 
  group_by(chlsmp) %>% 
  nest() %>% 
  mutate(
    data = list(chldat), 
    data = purrr::map(data, downsamp_fun1), 
    mods = purrr::map(data, anlz_gam, trans = 'log10'),
    vals = purrr::map(mods, anlz_metseason, metfun = max, doystr = 1, doyend = 89, nsim = 1e4, na.rm = T)
  ) %>% 
  select(chlsmp, vals) %>% 
  unnest('vals')

# get max seasonal estimates from full dataset
origests <- anlz_metseason(modorig, metfun = max, doystr = 1, doyend = 89, nsim = 1e4, na.rm = T) %>% 
  mutate(chlsmp = 'Full dataset')

# combine all results
allests <- bind_rows(ests, origests) %>% 
  mutate(
    col = case_when(
      chlsmp == 'Full dataset' ~ 'Full dataset', 
      T ~ 'Down-sampled'
    ), 
    col = factor(col, levels = c('Full dataset', 'Down-sampled')),
    chlsmp = factor(chlsmp, levels = c('Full dataset', rnds))
  )

head(allests)

Now make the plot for comparison.

wd <- 0.5
ggplot(data = allests, aes(x = yr, y = bt_met, color = col, group = chlsmp)) + 
  geom_point(position = position_dodge(width = wd)) +
  geom_errorbar(aes(ymin = bt_lwr, ymax = bt_upr), position = position_dodge(width = wd), width = 0) +
  scale_y_log10() +
  scale_color_manual(values = c('tomato1', 'black')) +
  theme_bw() + 
  theme(
    axis.title.x = element_blank(), 
    legend.position = 'top'
  ) + 
  labs(
    y = 'Chlorophyll-a (ug/L)', 
    color = NULL
  )

The above was repeated using 1000 random samples, then taking the average of the seasonal metrics for each random sample in each year. The average of the random samples was estimated mixed-effects meta-analysis to account for the uncertainty of each seasonal metric for each random sample. The mixed-effects meta-analysis model was used to calculate an intercept only model.

The random samples and their seasonal metrics take about 3 hours to run, but the code is shown here for reproducibility.

rnds <- paste('Equal month', 1:1000)

# get max seasonal estimates from random down-sampled data
str <- Sys.time()
smpests <- tibble(chlsmp = rnds) %>% 
  group_by(chlsmp) %>% 
  nest() %>% 
  mutate(
    data = list(chldat), 
    data = purrr::map(data, downsamp_fun1), 
    mods = purrr::map(data, anlz_gam, trans = 'log10'),
    vals = purrr::map(mods, anlz_metseason, metfun = max, doystr = 1, doyend = 89, nsim = 1e4, na.rm = T)
  ) %>% 
  select(chlsmp, vals) %>% 
  unnest('vals')
print(Sys.time() - str)

save(smpests, file = here('data/smpests.RData'))

load(file = here('data/smpests.RData'))

cmb <- smpests %>%
  group_by(yr) %>%
  nest() %>% 
  mutate(
    mixmet = purrr::map(data, function(x) mixmeta::mixmeta(met ~ 1, S = se, random = ~1|chlsmp, data = x, method = 'reml')), 
    met = purrr::map(mixmet, function(x) summary(x)$coefficients[1]),
    se = purrr::map(mixmet, function(x) summary(x)$coefficients[2])
  ) %>% 
  select(yr, met, se) %>% 
  unnest(cols = c('met', 'se')) %>% 
  mutate(
    bt_lwr = 10^(met - 1.96 * se),
    bt_upr = 10^(met + 1.96 * se),
    bt_met = 10^met,
    chlsmp = '1000 sim avg'
  )
  
# combine all results
allests <- bind_rows(cmb, origests) %>% 
  mutate(chlsmp = factor(chlsmp, levels = c('Full dataset', '1000 sim avg')))

wd <- 0.5
ggplot(data = allests, aes(x = yr, y = bt_met, color = chlsmp, group = chlsmp)) + 
  geom_point(position = position_dodge(width = wd)) +
  geom_errorbar(aes(ymin = bt_lwr, ymax = bt_upr), position = position_dodge(width = wd), width = 0) +
  scale_y_log10() +
  scale_color_manual(values = c('tomato1', 'black')) +
  theme_bw() + 
  theme(
    axis.title.x = element_blank(), 
    legend.position = 'top'
  ) + 
  labs(
    y = 'Chlorophyll-a (ug/L)', 
    color = NULL
  )