This document explores how variation in sampling frequency and timing
may influence conclusions for trend analyses. We use seasonal metrics of
chlorophyll concentration as estimated from mixed-effects meta-analysis
analysis of GAM results as described in Beck et
al. 2022. Observational chlorophyll data are used for station 30 in
South Bay, as shown below in the chldat data object. All
source materials are here.
The primary question addressed is:
First, some setup material for the analysis:
library(tidyverse)
library(wqtrends)
library(patchwork)
library(lubridate)
library(colorspace)
library(here)
library(knitr)
opts_chunk$set(warning = FALSE, message = FALSE, echo = T)
load(file = here('data/datprc.RData'))
# chl data station 30
chldat <- datprc %>%
filter(station == 30) %>%
filter(param == 'chl')
# year groups
delim <- 3
delims <- unique(year(chldat$date)) %>%
as.numeric %>%
range %>%
{seq(.[1], .[2], length.out = delim + 1)} %>%
round(0)
# add temporal groupings to chldat
chldat <- chldat %>%
mutate(
doy = yday(date - 31),
yr = ifelse(mo == 'Jan', yr - 1, yr),
yrgrp = cut(yr, breaks = delims, include.lowest = T, right = F, dig.lab = 4),
qrt = quarter(date, fiscal_start = 2),
qrt = factor(qrt, levels = c('1', '2', '3', '4'), labels = c('FMA', 'MJJ', 'ASO', 'NDJ'))
) %>%
filter(yr > 1977)
This is the dataset that is evaluated. Note the addition of different
columns for temporal and seasonal groupings. The yr and
doy variables are offset by one month, where year starts on
February to more closely follow the water year.
head(chldat)
## date station param value doy cont_year yr mo yrgrp qrt
## 1 1978-02-16 30 chl 2.20 16 1978.126 1978 Feb [1978,1992) FMA
## 2 1978-03-14 30 chl 4.65 42 1978.197 1978 Mar [1978,1992) FMA
## 3 1978-03-20 30 chl 9.70 48 1978.214 1978 Mar [1978,1992) FMA
## 4 1978-04-05 30 chl 2.90 64 1978.258 1978 Apr [1978,1992) FMA
## 5 1978-05-24 30 chl 3.90 113 1978.392 1978 May [1978,1992) MJJ
## 6 1978-05-25 30 chl 2.10 114 1978.395 1978 May [1978,1992) MJJ
Here, it is shown how effort varied by month and seasonal groupings at station 30 for the approximate forty year period of record.
smpeff_plo <- function(datin){
toplo <- datin
p1 <- ggplot(toplo, aes(x = mo, group = yrgrp, fill = yrgrp)) +
geom_bar(position = position_dodge(), color = 'black', alpha = 0.7) +
theme_minimal() +
labs(
fill = 'Year group',
x = NULL ,
y = 'Observations'
)
p2 <- ggplot(toplo, aes(x = qrt, group = yrgrp, fill = yrgrp)) +
geom_bar(position = position_dodge(), color = 'black', alpha = 0.7) +
theme_minimal() +
labs(
fill = 'Year group',
x = NULL ,
y = 'Observations'
)
p <- p1 + p2 + plot_layout(ncol = 1, guides = 'collect')
return(p)
}
smpeff_plo(chldat)
It is clearly shown that sampling in the spring occurred more frequently earlier in the period of record, whereas more sampling in the fall occurred later in the period of record. Regardless of time period, more sampling occurs in the spring.
To evaluate an effect on trend conclusions, the observed time series can be modeled with a GAM and seasonal conclusions can be evaluated for different time periods. This can then be compared with different “down-sampled” versions of the original time series to test different scenarios. The two tested scenarios include:
The down-sampling for both scenarios is demonstrated below.
The minimum number of samples in each month, year group combination is identified across all groups. The down-sampled dataset takes a random sample for each month, year group combination that is equal in size to the minimum criteria. A function is made to use later. Note the use of a random seed, which becomes important later.
set.seed(123)
downsamp_fun1 <- function(datin){
tosmp <- datin %>%
group_by(mo, yrgrp) %>%
summarise(
cnt = n(),
.groups = 'drop'
) %>%
group_by(mo) %>%
filter(cnt == min(cnt)) %>%
ungroup() %>%
select(-yrgrp)
minmo <- min(tosmp$cnt)
out <- datin %>%
group_by(yrgrp, mo) %>%
sample_n(size = minmo) %>%
ungroup()
return(out)
}
chlsmp1 <- downsamp_fun1(chldat)
Now, evaluating the counts in each temporal category as before shows that the down-sampled time series has a more equal distribution of samples for each month.
smpeff_plo(chlsmp1)
This down-sampling scenario is similar as above, except the approximate sampling bias towards spring is more balanced between the year groups. The minimum number of samples for each month across year groups is used as the down-sampling criteria.
downsamp_fun2 <- function(datin){
tosmp <- datin %>%
group_by(mo, yrgrp) %>%
summarise(
cnt = n(),
.groups = 'drop'
) %>%
group_by(mo) %>%
filter(cnt == min(cnt)) %>%
ungroup() %>%
select(-yrgrp)
out <- datin %>%
left_join(tosmp, by = 'mo') %>%
group_by(yrgrp, mo) %>%
sample_n(unique(cnt))
return(out)
}
chlsmp2 <- downsamp_fun2(chldat)
smpeff_plo(chlsmp2)
Trends are now assessed for the original dataset and the two down-sampled datasets. The models are created and plotted first.
modorig <- anlz_gam(chldat, trans = 'log10')
modsmp1 <- anlz_gam(chlsmp1, trans = 'log10')
modsmp2 <- anlz_gam(chlsmp2, trans = 'log10')
Now plots:
show_prdseries(modorig, ylab = 'log-Chl-a') + labs(subtitle = '(a) Original data')
show_prdseries(modsmp1, ylab = 'log-Chl-a') + labs(subtitle = '(b) Down-sampled, equal month')
show_prdseries(modsmp2, ylab = 'log-Chl-a') + labs(subtitle = '(c) Down-sampled, equal year group')
We can test the trends in each quarter and year period for each model as follows.
cmptrnd_fun <- function(..., delims, doystr = c(1, 91, 182, 274), doyend = c(90, 181, 273, 364),
seas = c('FMA', 'MJJ', 'ASO', 'NDJ')){
delim <- length(delims) - 1
yrstr <- delims[1:delim]
yrend <- c(delims[2:delim] - 1, delims[delim + 1])
out <- c('modorig', 'modsmp1', 'modsmp2') %>%
crossing(
fl = .,
tibble(
doystr = doystr,
doyend = doyend
),
tibble(
yrstr = yrstr,
yrend = yrend
)
) %>%
group_by(fl) %>%
nest() %>%
mutate(
mod = purrr::map(fl, get)
) %>%
unnest(c('data')) %>%
group_by(doystr, doyend, mod, fl) %>%
nest() %>%
mutate(
avgseas = purrr::pmap(list(mod = mod, doystr = doystr, doyend = doyend), anlz_avgseason)
) %>%
unnest('data') %>%
mutate(
yrstr2 = purrr::pmap(list(yrstr, avgseas), function(yrstr, avgseas){
max(yrstr, min(avgseas$yr))
}),
yrend2 = purrr::pmap(list(yrend, avgseas), function(yrend, avgseas){
min(yrend, max(avgseas$yr))
}),
metatrnd = purrr::pmap(list(metseason = avgseas, yrstr = yrstr2, yrend = yrend2), anlz_mixmeta)
) %>%
ungroup %>%
select(-avgseas) %>%
mutate(
coefs = purrr::map(metatrnd, function(x){
summary(x)$coefficients
}),
yrcoef = purrr::map(coefs, function(x){
x[2, 1]
}),
yrcoefupr = purrr::pmap(list(coefs, yrcoef), function(coefs, yrcoef){
se <- coefs[2, 2]
yrcoef + 1.96 * se
}),
yrcoeflwr = purrr::pmap(list(coefs, yrcoef), function(coefs, yrcoef){
se <- coefs[2, 2]
yrcoef - 1.96 * se
}),
pval = purrr::map(metatrnd, function(x){
coefficients(summary(x)) %>% data.frame %>% .[2, 4]
})
) %>%
select(dataset = fl, seas = doystr, yrs = yrstr, yrcoef, yrcoefupr, yrcoeflwr, pval) %>%
mutate(
seas = factor(seas, labels = !!seas),
yrs = factor(yrs, levels = delims[1:delim], labels = levels(chldat$yrgrp)),
pval = ifelse(pval <= 0.05, 'p < 0.05', 'ns')
) %>%
unnest(c('yrcoef', 'yrcoefupr', 'yrcoeflwr')) %>%
mutate(dataset = factor(dataset, levels = c('modorig', 'modsmp1', 'modsmp2'), labels = c('original', 'equal month', 'equal year group')))
return(out)
}
cmptrnd <- cmptrnd_fun(modorig, modsmp1, modsmp2, delims = delims)
head(cmptrnd)
## # A tibble: 6 x 7
## dataset seas yrs yrcoef yrcoefupr yrcoeflwr pval
## <fct> <fct> <fct> <dbl> <dbl> <dbl> <chr>
## 1 original FMA [1978,1992) 0.0177 0.0466 -0.0112 ns
## 2 original FMA [1992,2005) 0.0379 0.0660 0.00975 p < 0.05
## 3 original FMA [2005,2019] -0.0239 -0.00423 -0.0436 p < 0.05
## 4 original MJJ [1978,1992) 0.00160 0.0299 -0.0267 ns
## 5 original MJJ [1992,2005) 0.0178 0.0320 0.00351 p < 0.05
## 6 original MJJ [2005,2019] -0.0203 -0.0103 -0.0302 p < 0.05
And plotted as such:
cmptrnd_plo <- function(datin){
toplo <- datin %>%
mutate(
seas = factor(seas,levels = rev(levels(seas)))
)
p <- ggplot(toplo, aes(y = seas, x = yrcoef, linetype = pval, shape = dataset, colour = dataset)) +
facet_grid(~ yrs, scales = 'free_x') +
geom_point(size = 2, position = position_dodge(width = 0.6), alpha = 0.8) +
geom_errorbarh(aes(xmin = yrcoeflwr, xmax = yrcoefupr), position = position_dodge(width = 0.6), height = 0) +
scale_linetype_manual('Trend significance', values = c('dashed', 'solid')) +
scale_colour_discrete_qualitative(name = 'Dataset', palette = 'Dark3') +
scale_shape(name = 'Dataset') +
geom_vline(xintercept = 0, linetype = 'dashed') +
# scale_x_continuous(limits = c(-1.14, 1.14)) +
guides(
color = guide_legend(override.aes = list(linetype = 0))
) +
theme_minimal() +
theme(
strip.background = element_blank(),
panel.background = element_rect(fill = 'gray98', colour = NA),
panel.border = element_blank(),
legend.position = 'top'
) +
labs(
y = 'Season',
x = expression(paste(log[10], " chl-a change (", italic(mu), "g ", L^-1, yr^-1, ")"))
)
return(p)
}
cmptrnd_plo(cmptrnd)
The results show that trend estimates are similar, but can vary quite a bit depending on sample frequency.
Because the random sample varies for each down-sampled subset, the analysis is repeated. If the random sample has no effect, the results in the below plots should look similar to the previous plots.
The dataset is down-sampled using a different random seed and the models are created again.
set.seed(456)
chlsmp1 <- downsamp_fun1(chldat)
chlsmp2 <- downsamp_fun2(chldat)
modsmp1 <- anlz_gam(chlsmp1, trans = 'log10')
modsmp2 <- anlz_gam(chlsmp2, trans = 'log10')
Model predictions are shown again:
show_prdseries(modorig, ylab = 'log-Chl-a') + labs(subtitle = '(a) Original data')
show_prdseries(modsmp1, ylab = 'log-Chl-a') + labs(subtitle = '(b) Down-sampled, equal month')
show_prdseries(modsmp2, ylab = 'log-Chl-a') + labs(subtitle = '(c) Down-sampled, equal year group')
The quarterly trends are compared.
cmptrnd <- cmptrnd_fun(modorig, modsmp1, modsmp2, delims = delims)
cmptrnd_plo(cmptrnd)
The results are similar as above, with a few differences.
The above was repeated for the February to April period using five random down-sampled estimates and extracting the estimate of the maximum chlorophyll peak. This tests the hypothesis that less frequent sampling earlier in the period of record may not capture high chlorophyll events and that should be reflected in lower GAM estimates in the down-sampled data. The down-sampling assumes equal month.
# random sample label
rnds <- paste('Equal month', 1:5)
# get max seasonal estimates from random down-sampled data
ests <- tibble(chlsmp = rnds) %>%
group_by(chlsmp) %>%
nest() %>%
mutate(
data = list(chldat),
data = purrr::map(data, downsamp_fun1),
mods = purrr::map(data, anlz_gam, trans = 'log10'),
vals = purrr::map(mods, anlz_metseason, metfun = max, doystr = 1, doyend = 89, nsim = 1e4, na.rm = T)
) %>%
select(chlsmp, vals) %>%
unnest('vals')
# get max seasonal estimates from full dataset
origests <- anlz_metseason(modorig, metfun = max, doystr = 1, doyend = 89, nsim = 1e4, na.rm = T) %>%
mutate(chlsmp = 'Full dataset')
# combine all results
allests <- bind_rows(ests, origests) %>%
mutate(
col = case_when(
chlsmp == 'Full dataset' ~ 'Full dataset',
T ~ 'Down-sampled'
),
col = factor(col, levels = c('Full dataset', 'Down-sampled')),
chlsmp = factor(chlsmp, levels = c('Full dataset', rnds))
)
head(allests)
Now make the plot for comparison.
wd <- 0.5
ggplot(data = allests, aes(x = yr, y = bt_met, color = col, group = chlsmp)) +
geom_point(position = position_dodge(width = wd)) +
geom_errorbar(aes(ymin = bt_lwr, ymax = bt_upr), position = position_dodge(width = wd), width = 0) +
scale_y_log10() +
scale_color_manual(values = c('tomato1', 'black')) +
theme_bw() +
theme(
axis.title.x = element_blank(),
legend.position = 'top'
) +
labs(
y = 'Chlorophyll-a (ug/L)',
color = NULL
)
The above was repeated using 1000 random samples, then taking the average of the seasonal metrics for each random sample in each year. The average of the random samples was estimated mixed-effects meta-analysis to account for the uncertainty of each seasonal metric for each random sample. The mixed-effects meta-analysis model was used to calculate an intercept only model.
The random samples and their seasonal metrics take about 3 hours to run, but the code is shown here for reproducibility.
rnds <- paste('Equal month', 1:1000)
# get max seasonal estimates from random down-sampled data
str <- Sys.time()
smpests <- tibble(chlsmp = rnds) %>%
group_by(chlsmp) %>%
nest() %>%
mutate(
data = list(chldat),
data = purrr::map(data, downsamp_fun1),
mods = purrr::map(data, anlz_gam, trans = 'log10'),
vals = purrr::map(mods, anlz_metseason, metfun = max, doystr = 1, doyend = 89, nsim = 1e4, na.rm = T)
) %>%
select(chlsmp, vals) %>%
unnest('vals')
print(Sys.time() - str)
save(smpests, file = here('data/smpests.RData'))
load(file = here('data/smpests.RData'))
cmb <- smpests %>%
group_by(yr) %>%
nest() %>%
mutate(
mixmet = purrr::map(data, function(x) mixmeta::mixmeta(met ~ 1, S = se, random = ~1|chlsmp, data = x, method = 'reml')),
met = purrr::map(mixmet, function(x) summary(x)$coefficients[1]),
se = purrr::map(mixmet, function(x) summary(x)$coefficients[2])
) %>%
select(yr, met, se) %>%
unnest(cols = c('met', 'se')) %>%
mutate(
bt_lwr = 10^(met - 1.96 * se),
bt_upr = 10^(met + 1.96 * se),
bt_met = 10^met,
chlsmp = '1000 sim avg'
)
# combine all results
allests <- bind_rows(cmb, origests) %>%
mutate(chlsmp = factor(chlsmp, levels = c('Full dataset', '1000 sim avg')))
wd <- 0.5
ggplot(data = allests, aes(x = yr, y = bt_met, color = chlsmp, group = chlsmp)) +
geom_point(position = position_dodge(width = wd)) +
geom_errorbar(aes(ymin = bt_lwr, ymax = bt_upr), position = position_dodge(width = wd), width = 0) +
scale_y_log10() +
scale_color_manual(values = c('tomato1', 'black')) +
theme_bw() +
theme(
axis.title.x = element_blank(),
legend.position = 'top'
) +
labs(
y = 'Chlorophyll-a (ug/L)',
color = NULL
)