Contents

Overview

Data Processing

Linear Models

Generalized Additive Models (GAMs)

References


Overview

The current management paradigm for Old Tampa Bay (OTB) holds that (1) reducing external nitrogen loads will reduce chlorophyll a concentrations in the OTB water column, (2) reducing chlorophyll a concentrations will improve water clarity, and (3) improved water clarity will stimulate seagrass growth/recovery. The management paradigm is based on earlier studies that reported significant correlations between each subsequent pair of covariates, considering data from across Tampa Bay as a whole and/or from Tampa Bay segments including OTB (e.g., Janicki & Wade 1996; Morrison et al. 1996; Greening & Janicki 2016).

This document explores statistical relationships between TN loads and chlorophyll-a concentrations and between chlorophyll-a concentrations and Secchi depth at various OTB sub-segments using monthly averaged data between approximately 2000 and 2023 (periods of record vary by data source and variable). The analyses were implemented in R version 4.3.1 (R Core Team 2023).

The results of linear models indicate correlations at each OTB sub-segment:

The results of generalized additive models (GAMs) indicated that the TN-chlorophyll and chlorophyll-Secchi relationships varied substantially over time at some OTB sub-segments. We conclude that correlation analyses and simple regression models do not reliably capture the potentially synergistic effects of multiple stressors influencing water quality and seagrass abundance over time.

top


Data Processing

Water quality and external loading data were assembled by the ../R/data_proc.R script. This section processes these data to support the subsequent analyses, including a few basic QA/QC checks.

We begin by loading packages (see References).

knitr::opts_chunk$set(echo=TRUE, message=FALSE, warning=FALSE)
library(lubridate)
library(plyr)
library(dplyr)
library(tidyr)
library(mgcv)

Environmental Protection Commission of Hillsborough County (EPC) water quality data

This section loads the EPC water quality data from file (../data/epcwq.RData), pivots the dataframe from wide to long format, performs QC checks, aggregates the data to a monthly timeframe (monthly means), and creates a new epcwq2 dataframe for analysis. As indicated in output below, a substantial number of Secchi records indicate that reported Secchi depth equals the total depth (i.e., the Secchi depth was visible at bottom); in these cases, the Secchi depth is not necessarily a reliable proxy for water clarity (e.g., in shallow areas).

  # load data
    load( here::here("data/epcwq.RData" ))
  # dates
    epcwq1 <- epcwq |> as.data.frame()
    epcwq1$date <- epcwq1$SampleTime |> as.Date("%Y-%m-%d")
  # subset columns (data and 'Q' columns must be listed in consecutive pairs!)
    epcwq1 <- select( epcwq1, date, StationNumber, SampleDepth,
                              Total_Nitrogen, Total_NitrogenQ,
                              Chlorophyll_a, Chlorophyll_aQ,
                              SecchiDepth, Secchi_Q,
                              Total_Suspended_Solids, Total_Suspended_SolidsQ,
                              Turbidity, TurbidityQ )
  # subset OTB stations
    epcwq1 <- epcwq1[ which( epcwq1$StationNumber %in% c(36,  38,  40,  41,  42,  46,  47,
                                                         50,  51,  60,  61,  62,  63,  64,
                                                         65,  66,  67,  68) ), ]
  # subset by date
    epcwq1 <- epcwq1[ which( year(epcwq1$date) >= 2000 ), ]
  # loop through data and QC columns to pivot from wide to long format
    datacols <- seq( 4, 12, 2 )
    epcwq2 <- data.frame( matrix(ncol=7,nrow=0) )
    for( i in datacols ){
      temp <- epcwq1[ ,1:3 ]
      temp$Analyte <- colnames(epcwq1)[i]
      temp$value <- epcwq1[,i] |> as.numeric()
      temp$unit <- NA
      temp$QA <- epcwq1[,i+1]
      temp <- temp[ complete.cases( temp$value ), ]
      epcwq2 <- rbind( epcwq2, temp )
    }
  # standardize parameter names
    param.names.old <- epcwq2$Analyte |> unique() |> sort()
    param.names.new <- c( "Chla", "Secchi", "TN", "TSS", "Turbidity" )
    epcwq2$param <- mapvalues( epcwq2$Analyte, param.names.old, param.names.new )
  # units
    units <- c( "ug/l", "m", "mg/l", "mg/l", "NTU" )  # listed in same order as param.names.new, above
    epcwq2$unit <- mapvalues( epcwq2$param, param.names.new, units )
  # qualifier codes
    # replace "NULL" with NA
    epcwq2$QA[ which(epcwq2$QA=="NULL") ] <- NA
    # specify codes
    fatal.codes <- c("*","?","A","B","G","H","J","K",
                     "L","N","O","Q","T","V","Y","Z")
    # define function to locate fatal records
    find.fatal <- function( QUALIFIER, FATAL=fatal.codes ){  # QUALIFER arg is a string
      input.code <- QUALIFIER |> strsplit(split='') |>
        unlist() |> toupper()  # parse string into single characters
      fatal <- input.code %in% FATAL |> any()  # check if any characters match FATAL
      return( fatal )  # return TRUE or FALSE
    }  # // end find.fatal()
    # apply function to locate fatal records
    rm.fatal.idx <- apply( matrix(epcwq2$QA,ncol=1), 1, find.fatal ) |> which()
    epcwq2 <- epcwq2[ -rm.fatal.idx, ]
    # acknowledge Secchi records visible on bottom
    epcwq2[ which( epcwq2$param=="Secchi" & epcwq2$QA==">"), ] |> nrow() # number of VOB Secchi records
## [1] 1333
    epcwq2[ which( epcwq2$param=="Secchi"), ] |> nrow()  # total number of Secchi records
## [1] 5092
  # coerce depth to numeric
    epcwq2$SampleDepth <- epcwq2$SampleDepth |> as.numeric()
  # check for duplicates (confirm FALSE)
    epcwq2 |> duplicated() |> any()
## [1] FALSE
  # subset and rename columns
    epcwq2 <- epcwq2[, c("date","param","value","unit","StationNumber") ]
    colnames( epcwq2 ) <- c("date","param", "value", "unit", "site" )
  # aggregate data to monthly timeframe
    epcwq2 <- epcwq2 |> group_by(date,param,site,unit) |> summarise(value=mean(value)) |> as.data.frame()
    epcwq2$date <- epcwq2$date |> floor_date('month')

The dataset contains 19,602 rows.

save(epcwq2, file = here::here('data/epcwq2.RData'))
head( epcwq2 )
##         date param site unit value
## 1 2000-01-01  Chla   36 ug/l 3.563
## 2 2000-01-01  Chla   38 ug/l 3.161
## 3 2000-01-01  Chla   40 ug/l 2.221
## 4 2000-01-01  Chla   41 ug/l 2.053
## 5 2000-01-01  Chla   46 ug/l 3.255
## 6 2000-01-01  Chla   47 ug/l 2.911

top


Pinellas County water quality data

Pinellas County collects water quality data at five strata along the western boundary of OTB:

  • E1: Safety Harbor / Mobbly Bay
  • E2: Largo Inlet
  • E3: Feather Sound
  • E4: Gateway
  • E5: Weedon Island

The code below loads the raw data from file (../data-raw/DataDownload_2999342_row.csv), performs QC checks, standardizes column names, aggregates the data to a monthly timeframe (monthly means), and creates a new pcwq2 dataframe for analysis.

  # load data
    pcwq1 <- read.csv( here::here("data-raw/DataDownload_2999342_row.csv" ))[,1:21]
  # dates
    pcwq1$Date <- pcwq1$SampleDate |> as.Date("%m/%d/%Y")
  # qualifier codes
    # specify codes
    fatal.codes <- c("*","?","A","B","G","H","J","K",
                     "L","N","O","Q","T","V","Y","Z")
    # define function to locate fatal records
    find.fatal <- function( QUALIFIER, FATAL=fatal.codes ){  # QUALIFER arg is a string
      input.code <- QUALIFIER |> strsplit(split='') |>
                      unlist() |> toupper()  # parse string into single characters
      fatal <- input.code %in% FATAL |> any()  # check if any characters match FATAL
      return( fatal )  # return TRUE or FALSE
    }  # // end find.fatal()
    # apply function to locate fatal records
    rm.fatal.idx <- apply( matrix(pcwq1$QACode,ncol=1), 1, find.fatal ) |> which()
    pcwq1 <- pcwq1[ -rm.fatal.idx, ]
    # parameter names
    param.names.old <- pcwq1$Parameter |> unique() |> sort()
    param.names.new <- c( "BOD", "Chla", "ChlaC", "Chlb", "Chlc", "DO", "DO_Sat",
                          "NOx", "Salinity", "TempW", "TKN", "TN", "TSS", "Turbidity" )
    pcwq1$Analyte <- mapvalues( pcwq1$Parameter, param.names.old, param.names.new )
    # result units
    pcwq1$Unit <- lapply( pcwq1$Parameter, function(x) strsplit(x,"_")[[1]][2] ) |> unlist()
    # check for duplicates (confirm FALSE)
    pcwq1 |> duplicated() |> any()
## [1] FALSE
    # subset and rename columns
    pcwq2 <- pcwq1[, c("Date","Analyte","Result_Value","Unit","StationID") ]
    colnames( pcwq2 ) <- c("date","param", "value", "unit", "site" )
    # aggregate data to monthly timeframe
    pcwq2 <- pcwq2 |> group_by(date,param,site,unit) |> summarise(value=mean(value)) |> as.data.frame()
    pcwq2$date <- pcwq2$date |> floor_date('month')

The dataset contains 11,695 rows.

head( pcwq2 )
##         date param  site unit value
## 1 2000-01-01   BOD 62-01  mgl  1.00
## 2 2000-01-01  Chla 62-01  ugl  2.60
## 3 2000-01-01  Chlb 62-01  ugl  0.50
## 4 2000-01-01  Chlc 62-01  ugl  0.70
## 5 2000-01-01    DO 62-01  mgl  6.38
## 6 2000-01-01   NOx 62-01  ugl 20.00

top


Tampa Bay Estuary Program nitrogen load estimates

This section loads the Tampa Bay Estuary Program’s monthly external loading estimates for Tampa Bay (../data-raw/TB_loads_monthly.csv), computes total monthly TN loads by summing over TN load sources to OTB (e.g., AD, NPS, DPS, etc.), standardizes column names, and creates a new loads dataframe for analysis.

  # load data
  loads <- read.csv( here::here("data-raw/TB_loads_monthly.csv" ))
  # subset TN data for OTB segment
  loads <- loads[ which( loads$bay_segment=="Old Tampa Bay"), c('year','month','tn_load') ]
  # date column
  loads$date <- as.Date( paste0(loads$year,"-",loads$month,"-",01), format="%Y-%m-%d" )
  # aggregate total TN load across sources
  loads <- loads |> group_by(date) |> summarise(tn_load=sum(tn_load)) |> as.data.frame()
  # create columns
  loads$param <- "TN load"
  loads$site <- "OTB watershed"
  loads$unit <- 'tons'
  # rename columns
  colnames( loads ) <- c('date','value','param','site','unit')
  # re-order columns
  loads <- select( loads, date, param, value, unit, site )

The dataset contains 324 rows.

head( loads )
##         date   param    value unit          site
## 1 1995-01-01 TN load 20.78062 tons OTB watershed
## 2 1995-02-01 TN load 19.80657 tons OTB watershed
## 3 1995-03-01 TN load 18.34947 tons OTB watershed
## 4 1995-04-01 TN load 19.93694 tons OTB watershed
## 5 1995-05-01 TN load 58.07647 tons OTB watershed
## 6 1995-06-01 TN load 54.70274 tons OTB watershed

top


Linear Models

Janicki & Wade (1996) developed a linear regression model relating monthly average chlorophyll-a concentrations to cumulative TN loads over a three-month period (the concurrent month and two previous months). The model used data from across Tampa Bay during 1985–1991 and stratified the data by the month of the year and by bay segment, and it produced an R2 value of 0.73. Subsequently, Greening et al. (2014) developed a similar regression model for the 1985–2011 period and observed an R2 value of 0.66.

Nitrogen loads and chlorophyll (3-month cumulative loads)

We apply a similar approach as Janciki & Wade (1996) to explore relationships between monthly average chlorophyll-a concentrations and TN loads at three OTB sub-segments during the 2000-2021 period. For simplicity, the models below do not stratify by the month of the year, in contrast to the model of Janicki & Wade (1996).

We begin by assembling the TN load and chlorophyll-a data from the three OTB sub-segments compartmentalized by the Courtney Campbell, Howard Frankland, and Gandy Blvd causeways/bridges, using water quality data collected by EPC at fixed monitoring locations between 2000 and 2021. The three resulting datasets – epc1, epc2, and epc3 – are arranged in wide format and include the columns date, chla, TN_load (monthly load values), and TN_load_3mo (cumulative 3-month loads).

# Subset and join load and chl-a data for OTB sub-segments
  # Sub-segment north of Courtney Campbell
  epc1 <- epcwq2[ which( epcwq2$site %in% c(46,47,60,61,62,64) ), ]  # subset EPC data by station
  epc1 <- epc1[which(epc1$param=="Chla"),] |> group_by(date) |>  # aggregate chl data to monthly means
            summarise(value=mean(value)) |> as.data.frame()
  epc1 <- left_join( epc1, loads[,c('date','value')], by='date' )  # value.x is chl; value.y is load
  colnames(epc1) <- c("date","chla","TN_load")
  # Sub-segment between Courtney Campbell and Frankland
  epc2 <- epcwq2[ which( epcwq2$site %in% c(40,41,42,63,65,66) ), ]
  epc2 <- epc2[which(epc2$param=="Chla"),] |> group_by(date) |> 
          summarise(value=mean(value)) |> as.data.frame()
  epc2 <- left_join( epc2, loads[,c('date','value')], by='date' ) 
  colnames(epc2) <- c("date","chla","TN_load")
  # Sub-segment between Frankland and Gandy
  epc3 <- epcwq2[ which( epcwq2$site %in% c(38,50,51,67) ), ]
  epc3 <- epc3[which(epc3$param=="Chla"),] |> group_by(date) |> 
        summarise(value=mean(value)) |> as.data.frame()
  epc3 <- left_join( epc3, loads[,c('date','value')], by='date' )
  colnames(epc3) <- c("date","chla","TN_load")
  
# Compute 3-month TN loads
  # Define function to sum load values over a specified window (`win`)
  # The function assumes there are no NAs 
  sum3mo <- function( x, win = 3 ){
    out <- rep(NA,length=(win-1))
    for( i in win:length(x) ){
      out[i] <- sum( x[i:(i-win+1)] )
    }
    return( out )
  }
  # Add a 3-month TN load column to each dataset
  epc1$TN_load_3mo <- sum3mo( epc1$TN_load )
  epc2$TN_load_3mo <- sum3mo( epc2$TN_load )
  epc3$TN_load_3mo <- sum3mo( epc3$TN_load )

The monthly sub-segment chlorophyll-a data and segment-wide TN load data are visualized below.

par( mfrow=c(2,2) )
plot( chla ~ date, data = epc1, pch = 16, col = rgb(0,0.4,0.8,0.7),
      ylim = c(0,80), ylab = "Chlorophyll-a (ug/l)", xlab = "", las = 1,
      main = "Chlorophyll-a, OTB north of Courtney Campbell" )
  lines( chla ~ date, data = epc1, col = rgb(0,0.4,0.8,0.4) )
  abline( v = seq.Date(as.Date("2000-01-01"),as.Date("2024-01-01"),'year'),
          col = rgb(0,0,0,0.2) )
  abline( h = axTicks(2), col = rgb(0,0,0,0.2) )
plot( chla ~ date, data = epc2, pch = 16, col = rgb(0,0.4,0.8,0.7),
      ylim = c(0,80), ylab = "Chlorophyll-a (ug/l)", xlab = "", las = 1,
      main = "Chlorophyll-a, OTB between Frankland & Courtney Campbell" )
  lines( chla ~ date, data = epc2, col = rgb(0,0.4,0.8,0.4) )
  abline( v = seq.Date(as.Date("2000-01-01"),as.Date("2024-01-01"),'year'),
          col = rgb(0,0,0,0.2) )
  abline( h = axTicks(2), col = rgb(0,0,0,0.2) )
plot( chla ~ date, data = epc3, pch = 16, col = rgb(0,0.4,0.8,0.7),
      ylim = c(0,80), ylab = "Chlorophyll-a (ug/l)", xlab = "", las = 1,
      main = "Chlorophyll-a, OTB between Gandy & Frankland" )
  lines( chla ~ date, data = epc3, col = rgb(0,0.4,0.8,0.4) )
  abline( v = seq.Date(as.Date("2000-01-01"),as.Date("2024-01-01"),'year'),
          col = rgb(0,0,0,0.2) )
  abline( h = axTicks(2), col = rgb(0,0,0,0.2) )
plot( TN_load ~ date, data = epc3, pch = 16, col = rgb(0.2,0.2,0.2,0.7),
      ylab = "TN load (tons)", xlab = "", las = 1,
      main = "Monthly TN load, Old Tampa Bay" )
  lines( TN_load ~ date, data = epc3, col = rgb(0.2,0.2,0.2,0.4) )
  abline( v = seq.Date(as.Date("2000-01-01"),as.Date("2024-01-01"),'year'),
          col = rgb(0,0,0,0.2) )
  abline( h = axTicks(2), col = rgb(0,0,0,0.2) )

Below, we fit linear models for chlorophyll as a function of cumulative 3-month TN loads at each sub-segment and visualize the results. The TN_chl_plot() function takes the data (dat) and model (mod) as input and produces a scatterplot of the data with the regression line. The models show significant relationships (P<0.001) and R2 values of 0.32, 0.37, and 0.46. However, these R2 values are not directly comparable to the R2=0.73 of the Janicki & Wade (1996) model, because the models differ in spatial scale (OTB sub-segments vs. Tampa Bay as a whole). Further, the residuals of each model do not appear to be homoskedastic nor Gaussian (diagnostic visualizations follow); experimentation with alternative model specifications that include a month-of-year term did not improve the model diagnostics.

# Define plotting function for data and model
  TN_chl_plot <- function( dat, mod, lab.main ){
     # 95% CI
     pred.seq <- seq( min( dat$TN_load_3mo, na.rm=TRUE ),
                      max( dat$TN_load_3mo, na.rm=TRUE ), length.out=100 )
     pred <- predict.gam( mod,
                          newdata = data.frame( TN_load_3mo = pred.seq ),
                          se.fit=TRUE )
     
     # P value
      pval <- summary(mod)$p.table[2,4]
      pcol1 <- rgb(1,0,0,0.4)
      pcol2 <- rgb(1,0,0,0.15)
      if( pval < 0.001 ){
        pstring <- "P < 0.001"
      } else if( pval < 0.01 ){
        pstring <- "P < 0.01"
      } else if( pval < 0.05 ){
        pstring <- "P < 0.05"
      } else {
        pstring <- "P > 0.05"
        pcol1 <- rgb(0.2,0.3,0.9,0.4)
        pcol2 <- rgb(0,0.1,0.6,0.15)
      }
      
      # scatterplot data
      plot( chla ~ TN_load_3mo, data = dat,
            col = rgb(0,0,0,0.6), cex = 1.4,
            main = paste( lab.main,"\n",
                          min(dat$date), " - ", max(dat$date) ),
            cex.main = 1.6, cex.lab = 1.4,
            xlim = range( c(0,dat$TN_load_3mo,pred.seq), na.rm=TRUE ),
            ylim = range( c(0,dat$chla,pred$fit-pred$se.fit*1.96,
                            pred$fit+pred$se.fit*1.96)*1.2 ),
            xlab = "3-month TN load (tons)", cex.axis = 1.4,
            ylab = "Chlorophyll-a (ug/l)", las = 1
      )
      abline( v = axTicks(1), col = rgb(0,0,0,0.2) )
      abline( h = axTicks(2), col = rgb(0,0,0,0.2) )
      abline( v = 0, col = rgb(0,0,0,0.5) )
      abline( h = 0, col = rgb(0,0,0,0.5) )
      
      # plot model fit
      lines( pred$fit ~ pred.seq, col = pcol1, lwd = 2, lty = 2 )
      polygon( x = c( pred.seq, rev(pred.seq) ),
               y = c( (pred$fit+pred$se.fit*1.96), rev((pred$fit-pred$se.fit*1.96)) ),
               col = pcol2, border = rgb(0,0,0,0) )
      # write model R2 and pval
      legend( 'top', horiz = TRUE, bty = 'n',
              border = rgb(0,0,0,0), cex = 1.4,
              legend = paste0( "N = ", nrow(dat), "   ",
                               "R2 = ", round(summary(mod)$r.sq,3),
                               "   ", pstring )
              )
  }  # // end TN_chl_plot() function

# remove incomplete cases in the data
epc1 <- epc1[ complete.cases(epc1), ]
epc2 <- epc2[ complete.cases(epc2), ]
epc3 <- epc3[ complete.cases(epc3), ]

# models
ncmod1 <- gam( chla ~ TN_load_3mo, data = epc1 )
ncmod2 <- gam( chla ~ TN_load_3mo, data = epc2 )
ncmod3 <- gam( chla ~ TN_load_3mo, data = epc3 )

# plots
par( mfrow=c(1,3) )
TN_chl_plot( epc1, ncmod1, "Chla north of CC vs. 3mo TN load (OTB)" )
TN_chl_plot( epc2, ncmod2, "Chla bw HF-CC vs. 3mo TN load (OTB)" )
TN_chl_plot( epc3, ncmod3, "Chla bw GB-HF vs. 3mo TN load (OTB)" )

The diagnostics for each of the above OTB models indicate that the assumption of homoskedastic, Gaussian residuals is not met.

par(mfrow=c(2,2))
  # North of CC
  gam.check(ncmod1,rep=1000)

## 
## Method: GCV   Optimizer: magic
## Model required no smoothing parameter selectionModel rank =  2 / 2
  # Between CC and HF
  gam.check(ncmod2,rep=1000)

## 
## Method: GCV   Optimizer: magic
## Model required no smoothing parameter selectionModel rank =  2 / 2
  # Between HF and GB
  gam.check(ncmod3,rep=1000)

## 
## Method: GCV   Optimizer: magic
## Model required no smoothing parameter selectionModel rank =  2 / 2

top


Nitrogen loads and chlorophyl (concurrent and delayed loads)

This section further explores relationships between chlorophyll-a and TN loads by considering monthly TN loading values at delays of zero through two months across various OTB sub-segments, including the EPC sub-segments and the Pinellas County strata.

The code chunk below defines the OTB_corr() function, which computes and visualizes a linear fit between a specified explanatory variable x and response variable y, after joining the datasets by date; the input data may come from two different dataframes. The function’s arguments are described in comments in the first few lines of the code chunk.

The delay argument allows the user to specify a non-negative integer time delay between the explanatory and response variables. By default, the delay=0 setting pairs response values with concurrent explanatory values (e.g., TN loads and chlorophyll concentrations during the same month); setting delay to a positive values lags the explanatory variable by the specified number of time periods. The log argument (logical) allows the user to specify whether to log10-transform x and y prior to the regression. The function outputs an x-y scatterplot that displays the linear fit with a 95% confidence band, as well as R2 and P-value estimates. The color of the regression line indicates whether the estimated slope is statistically significant (red) or not significant (gray) at alpha = 0.05. The function returns the input data and the model object.

  # define regression/plot function
  # x, y      dataframes for explanatory (x) and response (y) variables,
  #           each containing columns date, param, value, unit, 
  # param.x,  water quality parameters in x$param, y$param (strings)
  #  param.y
  # site.x,   location labels in x$site, y$site (string)
  #  site.y
  # log       logical, if TRUE log10-transform the x and y data. FALSE by default
  # delay     number of time steps to delay param.x (non-negative integer)
  OTB_corr <- function( x, y,
                        param.x, param.y,
                        site.x, site.y = site.x,
                        log = FALSE,
                        delay = 0 ){
    
    # units
    unit.x <- x$unit[ which( x$param==param.x ) ] |> unique()
    unit.y <- y$unit[ which( y$param==param.y ) ] |> unique()
    
    # assemble data for x and y in wide format
    dat.x <- x[ which( x$param==param.x & x$site==site.x ), ]
    dat.x <- dat.x |> group_by(date) |> summarise(value=mean(value)) |> as.data.frame()
    dat.y <- y[ which( y$param==param.y & y$site==site.y ), ]
    dat.y <- dat.y |> group_by(date) |> summarise(value=mean(value)) |> as.data.frame()
    dat <- dplyr::inner_join( dat.x, dat.y, by = 'date' )
    
    # log10-transform
    if( log == TRUE ){
      dat$value.x[ which( dat$value.x != 0) ] <- dat$value.x |> log10()
      dat$value.y[ which( dat$value.y != 0) ] <- dat$value.y |> log10()
      unit.x <- paste( "log10", unit.x )
      unit.y <- paste( "log10", unit.y )
    }
    
    # delay
    if( delay > 0 ){
      dateseq <- data.frame( date = seq.Date( min(dat$date), max(dat$date), 'month' ) )
      dat <- dplyr::left_join( dateseq, dat, by = 'date' )
      dat$value.x <- dplyr::lag( dat$value.x, n = delay )
      dat <- na.omit( dat )
    } else if ( delay < 0 ){
      stop("Negative delay specified.")
    }
    
    # fit linear model
    # fit model
    mod1 <- gam( value.y ~ value.x, data = dat )
    # plot fit with 95% CI
    pred.seq <- seq( min( dat$value.x ), max( dat$value.x ), length.out=100 )
    pred <- predict.gam( mod1,
                         newdata = data.frame( value.x = pred.seq ),
                         se.fit=TRUE )
    # P value
    pval <- summary(mod1)$p.table[2,4]
    pcol1 <- rgb(1,0,0,0.4)
    pcol2 <- rgb(1,0,0,0.15)
    if( pval < 0.001 ){
      pstring <- "P < 0.001"
    } else if( pval < 0.01 ){
      pstring <- "P < 0.01"
    } else if( pval < 0.05 ){
      pstring <- "P < 0.05"
    } else {
      pstring <- "P > 0.05"
      pcol1 <- rgb(0.2,0.3,0.9,0.4)
      pcol2 <- rgb(0,0.1,0.6,0.15)
    }
    
    # scatterplot data
    plot( value.y ~ value.x, data = dat,
          col = rgb(0,0,0,0.6), cex = 1.4,
          main = paste( param.y, "at", site.y, " vs. ", param.x, "at", site.x,"\n",
                        min(dat$date), " - ", max(dat$date) ),
          cex.main = 1.6, cex.lab = 1.4,
          xlim = range( c(0,dat$value.x,pred.seq) ),
          ylim = range( c(0,dat$value.y,pred$fit-pred$se.fit*1.96,
                          pred$fit+pred$se.fit*1.96)*1.2 ),
          xlab = paste0( param.x, " (", unit.x, ")" ), cex.axis = 1.4,
          ylab = paste0( param.y, " (", unit.y, ")" ), las = 1
    )
    abline( v = axTicks(1), col = rgb(0,0,0,0.2) )
    abline( h = axTicks(2), col = rgb(0,0,0,0.2) )
    abline( v = 0, col = rgb(0,0,0,0.5) )
    abline( h = 0, col = rgb(0,0,0,0.5) )
    
    # plot model fit
    lines( pred$fit ~ pred.seq, col = pcol1, lwd = 2, lty = 2 )
    polygon( x = c( pred.seq, rev(pred.seq) ),
             y = c( (pred$fit+pred$se.fit*1.96), rev((pred$fit-pred$se.fit*1.96)) ),
             col = pcol2, border = rgb(0,0,0,0) )
    # write model R2 and pval
    legend( 'top', horiz = TRUE, bty = 'n',
            border = rgb(0,0,0,0), cex = 1.4,
            legend = paste0( "N = ", nrow(dat), "   ",
                             "R2 = ", round(summary(mod1)$r.sq,3),
                             "   ", pstring, "   delay = ", delay )
    )
    
    return( list( data = dat, model = mod1 ) )
    
  }  # // end OTB_corr()

The plots below show relationships between log10-transformed external TN loads and chlorophyll-a concentrations within the three OTB sub-segments compartmentalized by the Courtney Campbell, Howard Frankland, and Gandy Blvd causeways/bridges, using water quality data collected by EPC at fixed monitoring locations between 2000 and 2021. The three OTB compartments span the three rows, and the three columns pertain to delays of zero, one, and two months. All three OTB compartments show significant positive correlations at each delay, with a maximum R2 value of 0.36. In each case the data exhibit substantial scatter (unexplained variance) around the regression line. Although not visualized below, the diagnostics for each model indicate that the residuals are homoskedastic and approximately Gaussian.

  par(mfrow=c(3,3))
  # assemble and re-label data for locations north of Courtney Campbell
  epc1 <- epcwq2[ which( epcwq2$site %in% c(46,47,60,61,62,64) ), ]
  epc1$site <- "North CC"
  # models
  epc1_load_Chla_0 <- OTB_corr( x = loads, y = epc1, 'TN load', 'Chla', 
                                'OTB watershed', 'North CC', log = TRUE, delay = 0 )
  epc1_load_Chla_1 <- OTB_corr( x = loads, y = epc1, 'TN load', 'Chla',
                                'OTB watershed', 'North CC', log = TRUE, delay = 1 )
  epc1_load_Chla_2 <- OTB_corr( x = loads, y = epc1, 'TN load', 'Chla',
                                'OTB watershed', 'North CC', log = TRUE, delay = 2 )
  
  # assemble and re-label data for locations between Frankland & Campbell
  epc2 <- epcwq2[ which( epcwq2$site %in% c(40,41,42,63,65,66) ), ]
  epc2$site <- "HF-CC"
  # models
  epc2_load_Chla_0 <- OTB_corr( x = loads, y = epc2, 'TN load', 'Chla',
                                'OTB watershed', 'HF-CC', log = TRUE, delay = 0 )
  epc2_load_Chla_1 <- OTB_corr( x = loads, y = epc2, 'TN load', 'Chla',
                                'OTB watershed', 'HF-CC', log = TRUE, delay = 1 )
  epc2_load_Chla_2 <- OTB_corr( x = loads, y = epc2, 'TN load', 'Chla',
                                'OTB watershed', 'HF-CC', log = TRUE, delay = 2 )
  
  # assemble and re-label data for locations between Gandy & Frankland
  epc3 <- epcwq2[ which( epcwq2$site %in% c(38,50,51,67) ), ]
  epc3$site <- "GB-HF"
  # models
  epc3_load_Chla_0 <- OTB_corr( x = loads, y = epc3, 'TN load', 'Chla',
                                'OTB watershed', 'GB-HF', log = TRUE, delay = 0 )
  epc3_load_Chla_1 <- OTB_corr( x = loads, y = epc3, 'TN load', 'Chla',
                                'OTB watershed', 'GB-HF', log = TRUE, delay = 1 )
  epc3_load_Chla_2 <- OTB_corr( x = loads, y = epc3, 'TN load', 'Chla',
                                'OTB watershed', 'GB-HF', log = TRUE, delay = 2 )

The plots below show relationships between log10-transformed TN loads and chlorophyll-a concentrations within Pinellas County strata E1–E5 (rows) at delays of 0–2 months (columns). During the same month (first column), all strata show significant positive correlations, with a maximum R2 value of 0.27. The data typically exhibit substantial scatter (unexplained variance) around the regression lines. Across all strata, with the exception of E1, significant relationships are also observed at delays of one month (second column) and two months (third column), although the R2 values are substantially lower. Although not visualized below, the diagnostics for each model indicate that the residuals are homoskedastic and approximately Gaussian.

  par(mfrow=c(5,3))
  E1_load_Chla_0 <- OTB_corr( x = loads, y = pcwq2, 'TN load', 'Chla', 'OTB watershed', 'E1',
                              log = TRUE, delay = 0 )
  E1_load_Chla_1 <- OTB_corr( x = loads, y = pcwq2, 'TN load', 'Chla', 'OTB watershed', 'E1',
                              log = TRUE, delay = 1 )
  E1_load_Chla_2 <- OTB_corr( x = loads, y = pcwq2, 'TN load', 'Chla', 'OTB watershed', 'E1',
                              log = TRUE, delay = 2 )
  E2_load_Chla_0 <- OTB_corr( x = loads, y = pcwq2, 'TN load', 'Chla', 'OTB watershed', 'E2',
                              log = TRUE, delay = 0 )
  E2_load_Chla_1 <- OTB_corr( x = loads, y = pcwq2, 'TN load', 'Chla', 'OTB watershed', 'E2',
                              log = TRUE, delay = 1 )
  E2_load_Chla_2 <- OTB_corr( x = loads, y = pcwq2, 'TN load', 'Chla', 'OTB watershed', 'E2',
                              log = TRUE, delay = 2 )
  E3_load_Chla_0 <- OTB_corr( x = loads, y = pcwq2, 'TN load', 'Chla', 'OTB watershed', 'E3',
                              log = TRUE, delay = 0 )
  E3_load_Chla_1 <- OTB_corr( x = loads, y = pcwq2, 'TN load', 'Chla', 'OTB watershed', 'E3',
                              log = TRUE, delay = 1 )
  E3_load_Chla_2 <- OTB_corr( x = loads, y = pcwq2, 'TN load', 'Chla', 'OTB watershed', 'E3',
                              log = TRUE, delay = 2 )
  E4_load_Chla_0 <- OTB_corr( x = loads, y = pcwq2, 'TN load', 'Chla', 'OTB watershed', 'E4',
                              log = TRUE, delay = 0 )
  E4_load_Chla_1 <- OTB_corr( x = loads, y = pcwq2, 'TN load', 'Chla', 'OTB watershed', 'E4',
                              log = TRUE, delay = 1 )
  E4_load_Chla_2 <- OTB_corr( x = loads, y = pcwq2, 'TN load', 'Chla', 'OTB watershed', 'E4',
                              log = TRUE, delay = 2 )
  E5_load_Chla_0 <- OTB_corr( x = loads, y = pcwq2, 'TN load', 'Chla', 'OTB watershed', 'E5',
                              log = TRUE, delay = 0 )
  E5_load_Chla_1 <- OTB_corr( x = loads, y = pcwq2, 'TN load', 'Chla', 'OTB watershed', 'E5',
                              log = TRUE, delay = 1 )
  E5_load_Chla_2 <- OTB_corr( x = loads, y = pcwq2, 'TN load', 'Chla', 'OTB watershed', 'E5',
                              log = TRUE, delay = 2 )

top


Chlorophyll and light

This section explores relationships between log10-transformed chlorophyll-a concentrations and Secchi depths within the three OTB sub-segments compartmentalized by the Courtney Campbell, Howard Frankland, and Gandy Blvd causeways/bridges, using data collected by EPC at fixed monitoring locations between 2000 and 2023. Following Greening & Janicki (2016), the variables are log-transformed prior to analysis. When interpreting these results, it is important to consider that the Secchi data contain a substantial number of records in which the Secchi depth equals the total depth (i.e., the Secchi disk was visible at bottom). In these cases, the reported Secchi depth is not necessarily a reliable proxy of water clarity (e.g., in shallow areas), and the presence of these records may bias the regression results.

All three OTB compartments show significant negative correlations at delays of zero and one month (first and second columns), with a maximum R2 value of 0.38. The magnitudes of the slopes and the R2 values both decrease between delay=0 and delay=1. And in each case, the data exhibit substantial scatter around the regression line, indicating that any given chlorophyll concentration is typically associated with a large range of Secchi values. At a delay of two months (third column), none of the OTB compartments shows a significant correlation. Although not visualized below, the diagnostics for each model indicate that the residuals are homoskedastic and approximately Gaussian.

  par(mfrow=c(3,3))

  # models
  epc1_Chla_Secchi_0 <- OTB_corr( x = epc1, y = epc1, 'Chla', 'Secchi',
                                'North CC', log = TRUE, delay = 0 )
  epc1_Chla_Secchi_1 <- OTB_corr( x = epc1, y = epc1, 'Chla', 'Secchi',
                                'North CC', log = TRUE, delay = 1 )
  epc1_Chla_Secchi_2 <- OTB_corr( x = epc1, y = epc1, 'Chla', 'Secchi',
                                'North CC', log = TRUE, delay = 2 )
  
  # models
  epc2_Chla_Secchi_0 <- OTB_corr( x = epc2, y = epc2, 'Chla', 'Secchi',
                                'HF-CC', log = TRUE, delay = 0 )
  epc2_Chla_Secchi_1 <- OTB_corr( x = epc2, y = epc2, 'Chla', 'Secchi',
                                'HF-CC', log = TRUE, delay = 1 )
  epc2_Chla_Secchi_2 <- OTB_corr( x = epc2, y = epc2, 'Chla', 'Secchi',
                                'HF-CC', log = TRUE, delay = 2 )
  
  # models
  epc3_Chla_Secchi_0 <- OTB_corr( x = epc3, y = epc3, 'Chla', 'Secchi',
                                'GB-HF', log = TRUE, delay = 0 )
  epc3_Chla_Secchi_1 <- OTB_corr( x = epc3, y = epc3, 'Chla', 'Secchi',
                                'GB-HF', log = TRUE, delay = 1 )
  epc3_Chla_Secchi_2 <- OTB_corr( x = epc3, y = epc3, 'Chla', 'Secchi',
                                'GB-HF', log = TRUE, delay = 2 )

top


Generalized Additive Models (GAMs)

Next, we explore whether nitrogen-chlorophyll and chlorophyll-Secchi relationships have changed over time across OTB sub-segments.


Nitrogen and chlorophyll

This section explores whether relationships between monthly averaged chlorophyll concentrations and concurrent TN loads have changed over time within the three sub-segments divided by bridges/causeways (using EPC data) and within the five Pinellas County strata (E1–E5).

North of Courtney Campbell

We assemble the EPC data and specify a GAM that estimates the chlorophyll-a concentration using smooth terms for the decimal date (decdate), TN load (value.x), and their tensor product interaction. Model diagnostics indicate that the residuals are approximately Gaussian (QQ plot) and that the model had sufficient basis dimensions (k).

  # join TN load data to chlorophyll data
  epc1 <- epcwq2[ which( epcwq2$site %in% c(46,47,60,61,62,64) ), ]  # subset EPC data by station
  epc1 <- epc1[which(epc1$param=="Chla"),] |> group_by(date) |>  # aggregate chl data to monthly means
            summarise(value=mean(value)) |> as.data.frame()
  gamdat_epc1 <- inner_join( x = epc1, y = loads, by = 'date' )
  nrow(gamdat_epc1)
## [1] 261
  # decimal date
  gamdat_epc1$decdate <- gamdat_epc1$date |> decimal_date()
  # fit gam
  gam_epc1 <- gam(  log10(value.y) ~ te( decdate, value.x, k=11 ), data = gamdat_epc1 )
  # model diagnostics
  par(mfrow=c(2,2))
  gam.check( gam_epc1, rep = 1000 )

## 
## Method: GCV   Optimizer: magic
## Smoothing parameter selection converged after 12 iterations.
## The RMS GCV score gradient at convergence was 1.781363e-06 .
## The Hessian was positive definite.
## Model rank =  121 / 121 
## 
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
## 
##                      k' edf k-index p-value
## te(decdate,value.x) 120  13    0.98    0.35
  summary( gam_epc1 )
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## log10(value.y) ~ te(decdate, value.x, k = 11)
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.50621    0.01844   81.69   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                       edf Ref.df     F p-value    
## te(decdate,value.x) 13.03   17.5 8.752  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.368   Deviance explained =   40%
## GCV = 0.093781  Scale est. = 0.08874   n = 261

The plots below visualize the estimated te() coefficient surface and suggest that the association between chlorophyll concentrations and TN load (value.x) has increased over time (decdate) within the sub-segment. In the heatmap image, colors range from red (relatively low magnitude) to yellow (higher magnitude). White patches indicate gaps in the input space.

  # plot smooth
  par(mfrow=c(1,2))
  plot.gam( gam_epc1, scheme = 1 )
  plot.gam( gam_epc1, scheme = 2 )

top

Between Howard Frankland and Courtney Campbell

The model shows good diagnostics.

  # join TN load data to chlorophyll data
  epc2 <- epcwq2[ which( epcwq2$site %in% c(40,41,42,63,65,66) ), ]
  epc2 <- epc2[which(epc2$param=="Chla"),] |> group_by(date) |> 
            summarise(value=mean(value)) |> as.data.frame()
  gamdat_epc2 <- inner_join( x = epc2, y = loads, by = 'date' )
  nrow(gamdat_epc2)
## [1] 261
  # decimal date
  gamdat_epc2$decdate <- gamdat_epc2$date |> decimal_date()
  # fit gam
  gam_epc2 <- gam(  log10(value.y) ~ te( decdate, value.x, k=11 ), data = gamdat_epc2 )
  # model diagnostics
  par(mfrow=c(2,2))
  gam.check( gam_epc2, rep = 1000 )

## 
## Method: GCV   Optimizer: magic
## Smoothing parameter selection converged after 19 iterations.
## The RMS GCV score gradient at convergence was 8.974192e-08 .
## The Hessian was positive definite.
## Model rank =  121 / 121 
## 
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
## 
##                         k'    edf k-index p-value  
## te(decdate,value.x) 120.00   5.79    0.94    0.04 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  summary( gam_epc2 )
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## log10(value.y) ~ te(decdate, value.x, k = 11)
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.50621    0.01877   80.24   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                       edf Ref.df     F p-value    
## te(decdate,value.x) 5.789   6.76 20.55  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.345   Deviance explained =   36%
## GCV = 0.094419  Scale est. = 0.091963  n = 261

The plots below suggest that the association between chlorophyll concentrations and TN load (value.x) has increased over time (decdate) within the sub-segment. In the heatmap image, colors range from red (relatively low magnitude) to yellow (higher magnitude). Comparisons across sub-segments require consideration of the te() coefficient surfaces as well as each model’s estimated intercept.

  # plot smooth
  par(mfrow=c(1,2))
  plot.gam( gam_epc2, scheme = 1 )
  plot.gam( gam_epc2, scheme = 2 )

top

Between Gandy Blvd and Howard Frankland

The model shows good diagnostics.

 # join TN load data to chlorophyll data
  epc3 <- epcwq2[ which( epcwq2$site %in% c(38,50,51,67) ), ]
  epc3 <- epc3[which(epc3$param=="Chla"),] |> group_by(date) |> 
            summarise(value=mean(value)) |> as.data.frame()
  gamdat_epc3 <- inner_join( x = epc3, y = loads, by = 'date' )
  nrow(gamdat_epc3)
## [1] 261
  # decimal date
  gamdat_epc3$decdate <- gamdat_epc3$date |> decimal_date()
  # fit gam
  gam_epc3 <- gam(  log10(value.y) ~ te( decdate, value.x, k=11 ), data = gamdat_epc3 )
  # model diagnostics
  par(mfrow=c(2,2))
  gam.check( gam_epc3, rep = 1000 )

## 
## Method: GCV   Optimizer: magic
## Smoothing parameter selection converged after 15 iterations.
## The RMS GCV score gradient at convergence was 2.640248e-06 .
## The Hessian was positive definite.
## Model rank =  121 / 121 
## 
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
## 
##                      k' edf k-index p-value
## te(decdate,value.x) 120  18    1.03    0.66
  summary( gam_epc3 )
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## log10(value.y) ~ te(decdate, value.x, k = 11)
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.50621    0.01905   79.08   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                       edf Ref.df     F p-value    
## te(decdate,value.x) 17.96  21.99 6.071  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.326   Deviance explained = 37.3%
## GCV = 0.1021  Scale est. = 0.094686  n = 261

The plots below suggest that the association between chlorophyll concentrations and TN load (value.x) has generally increased over time (decdate) within the sub-segment, but the dynamics here are a bit more complex than in the two previous sub-segments. In the heatmap image, colors range from red (relatively low magnitude) to yellow (higher magnitude). Comparisons across sub-segments require consideration of the te() coefficient surfaces as well as each model’s estimated intercept.

  # plot smooth
  par(mfrow=c(1,2))
  plot.gam( gam_epc3, scheme = 1 )
  plot.gam( gam_epc3, scheme = 2 )

top

E1 (Safety Harbor / Mobbly Bay)

The model shows good diagnostics.

  # join TN load data to chlorophyll data
  gamdat_E1 <- inner_join( x = pcwq2[ which( pcwq2$param=='Chla' & pcwq2$site=="E1" ), ],
                           y = loads,
                           by = 'date'
                          )
  nrow(gamdat_E1)
## [1] 127
  # decimal date
  gamdat_E1$decdate <- gamdat_E1$date |> decimal_date()
  # fit gam
  gam1 <- gam(  log10(value.y) ~ te( decdate, value.x, k=11 ), data = gamdat_E1 )
  # model diagnostics
  par(mfrow=c(2,2))
  gam.check( gam1, rep = 1000 )

## 
## Method: GCV   Optimizer: magic
## Smoothing parameter selection converged after 22 iterations.
## The RMS GCV score gradient at convergence was 6.398965e-08 .
## The Hessian was positive definite.
## Model rank =  121 / 121 
## 
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
## 
##                      k' edf k-index p-value
## te(decdate,value.x) 120   3    1.07     0.8
  summary( gam1 )
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## log10(value.y) ~ te(decdate, value.x, k = 11)
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.53432    0.02782   55.15   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                     edf Ref.df     F p-value    
## te(decdate,value.x)   3      3 15.54  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.257   Deviance explained = 27.5%
## GCV = 0.10149  Scale est. = 0.098294  n = 127

The plots below visualize the estimated te() coefficient surface and suggest that the association between chlorophyll concentrations and TN load (value.x) has not substantially changed over time (decdate). In the heatmap image, colors range from red (relatively low magnitude) to yellow (higher magnitude). White patches indicate gaps in the input space. Comparisons across sub-segments require consideration of the te() coefficient surfaces as well as each model’s estimated intercept.

  # plot smooth
  par(mfrow=c(1,2))
  plot.gam( gam1, scheme = 1)
  plot.gam( gam1, scheme = 2)

top

E2 (Largo Inlet)

The model shows good diagnostics.

  # join TN load data to chlorophyll data
  gamdat_E2 <- inner_join( x = pcwq2[ which( pcwq2$param=='Chla' & pcwq2$site=="E2" ), ],
                           y = loads,
                           by = 'date'
                          )
  nrow(gamdat_E2)
## [1] 132
  # decimal date
  gamdat_E2$decdate <- gamdat_E2$date |> decimal_date()
  # fit gam
  gam2 <- gam(  log10(value.y) ~ te( decdate, value.x, k=11 ), data = gamdat_E2 )
  # model diagnostics
  par(mfrow=c(2,2))
  gam.check( gam2, rep = 1000 )

## 
## Method: GCV   Optimizer: magic
## Smoothing parameter selection converged after 12 iterations.
## The RMS GCV score gradient at convergence was 1.556571e-06 .
## The Hessian was positive definite.
## Model rank =  121 / 121 
## 
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
## 
##                         k'    edf k-index p-value
## te(decdate,value.x) 120.00   7.94    1.01    0.46
  summary( gam2 )
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## log10(value.y) ~ te(decdate, value.x, k = 11)
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1.5445     0.0288   53.62   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                       edf Ref.df     F  p-value    
## te(decdate,value.x) 7.938  10.37 3.734 0.000182 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.217   Deviance explained = 26.5%
## GCV = 0.11745  Scale est. = 0.1095    n = 132

The plots below suggest that the association between chlorophyll concentrations and TN load (value.x) has changed substantially over time (decdate) at Largo Inlet, with TN loads (value.x) most strongly associated with chlorophyll concentrations between about 2010 and 2015. Comparisons across sub-segments require consideration of the te() coefficient surfaces as well as each model’s estimated intercept.

  # plot smooth
  par(mfrow=c(1,2))
  plot.gam( gam2, scheme = 1)
  plot.gam( gam2, scheme = 2)

top

E3 (Feather Sound)

The model shows good diagnostics.

  # join TN load data to chlorophyll data
  gamdat_E3 <- inner_join( x = pcwq2[ which( pcwq2$param=='Chla' & pcwq2$site=="E3" ), ],
                           y = loads,
                           by = 'date'
                          )
  nrow(gamdat_E3)
## [1] 132
  # decimal date
  gamdat_E3$decdate <- gamdat_E3$date |> decimal_date()
  # fit gam
  gam3 <- gam(  log10(value.y) ~ te( decdate, value.x, k=11 ), data = gamdat_E3 )
  # model diagnostics
  par(mfrow=c(2,2))
  gam.check( gam3, rep = 1000 )

## 
## Method: GCV   Optimizer: magic
## Smoothing parameter selection converged after 22 iterations.
## The RMS GCV score gradient at convergence was 9.880194e-08 .
## The Hessian was positive definite.
## Model rank =  121 / 121 
## 
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
## 
##                         k'    edf k-index p-value  
## te(decdate,value.x) 120.00   4.52    0.89    0.06 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  summary( gam3 )
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## log10(value.y) ~ te(decdate, value.x, k = 11)
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.52336    0.02802   54.36   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                       edf Ref.df     F  p-value    
## te(decdate,value.x) 4.518  5.194 6.702 1.18e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.197   Deviance explained = 22.5%
## GCV = 0.10818  Scale est. = 0.10366   n = 132

The plots below suggest that the association between chlorophyll concentrations and TN load (value.x) has not changed substantially over time (decdate) at Feather Sound. Comparisons across sub-segments require consideration of the te() coefficient surfaces as well as each model’s estimated intercept.

  # plot smooth
  par(mfrow=c(1,2))
  plot.gam( gam3, scheme = 1)
  plot.gam( gam3, scheme = 2)

top

E4 (Gateway)

The model shows good diagnostics.

  # join TN load data to chlorophyll data
  gamdat_E4 <- inner_join( x = pcwq2[ which( pcwq2$param=='Chla' & pcwq2$site=="E4" ), ],
                           y = loads,
                           by = 'date'
                          )
  nrow(gamdat_E4)
## [1] 126
  # decimal date
  gamdat_E4$decdate <- gamdat_E4$date |> decimal_date()
  # fit gam
  gam4 <- gam(  log10(value.y) ~ te( decdate, value.x, k=11 ), data = gamdat_E4 )
  # model diagnostics
  par(mfrow=c(2,2))
  gam.check( gam4, rep = 1000 )

## 
## Method: GCV   Optimizer: magic
## Smoothing parameter selection converged after 40 iterations.
## The RMS GCV score gradient at convergence was 6.314619e-06 .
## The Hessian was positive definite.
## Model rank =  121 / 121 
## 
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
## 
##                         k'    edf k-index p-value
## te(decdate,value.x) 120.00   5.71    0.97    0.31
  summary( gam4 )
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## log10(value.y) ~ te(decdate, value.x, k = 11)
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.52974    0.02616   58.47   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                       edf Ref.df     F p-value    
## te(decdate,value.x) 5.712  7.127 8.513  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.322   Deviance explained = 35.3%
## GCV = 0.091092  Scale est. = 0.08624   n = 126

The plots below suggest that the association between chlorophyll concentrations and TN load (value.x) has not changed substantially over time (decdate) at this sub-segment. Comparisons across sub-segments require consideration of the te() coefficient surfaces as well as each model’s estimated intercept.

  # plot smooth
  par(mfrow=c(1,2))
  plot.gam( gam4, scheme = 1)
  plot.gam( gam4, scheme = 2)

top

E5 (Weedon Island)

The model shows good diagnostics.

  # join TN load data to chlorophyll data
  gamdat_E5 <- inner_join( x = pcwq2[ which( pcwq2$param=='Chla' & pcwq2$site=="E5" ), ],
                           y = loads,
                           by = 'date'
                          )
  nrow(gamdat_E5)
## [1] 128
  # decimal date
  gamdat_E5$decdate <- gamdat_E5$date |> decimal_date()
  # fit gam
  gam5 <- gam(  log10(value.y) ~ te( decdate, value.x, k=11 ), data = gamdat_E5 )
  # model diagnostics
  par(mfrow=c(2,2))
  gam.check( gam5, rep = 1000 )

## 
## Method: GCV   Optimizer: magic
## Smoothing parameter selection converged after 16 iterations.
## The RMS GCV score gradient at convergence was 1.229579e-07 .
## The Hessian was positive definite.
## Model rank =  121 / 121 
## 
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
## 
##                        k'   edf k-index p-value
## te(decdate,value.x) 120.0  12.9    1.21    0.99
  summary( gam5 )
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## log10(value.y) ~ te(decdate, value.x, k = 11)
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.51629    0.02511   60.39   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                       edf Ref.df     F p-value    
## te(decdate,value.x) 12.94  16.78 5.789  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.421   Deviance explained =   48%
## GCV = 0.090549  Scale est. = 0.080691  n = 128

The plots below suggests a possible bifurcation in the association between chlorophyll concentrations and TN load (value.x) at Weedon Island over time (decdate), for moderate to high levels of TN load. Comparisons across sub-segments require consideration of the te() coefficient surfaces as well as each model’s estimated intercept.

  # plot smooth
  par(mfrow=c(1,2))
  plot.gam( gam5, scheme = 1)
  plot.gam( gam5, scheme = 2)

top


Chlorophyll and light

This section explores whether relationships between chlorophyll concentrations and Secchi depth have changed over time across the three OTB sub-segments divided by bridges/causeways.


North of Courtney Campbell

We assemble the EPC data and fit a GAM to model the chlorophyll-Secchi relationship over time. The model shows good diagnostics.

  # join TN load data to chlorophyll data
  epc1 <- epcwq2[ which( epcwq2$site %in% c(46,47,60,61,62,64) ), ]  # subset EPC data by station
  epc1.chl <- epc1[ which( epc1$param=="Chla" ), ]
  epc1.chl <- epc1.chl |> group_by(date) |> summarize(value=mean(value)) |> as.data.frame()
  epc1.sec <- epc1[ which( epc1$param=="Secchi" ), ]
  epc1.sec <- epc1.sec |> group_by(date) |> summarize(value=mean(value)) |> as.data.frame()
  gamdat_epc1 <- inner_join( x = epc1.chl, y = epc1.sec, by = 'date' )
  nrow(gamdat_E2)
## [1] 132
  # decimal date
  gamdat_epc1$decdate <- gamdat_epc1$date |> decimal_date()
  # fit gam
  gam6 <- gam(  log10(value.y) ~ te( decdate, value.x, k=11 ), data = gamdat_epc1 )
  # model diagnostics
  par(mfrow=c(2,2))
  gam.check( gam6, rep = 1000 )

## 
## Method: GCV   Optimizer: magic
## Smoothing parameter selection converged after 10 iterations.
## The RMS GCV score gradient at convergence was 8.176381e-08 .
## The Hessian was positive definite.
## Model rank =  121 / 121 
## 
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
## 
##                        k'   edf k-index p-value
## te(decdate,value.x) 120.0  34.2       1    0.46
  summary( gam6 )
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## log10(value.y) ~ te(decdate, value.x, k = 11)
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.137672   0.004366   31.54   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                       edf Ref.df     F p-value    
## te(decdate,value.x) 34.19  43.53 4.091  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.371   Deviance explained = 44.7%
## GCV = 0.0061972  Scale est. = 0.005432  n = 285

The plots below visualize the estimated te() coefficient surface and suggest that the association between Secchi depth and chlorophyll concentration (value.x) has changed substantially over time (decdate). Redder areas (between approximately 2000–2005 and 2010–2015) in the heatmap indicate time periods when increased chlorophyll concentrations were more strongly associated with reduced water clarity.

  # plot smooth
  par(mfrow=c(1,2))
  plot.gam( gam6, scheme = 1)
  plot.gam( gam6, scheme = 2)

top

Between Howard Frankland and Courtney Campbell

The model shows good diagnostics.

 # join TN load data to chlorophyll data
  epc2 <- epcwq2[ which( epcwq2$site %in% c(40,41,42,63,65,66) ), ]
 epc2.chl <- epc2[ which( epc2$param=="Chla" ), ]
 epc2.chl <- epc2.chl |> group_by(date) |> summarize(value=mean(value)) |> as.data.frame()
 epc2.sec <- epc2[ which( epc2$param=="Secchi" ), ]
 epc2.sec <- epc2.sec |> group_by(date) |> summarize(value=mean(value)) |> as.data.frame()
 gamdat_epc2 <- inner_join( x = epc2.chl, y = epc2.sec, by = 'date' )
 nrow(gamdat_E2)
## [1] 132
 # decimal date
 gamdat_epc2$decdate <- gamdat_epc2$date |> decimal_date()
 # fit gam
 gam7 <- gam(  log10(value.y) ~ te( decdate, value.x, k=11 ), data = gamdat_epc2 )
 # model diagnostics
 par(mfrow=c(2,2))
 gam.check( gam7, rep = 1000 )

## 
## Method: GCV   Optimizer: magic
## Smoothing parameter selection converged after 6 iterations.
## The RMS GCV score gradient at convergence was 3.698484e-06 .
## The Hessian was positive definite.
## Model rank =  121 / 121 
## 
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
## 
##                        k'   edf k-index p-value
## te(decdate,value.x) 120.0  30.1    1.03    0.65
 summary( gam7 )
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## log10(value.y) ~ te(decdate, value.x, k = 11)
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.246518   0.005706    43.2   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                       edf Ref.df     F p-value    
## te(decdate,value.x) 30.14  37.86 9.754  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.563   Deviance explained = 60.9%
## GCV = 0.010418  Scale est. = 0.0092794  n = 285

The plots below suggest that the association between Secchi depth and chlorophyll concentration (value.x) has changed substantially over time (decdate) at this sub-segment. Redder areas in the heatmap indicate time periods when increased chlorophyll concentrations were more strongly associated with reduced water clarity. Although the heatmap indicates gaps in some high-chlorophyll areas of the input space, the temporal pattern in these results noticeably resembles those of the previous model (gam6). Nonetheless, quantitative comparisons across sub-segments require consideration of the te() coefficient surfaces as well as each model’s estimated intercept.

  # plot smooth
  par(mfrow=c(1,2))
  plot.gam( gam7, scheme = 1)
  plot.gam( gam7, scheme = 2)

top

Between Gandy Blvd and Howard Frankland

The model shows good diagnostics.

  # join TN load data to chlorophyll data
  epc3 <- epcwq2[ which( epcwq2$site %in% c(38,50,51,67) ), ]
  epc3.chl <- epc3[ which( epc3$param=="Chla" ), ]
  epc3.chl <- epc3.chl |> group_by(date) |> summarize(value=mean(value)) |> as.data.frame()
  epc3.sec <- epc3[ which( epc3$param=="Secchi" ), ]
  epc3.sec <- epc3.sec |> group_by(date) |> summarize(value=mean(value)) |> as.data.frame()
  gamdat_epc3 <- inner_join( x = epc3.chl, y = epc3.sec, by = 'date' )
  nrow(gamdat_E2)
## [1] 132
  # decimal date
  gamdat_epc3$decdate <- gamdat_epc3$date |> decimal_date()
  # fit gam
  gam8 <- gam(  log10(value.y) ~ te( decdate, value.x, k=11 ), data = gamdat_epc3 )
  # model diagnostics
  par(mfrow=c(2,2))
  gam.check( gam8, rep = 1000 )

## 
## Method: GCV   Optimizer: magic
## Smoothing parameter selection converged after 8 iterations.
## The RMS GCV score gradient at convergence was 1.457851e-07 .
## The Hessian was positive definite.
## Model rank =  121 / 121 
## 
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
## 
##                        k'   edf k-index p-value
## te(decdate,value.x) 120.0  22.6    1.05    0.86
  summary( gam8 )
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## log10(value.y) ~ te(decdate, value.x, k = 11)
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.313710   0.005222   60.07   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                       edf Ref.df     F p-value    
## te(decdate,value.x) 22.55  28.58 10.38  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.508   Deviance explained = 54.7%
## GCV = 0.0084728  Scale est. = 0.0077726  n = 285

The plots below suggest that the association between Secchi depth and chlorophyll concentration has changed substantially over time at this sub-segment. Comparisons across sub-segments require consideration of the te() coefficient surfaces as well as each model’s estimated intercept.

  # plot smooth
  par(mfrow=c(1,2))
  plot.gam( gam8, scheme = 1)
  plot.gam( gam8, scheme = 2)

top


References

Consulting reports and peer-reviewed studies:

Greening H & Janicki A (2006). Toward reversal of eutrophic conditions in a subtropical Estuary: Water quality and seagrass response to nitrogen loading reductions in Tampa Bay, Florida, USA. Environmental Management 38: 163-178. http://doi.org/10.1007/s00267-005-0079-4.

Greening H, Janicki A, Sherwood ET, Pribble R & Johansson, JOR (2014). Ecosystem responses to long-term nutrient management in an urban estuary: Tampa Bay, Florida, USA. Estuarine, Coastal and Shelf Science http://dx.doi.org/10.1016/j.ecss.2014.10.003

Janicki A & Wade D (1996). Estimating critical external nitrogen loads for the Tampa Bay Estuary: An empirically based approach to setting management targets, Final Report.

Morrison G, Janicki A, Wade D, Martin J & Vargo G (1996). Estimated nitrogen fluxes and nitrogen-chlorophyll relationships in Tampa Bay, 1985-1994. Tampa Bay Area Study Group Project Reports: 128. https://digitalcommons.usf.edu/basgp_report/128.

Software:

Grolemund G, Wickham H (2011). Dates and Times Made Easy with lubridate. Journal of Statistical Software 40(3), 1-25. https://www.jstatsoft.org/v40/i03/.

R Core Team (2023). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/.

Wickham H (2011). The Split-Apply-Combine Strategy for Data Analysis. Journal of Statistical Software 40(1): 1-29. URL https://www.jstatsoft.org/v40/i01/.

Wickham H, Francois R, Henry L, Muller K, Vaughan D (2023). dplyr: A Grammar of Data Manipulation. R package version 1.1.2, https://CRAN.R-project.org/package=dplyr)

Wickham H, Vaughan D, Girlich M (2023). tidyr: Tidy Messy Data. R package version 1.3.0, https://CRAN.R-project.org/package=tidyr.

Wood, SN (2017) Generalized Additive Models: An Introduction with R (2nd edition). Chapman and Hall/CRC.