How Report Card Scores Are Calculated

Overview

Seagrass transects are monitored in Tampa Bay each year by multiple resource management agencies. To support these efforts, the Tampa Bay Estuary Program coordinates an annual intercalibration training prior to the field season. This intercalibration effort ensures that all participating groups are using consistent methods and interpretations. Seagrasses are independently assessed by each group at the same locations and the results are compared using report cards. The intent of these report cards is to highlight areas where each group is doing well and where efforts could be improved to ensure consistency. This document describes how the report cards are prepared and what the scores mean.

Because there is no single external ground truth, scores are based on how consistently groups agree with each other. A group that reports values close to the cross-group average for the year earns a high score and one that deviates substantially earns a lower score.

Scores are calculated for three field measurements for each seagrass species:

Abundance: Braun-Blanquet (BB) cover category (0 = No cover, 0.1 = Solitary, 0.5 = Few, 1 = <5%, 2 = 5-25%, 3 = 25-50%, 4 = 51-75%, 5 = 76-100%)
Blade Length: average blade length in cm
Short Shoot Density: shoots per m²

Whether each group correctly identified which species were present is also factored into the abundance score. These three metric scores are averaged into an overall total score, which is then converted to a letter grade.

Note that “transect” is used here to describe the intercalibration sites during training, which are quadrats at fixed locations. During actual transect sampling, sites are located at set meter marks along a transect line.

The steps below outline the key methods used to calculate group scores. The main points are:

Species identified by at least two groups count towards the “true” species list.
Scores are based on consensus among groups. Groups that deviate from the consensus (using deviation from the average) receive lower scores.
Scores consider variability across transects. Deviations for a metric where variability is high across transects aren’t weighted as heavily when calculating the score. Likewise, deviations for metrics with lower variability are penalized more heavily.
Overall scores are mapped to letter grades using a traditional scoring curve. The group with the smallest deviations from the averages automatically gets the highest grade and the group with the largest deviations automatically gets the lowest grade. However, the floor of the spread can be raised in years when all groups perform similarly.

Step 1: The Consensus Species List

Not every species recorded across all groups counts toward scoring. A species at a given transect (site/quadrat) is considered present if at least two distinct groups reported it with non-zero abundance. This removes likely misidentifications while remaining inclusive: a species does not need to be reported unanimously, just corroborated. Species identifications include seagrass and macroalgae, but only the former is used to assess abundance, blade length, and short shoot density.

Table 1: Consensus species at each transect in 2025, defined as species reported by at least two groups.

Transect	Species count (richness)	Species
3	2	Halodule, Thalassia
6	2	DR: Chondria, Thalassia
7	5	DR: Acanthophora, DR: Chondria, Halodule, Syringodium, Thalassia
8	3	DR: Acanthophora, Halodule, Thalassia
9	4	DR: Chondria, DR: Hypnea, Syringodium, Thalassia
10	1	Halodule

The consensus list is created per transect, so a species may be on the list at one transect but not another.
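The consensus rule can be sketched in a few lines of Python. This is a minimal illustration rather than the code used to produce the report cards: the function name `consensus_species`, the tuple layout of `records`, and the example values are all assumptions made for the sketch.

```python
from collections import defaultdict


def consensus_species(records, min_groups=2):
    """Return {transect: sorted species reported by at least min_groups groups}.

    `records` is an iterable of (transect, group, species, bb_abundance)
    tuples. This layout is hypothetical and chosen only for illustration.
    """
    # (transect, species) -> set of groups reporting non-zero abundance
    reporters = defaultdict(set)
    for transect, group, species, abundance in records:
        if abundance > 0:
            reporters[(transect, species)].add(group)

    consensus = defaultdict(list)
    for (transect, species), groups in reporters.items():
        if len(groups) >= min_groups:
            consensus[transect].append(species)
    return {t: sorted(spp) for t, spp in consensus.items()}


# Hypothetical records, not the 2025 training data
records = [
    (10, "A", "Halodule", 2),
    (10, "B", "Halodule", 3),
    (10, "C", "Thalassia", 0.5),   # one group only: not corroborated
    (10, "A", "Syringodium", 0),   # zero abundance is not a report
    (10, "B", "Syringodium", 1),
]
result = consensus_species(records)
print(result)  # {10: ['Halodule']}
```

Using a set of group names per species means that a group reporting the same species twice at one transect is still counted once, which matches the "two distinct groups" wording.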

Step 2: True Values

For each consensus species at each transect, true values are calculated as the cross-group average:

Abundance: group Braun-Blanquet values (0, 0.1, 0.5, 1, 2, 3, 4, 5) are averaged directly across groups, then snapped to the nearest valid BB value (no coverage, solitary, few, <5%, 5-25%, 25-50%, 51-75%, 76-100%).
Blade Length and Short Shoot Density: simple means across groups.

Only consensus species enter this calculation. This prevents a misidentified species (recorded by only one group) from distorting the true values for everyone else.
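The averaging and snapping of abundance values might look like the sketch below. The names `BB_VALUES` and `true_abundance` are hypothetical, and the tie-breaking rule (an average exactly halfway between two BB values snaps to the lower one) is an assumption, since the text does not say how ties are resolved.

```python
BB_VALUES = (0, 0.1, 0.5, 1, 2, 3, 4, 5)  # valid Braun-Blanquet scores


def true_abundance(group_values):
    """Average BB scores across groups, then snap to the nearest valid score.

    Ties go to the lower BB value because min() keeps the first minimum.
    That choice is an assumption of this sketch.
    """
    mean = sum(group_values) / len(group_values)
    return min(BB_VALUES, key=lambda v: abs(v - mean))


def true_measurement(group_values):
    """Blade length and short shoot density use a simple mean across groups."""
    return sum(group_values) / len(group_values)


# Hypothetical group reports for one species at one transect
print(true_abundance([2, 3, 3]))        # mean 2.67 snaps to 3
print(true_abundance([0.5, 1, 2]))      # mean 1.17 snaps to 1
print(true_measurement([20.0, 22.0]))   # 21.0
```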

Table 2: Example “true” values at one transect in 2025 determined by cross-group averages for consensus species.

Transect	Species	Abundance (0, 0.1, 0.5, 1, 2, 3, 4, 5)	Blade Length (cm)	Short Shoot Density (per m²)
8	DR: Acanthophora	1	NA	NA
8	Halodule	3	21.3	5.5
8	Thalassia	2	28.5	1.9

Step 3: Group Deviations and Species ID Penalties

For each group, reported values are compared to the true values across all consensus species and transects. This comparison reveals two types of species identification errors:

Missed species: a species is present but the group did not record it. The group’s abundance for that species is treated as “no coverage” (BB value 0) when computing the deviation. The penalty scales with the true abundance of the missed species: missing a rare species is a small error, while missing a dominant one is a large error.

False positives: the group recorded a species that is not on the consensus list (i.e., no other group confirmed it). The true abundance is treated as “no coverage” (BB value 0). The penalty scales with how high the group’s reported abundance was for that species.

These penalties only affect the abundance score. Blade Length and Short Shoot Density cannot be meaningfully penalised for species that were not found.

For Abundance, deviations are computed as differences in ordinal category positions rather than in raw Braun-Blanquet numeric values. The eight BB categories are treated as equally spaced steps (1 = no coverage through 8 = 76–100%), so a deviation of 1 means the group was one category apart from the true value, regardless of the numeric gap between those categories on the BB scale. This implicitly assumes that there is similar difficulty in distinguishing between lower abundances (e.g., solitary vs few) and higher abundances (e.g., 51-75% vs 76-100%).
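As an illustration of the ordinal comparison and the two species ID penalties, the sketch below applies them to the group C rows shown in Table 3. The names `BB_POSITION` and `abundance_deviation` are hypothetical, as is the use of `None` to mark a species that a group did not record or that is not on the consensus list. The sign convention (reported minus true) follows the deviations shown in Table 9.

```python
BB_VALUES = (0, 0.1, 0.5, 1, 2, 3, 4, 5)
# Ordinal positions: 1 = no coverage through 8 = 76-100%
BB_POSITION = {value: i + 1 for i, value in enumerate(BB_VALUES)}


def abundance_deviation(reported, true):
    """Deviation in ordinal category steps (reported minus true).

    None stands for "not recorded" (missed species) or "not on the
    consensus list" (false positive); both are treated as BB value 0.
    """
    reported = 0 if reported is None else reported
    true = 0 if true is None else true
    return BB_POSITION[reported] - BB_POSITION[true]


# Group C at transect 8 (Table 3), expressed as BB scores
rows = {
    "DR: Acanthophora": (None, 1),   # missed species, true value <5%
    "DR: Laurencia": (0.5, None),    # false positive, reported as few
    "Halodule": (2, 3),
    "Thalassia": (2, 2),
}
deviations = {sp: abundance_deviation(r, t) for sp, (r, t) in rows.items()}
print(deviations)
# {'DR: Acanthophora': -3, 'DR: Laurencia': 2, 'Halodule': -1, 'Thalassia': 0}
```

The missed species costs three category steps because its true value (<5%) sits three positions above no coverage, while the false positive costs two steps because "few" sits two positions above no coverage.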

Table 3: Reported vs. true values for group “C” at one transect in 2025. Note the missed species in row 1 and the false positive in row 2.

Transect	Species	Abundance reported	Abundance true	Blade Length reported	Blade Length true	Short Shoot Density reported	Short Shoot Density true
8	DR: Acanthophora	—	<5%	NA	NA	NA	NA
8	DR: Laurencia	few	—	NA	NA	NA	NA
8	Halodule	5-25%	25-50%	15.2	21.3	4.7	5.5
8	Thalassia	5-25%	5-25%	21.8	28.5	1.3	1.9

Step 4: Metric Scores

For each metric (abundance, blade length, short shoot density), deviations from the true value are summarised per species across transects, then combined into a single number per group per metric. The combination uses a weighted mean of absolute differences, where the weight for each species is the inverse of one plus the standard deviation of the true values across transects:

\[ \text{metric score}_{\text{raw}} = \frac{N}{D} \]

where \(N\) is the weighted sum of absolute deviations across species:

\[ N = \sum_s w_s\,|\bar{d}_s| \]

and \(D\) is the total weight across species:

\[ D = \sum_s w_s \]

For both \(N\) and \(D\), the \(w_s\) component is the per-species weight:

\[ w_s = \frac{1}{1 + \sigma_s} \]

\(\bar{d}_s\) is the mean deviation for species \(s\) across transects (its absolute value enters \(N\)) and \(\sigma_s\) is the standard deviation of the true values for that species across transects. It is important to note that \(\sigma_s\) reflects the spatial variability of a species’ true value across transects. It is a property of the site, not of the group’s performance. A group can deviate substantially from a species whose true value is identical at every transect (\(\sigma_s = 0\)), because the two quantities are independent: \(\sigma_s\) describes how consistent the correct answer is across space, while \(|\bar{d}_s|\) describes how far the group was from that answer.

The units of \(\sigma_s\) differ by metric. For Abundance it is in ordinal category positions (the same scale as \(|\bar{d}_s|\)), while for Blade Length and Short Shoot Density it is in actual measurement units (cm and shoots per m², respectively). When \(\sigma_s\) cannot be computed, e.g., when a species is present at only one transect and the standard deviation is undefined, it is treated as 0, giving that species a weight of 1 (full weight in the scoring).

Species where the true value varies a lot across transects (high \(\sigma_s\)) receive a lower weight \(w_s\), so their deviations contribute less to \(N\). Dividing by \(D\) normalises the result so it stays on the scale of a typical deviation regardless of how many species are assessed.

Table 4: Per-species deviation summary of the abundance metric across sites for group ‘C’ in 2025.

Species	Reported (avg)	True (avg)	Mean deviation
Halodule	5-25%	25-50%	-1
Syringodium	<5%	<5%	0
Thalassia	<5%	5-25%	-1

To see how these combine into a raw metric score, consider the Abundance values for group C from Table 4. Taking the absolute values of the mean deviations, the three species have \(|\bar{d}_s|\) of 1, 0, 1 and \(\sigma_s\) of 0, 1, 1 (the standard deviation of the true values across transects for each species, not shown in the table). The weights \(w_s = 1/(1+\sigma_s)\) are 1, 0.5, 0.5, giving per-species contributions \(w_s|\bar{d}_s|\) of 1, 0, 0.5. Summing across species:

\[N = 1 + 0 + 0.5 = 1.5\]

\[D = 1 + 0.5 + 0.5 = 2\]

The raw Abundance score is \(N/D = 1.5/2 = 0.75\), matching the Abundance value shown in Table 10 (see the full worked example below). This raw score is then scaled to a numeric score on a 0–100 scale using the range of scores across groups, which is then converted to a letter grade (see next steps).
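The weighted mean from Step 4 can be written as a short function. This is a sketch under stated assumptions: the name `raw_metric_score` is hypothetical, and an undefined \(\sigma_s\) is passed as `None` and treated as 0 so that the species receives full weight. The inputs are the group C values from the worked example.

```python
def raw_metric_score(mean_devs, sigmas):
    """Weighted mean of absolute per-species deviations (N / D).

    mean_devs: mean deviation per species (signed)
    sigmas:    SD of the true values across transects per species;
               None (undefined SD) is treated as 0, i.e. weight 1.
    """
    weights = [1.0 / (1.0 + (0.0 if s is None else s)) for s in sigmas]
    n = sum(w * abs(d) for w, d in zip(weights, mean_devs))  # N
    d_total = sum(weights)                                   # D
    return n / d_total


# Group C in 2025: Halodule, Syringodium, Thalassia
abundance = raw_metric_score([-1, 0, -1], [0, 1, 1])
blade_length = raw_metric_score([2.4, 7.2, 2.4], [11.5, 9.8, 10.3])
shoot_density = raw_metric_score([1.1, -0.4, -0.3], [2.5, 0.0, 1.3])

print(round(abundance, 3))      # 0.75
print(round(blade_length, 3))   # 4.102
print(round(shoot_density, 3))  # 0.491
```

Lower raw scores are better, because the raw score is an average deviation from the true values.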

When weighting matters

The weight \(w_s\) produces a more appropriate score than a simple unweighted mean. Whether the weighted N/D ends up lower or higher than the unweighted N/D depends on whether the high-\(\sigma_s\) species carry above-average or below-average deviations. The two scenarios below use two hypothetical species to illustrate both cases.

Scenario 1: weighting reduces (improves) the raw score

A group deviates little from a consistent species (\(\sigma = 0\)) but substantially from a highly variable one (\(\sigma = 3\)). The weight discounts Species B’s large deviation because natural variability in true values across transects makes precise agreement inherently harder.

Table 5: Scenario 1 showing large group deviation on the variable species. Weighted N/D (blue) is much lower than unweighted N/D (red), reflecting an appropriately reduced penalty for missing a naturally hard-to-pin-down species.

	σ_s	\|d̅_s\|	w_s	Weighted w_s × \|d̅_s\|	Unweighted \|d̅_s\|
Species A (consistent, σ = 0)	0	0.5	1.00	0.50	0.50
Species B (variable, σ = 3)	3	4.0	0.25	1.00	4.00
N				1.50	4.50
D				1.25	2.00
N / D				1.20	2.25

Without weighting, Species B’s deviation of 4.0 dominates the score (unweighted N/D = 2.25). With weighting, that deviation is reduced in influence because Species B is naturally variable, yielding a weighted N/D of 1.20, a substantially better score. The group missed a hard target, and the scoring system appropriately reduces the penalty for that miss.

Scenario 2: weighting raises (penalizes) the raw score

Now the same group deviates heavily from the consistent species but nearly matches the variable one. Because agreement on a consistent species carries more information, the weighted mean penalizes the miss more harshly than the unweighted mean would.

Table 6: Scenario 2 showing large deviation on the consistent species. Weighted N/D (blue) is higher than unweighted N/D (red), appropriately penalising a large miss on the species where all groups should agree.

	σ_s	\|d̅_s\|	w_s	Weighted w_s × \|d̅_s\|	Unweighted \|d̅_s\|
Species A (consistent, σ = 0)	0	3.0	1.00	3.000	3.00
Species B (variable, σ = 3)	3	0.5	0.25	0.125	0.50
N				3.125	3.50
D				1.250	2.00
N / D				2.500	1.75

Without weighting, Species B’s small deviation (0.5) helps dilute the score (unweighted N/D = 1.75). With weighting, that good performance on a naturally variable species counts for less, yielding a weighted N/D of 2.50. This is the appropriate outcome since a large miss on a species where all groups should agree is a genuine measurement error and should not be offset by being close on a species that varies widely by nature.

The weight \(w_s = 1/(1+\sigma_s)\) ensures that consistent species where cross-group agreement is expected drive the score more than naturally variable species where some disagreement is expected regardless. The weighted mean does not systematically produce better or worse scores than a simple average. Whether it is higher or lower depends on the data. What it always does is produce scores that more accurately reflect genuine measurement quality.
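The two scenarios can be reproduced side by side with the short sketch below, which returns both the weighted and the unweighted mean for the same inputs. The function name `weighted_and_unweighted` is hypothetical and the inputs are the hypothetical values from Tables 5 and 6.

```python
def weighted_and_unweighted(abs_devs, sigmas):
    """Return (weighted N/D, unweighted N/D) for per-species |d| and sigma."""
    weights = [1.0 / (1.0 + s) for s in sigmas]
    weighted = sum(w * d for w, d in zip(weights, abs_devs)) / sum(weights)
    unweighted = sum(abs_devs) / len(abs_devs)
    return weighted, unweighted


# Species A is consistent (sigma = 0), Species B is variable (sigma = 3)
scenario_1 = weighted_and_unweighted([0.5, 4.0], [0, 3])  # large miss on B
scenario_2 = weighted_and_unweighted([3.0, 0.5], [0, 3])  # large miss on A

print([round(x, 2) for x in scenario_1])  # [1.2, 2.25]
print([round(x, 2) for x in scenario_2])  # [2.5, 1.75]
```

Swapping which species carries the large deviation flips the direction of the difference, which is the point of the two scenarios: the weighting changes which misses count most, not the overall level of the scores.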

Step 5: Score Calibration

Raw metric scores are converted to a 0–100 scale by linear rescaling between the best and worst groups in a year. Without calibration, the best group in any year always maps to 100 and the worst always maps to 50 (a fixed minimum score), regardless of how closely groups agreed. A year in which every group performed very well would therefore still produce a spread from A to D, which would be unfair to the lowest-ranked groups.

To address this, the score floor (the minimum possible score) is raised in years when all groups agree closely with each other, and kept at 50 in years when disagreement is high.

How calibration works

For each year and metric, we compute the within-year standard deviation of group deviations to assess the spread between groups. This is then expressed as a ratio to the historical mean spread across all training years. A ratio below 1 means groups agreed more closely than usual and a ratio above 1 means more disagreement than usual.

The score floor for a given year and metric is:

\[ \text{floor}_{\text{year}} = \max\!\left(50,\ 50 + \left(1 - \frac{\text{SD}_{\text{year}}}{\overline{\text{SD}}}\right) \times 50\right) \]

where \(\text{SD}_{\text{year}}\) is the within-year spread for that metric and year, \(\overline{\text{SD}}\) is its historical mean, and the multiplier of 50 is a scaling constant that sets the maximum possible floor lift in grade points. The scaling constant is a subjective choice. Larger values compress scores toward each other in tight years, while smaller values preserve more spread. The maximum lift occurs when all groups agree perfectly (ratio = 0), giving a floor of \(50 + 50 = 100\), which corresponds to a minimum grade of A (all groups performed exactly the same). A year at the historical average (ratio = 1) receives no adjustment and keeps a floor of 50. Loose years (ratio > 1) would yield a value below 50, so the floor is held at 50 and no extra penalty is applied beyond the standard range.
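The floor formula translates directly into code. In the sketch below, the function name `score_floor` and the parameter names `base` and `max_lift` are hypothetical. The example ratios are those reported for group C's year in Table 11, passed as a spread of that size against a historical mean of 1.

```python
def score_floor(sd_year, sd_hist_mean, base=50.0, max_lift=50.0):
    """Calibrated score floor for one metric in one year.

    sd_year:      within-year SD of group deviations
    sd_hist_mean: mean of that SD across all training years
    """
    ratio = sd_year / sd_hist_mean
    # Tight years (ratio < 1) lift the floor; loose years stay at the base.
    return max(base, base + (1.0 - ratio) * max_lift)


# Ratios for 2025 from Table 11
for metric, ratio in [("Abundance", 0.82),
                      ("Blade Length", 1.30),
                      ("Short Shoot Density", 0.08)]:
    print(metric, round(score_floor(ratio, 1.0)))
# Abundance 59
# Blade Length 50
# Short Shoot Density 96
```

Exposing `max_lift` as a parameter makes the subjective scaling constant explicit: a smaller value would preserve more spread between groups in tight years.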

Figure 1: Within-year spread of group deviations for each metric and year. Bar colour shows the ratio of each year’s spread to the historical mean (dashed line). Bar labels show the ratio and the resulting score floor. Blue (ratio < 1) years are tighter than average and receive a higher score floor.

Effect on scores: tight vs. loose year

The table below compares the calibrated score floor for each year and metric, illustrating how the floor shifts in tighter training years.

Table 7: Calibrated score floor by year and metric. Years with consistently tight agreement receive a higher floor, raising the minimum grade for all groups in that year.

Year	Abundance	Blade Length	Short Shoot Density
2020	50	78	50
2021	52	50	73
2022	50	50	50
2023	50	66	73
2024	50	64	83
2025	59	50	96

Step 6: Letter Grades

After calibration, each group’s numeric score for each metric falls on a 0–100 scale, with the lowest possible score in a given year defined by the calibrated score floor. These are mapped to letter grades using fixed thresholds:

Table 8: Letter grade thresholds.

Grade	Score range
A	95 – 100
A-	90 – 94
B+	85 – 89
B	80 – 84
B-	75 – 79
C+	70 – 74
C	65 – 69
C-	60 – 64
D+	55 – 59
D	below 55

The Total score is the unweighted average of the Abundance, Blade Length, and Short Shoot Density numeric scores, then converted to a letter grade using the same thresholds.
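A sketch of the grade lookup and the Total score is given below. The names `GRADE_LOWER_BOUNDS`, `letter_grade`, and `total_score` are hypothetical. Fractional scores are assigned by lower bound (for example, 59.2 falls in D+), which is an assumption consistent with the grades shown for group C in Table 10.

```python
# Lower bound of each grade band, highest first (from the thresholds table)
GRADE_LOWER_BOUNDS = [
    (95, "A"), (90, "A-"), (85, "B+"), (80, "B"), (75, "B-"),
    (70, "C+"), (65, "C"), (60, "C-"), (55, "D+"),
]


def letter_grade(score):
    """Map a 0-100 numeric score to a letter grade."""
    for lower, grade in GRADE_LOWER_BOUNDS:
        if score >= lower:
            return grade
    return "D"  # anything below 55


def total_score(abundance, blade_length, shoot_density):
    """Unweighted average of the three metric scores."""
    return (abundance + blade_length + shoot_density) / 3


# Group C in 2025 (numeric scores from Table 10)
total = total_score(59.2, 87.0, 97.8)
print(letter_grade(59.2), letter_grade(87.0), letter_grade(97.8))  # D+ B+ A
print(round(total, 1), letter_grade(total))                        # 81.3 B
```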

Worked Example

The following walks through the full scoring workflow for group C in 2025.

Raw deviations

Table 9: Per-species deviations across all metrics for group C in 2025.

Species	Reported avg	True avg	Mean deviation
Abundance
Halodule	5.0	6.0	-1.0
Syringodium	4.0	4.0	0.0
Thalassia	4.0	5.0	-1.0
Blade Length
Halodule	15.6	13.2	2.4
Syringodium	21.1	13.9	7.2
Thalassia	20.4	18.0	2.4
Short Shoot Density
Halodule	8.3	7.2	1.1
Syringodium	0.3	0.7	-0.4
Thalassia	1.3	1.6	-0.3

Scores

For each metric, the per-species rows show the absolute mean deviation (\(|\bar{d}_s|\)), the SD of true values across sites (\(\sigma_s\)), and each species’ contribution to the weighted mean raw score (\(w_s\) and \(|\bar{d}_s| \times w_s\)). The “Raw score” row shows the weighted mean (computed using the formula in Step 4), the calibrated score floor, the rescaled numeric score, and the letter grade.

Table 10: Score derivation for group C in 2025. Per-species rows show the elements of the weighted mean formula from Step 4. The ‘Raw score’ row shows the weighted mean and its conversion to a calibrated numeric score and letter grade via the score floor.

Species	\|d̅_s\|	σ_s	w_s = 1/(1+σ_s) (sums to D)	\|d̅_s\| × w_s (sums to N)	Raw score (N/D)	Score floor	Numeric score	Letter grade
Abundance
Halodule	1.00	0.00	1.00	1.00	—	—	—	—
Syringodium	0.00	1.00	0.50	0.00	—	—	—	—
Thalassia	1.00	1.00	0.50	0.50	—	—	—	—
Raw score (weighted mean)	—	—	2.00	1.50	0.750	59	59.2	D+
Blade Length
Halodule	2.40	11.50	0.08	0.19	—	—	—	—
Syringodium	7.20	9.80	0.09	0.67	—	—	—	—
Thalassia	2.40	10.30	0.09	0.21	—	—	—	—
Raw score (weighted mean)	—	—	0.26	1.07	4.102	50	87.0	B+
Short Shoot Density
Halodule	1.10	2.50	0.29	0.31	—	—	—	—
Syringodium	0.40	0.00	1.00	0.40	—	—	—	—
Thalassia	0.30	1.30	0.43	0.13	—	—	—	—
Raw score (weighted mean)	—	—	1.72	0.84	0.491	96	97.8	A
Total
Average of three metric scores	—	—	—	—	—	—	81.3	B
For the Raw score row: the w_s column shows D = Σw_s and the \|d̅_s\| × w_s column shows N = Σ(\|d̅_s\| × w_s). Raw score = N/D.

The figure below shows how raw scores map to numeric scores across all groups for each metric. The grey line is the linear rescaling anchored at 100 for the best group and at the score floor for the worst. Group C is highlighted in blue.

Figure 2: Raw score to numeric score rescaling for all groups in 2025 by metric. Each dot is one group and group C (blue) is highlighted with drop lines to both axes. The grey line is the linear mapping from the best group (lowest raw deviation, numeric score 100) to the worst (highest raw deviation, numeric score = score floor). The red dashed line shows the calibrated score floor.
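The linear mapping shown in the figure can be sketched as follows. The function name `rescale`, the raw scores in the example, and the handling of the case where all groups have identical raw scores (every group receives 100) are assumptions made for illustration.

```python
def rescale(raw, best_raw, worst_raw, floor=50.0):
    """Map a raw deviation score onto the [floor, 100] numeric scale.

    best_raw is the lowest raw score in the year (maps to 100) and
    worst_raw is the highest (maps to the score floor).
    """
    if worst_raw == best_raw:
        return 100.0  # assumption: identical groups all receive full marks
    fraction = (worst_raw - raw) / (worst_raw - best_raw)
    return floor + (100.0 - floor) * fraction


# Hypothetical raw scores for three groups with a calibrated floor of 59
raw_scores = {"X": 0.2, "Y": 0.6, "Z": 1.0}
best, worst = min(raw_scores.values()), max(raw_scores.values())
numeric = {g: rescale(r, best, worst, floor=59.0) for g, r in raw_scores.items()}
print({g: round(v, 1) for g, v in numeric.items()})
# {'X': 100.0, 'Y': 79.5, 'Z': 59.0}
```

Raising the floor compresses every group's score toward 100 by the same proportion, so the ranking of groups is unchanged by calibration.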

How the calibration affected this group’s scores

Table 11: Effect of calibration on scores for group C in 2025. The floor is the lowest possible score any group could receive in this year for each metric.

Metric	Score without calibration	Score with calibration	Ratio (within-year spread / historical mean)	Score floor
Abundance	50.0	59.2	0.82	59
Blade Length	87.0	87.0	1.30	50
Short Shoot Density	71.6	97.8	0.08	96

A ratio below 1 (a tighter-than-average cohort) raises the floor for all groups, including this one. A floor of 50 means no calibration adjustment was applied.