Below are several resources and definitions that may be helpful in developing a data management plan. This is by no means an exhaustive list. These are resources that we have either curated specifically for TBEP or resources we’ve found helpful in our own journey.
- Software Carpentry: R for Reproducible Scientific Analysis
- Data Carpentry: Geospatial Workshop
- Data Carpentry: R for Data Analysis and Visualization of Ecological Data
- Data Carpentry: Data Organization in Spreadsheets
- Reproducible Reporting with R (R³) for Marine Ecological Indicators
- R for Water Resources Data Science
- RStudio Webinars, many topics
- R For Cats: Basic introduction site, with cats!
- Topical cheatsheets from RStudio, also viewable from the RStudio Help menu
- Cheatsheet from CRAN of base R functions
- Totally awesome R-related artwork by Allison Horst
- Color reference PDF with text names, Color cheatsheet PDF from NCEAS
- Jenny Bryan’s Stat545.com
- Garrett Grolemund and Hadley Wickham’s R For Data Science
- Chester Ismay and Albert Y. Kim’s ModernDive
- Julia Silge and David Robinson’s Text Mining with R
- Hadley Wickham’s Advanced R
- Yihui Xie’s R Markdown: The Definitive Guide
- Winston Chang’s R Graphics Cookbook
- Wegmann et al.’s Remote Sensing and GIS for Ecologists: Using Open Source Software
- Lovelace et al.’s Geocomputation with R
- Edzer Pebesma and Roger Bivand’s Spatial Data Science
CI/CD: Continuous Integration/Continuous Deployment, describes web services that can be used to automate data checks, tests, or workflows. These are often integrated into existing platforms, such as GitHub Actions on GitHub.
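For instance, a recurring data check could be automated with GitHub Actions. The following is a minimal, hypothetical workflow sketch; the `check_data.py` script is an assumed placeholder for whatever validation a project uses:

```yaml
# .github/workflows/data-check.yml -- hypothetical example workflow
name: data-check
on: [push]          # run the check on every push to the repository
jobs:
  check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
      - run: python check_data.py   # hypothetical script validating the dataset
```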
Dashboard: Interactive and dynamic user interfaces available online that can be created to facilitate understanding or to inform decision-making. Many flavors exist, including R Shiny, Python Dash, or ArcGIS StoryMaps.
Data: A general term describing a variety of informational products that can support environmental decision-making or be used to test research hypotheses. Data can be as simple as a single spreadsheet or as complex as model output or parameters.
Database: An organized collection of data, where each piece of data can be linked and accessed electronically through keys that act as unique identifiers for units of information. These are often called relational databases.
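To illustrate, a minimal relational database can be built with Python's standard sqlite3 module; here a key (`station_id`) links a station table to a sample table. Table, column, and station names are hypothetical:

```python
import sqlite3

# In-memory database with two tables linked by a station_id key
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE stations (station_id TEXT PRIMARY KEY, lat REAL, lon REAL)")
con.execute("CREATE TABLE samples (station_id TEXT, date TEXT, chla REAL, "
            "FOREIGN KEY (station_id) REFERENCES stations (station_id))")
con.execute("INSERT INTO stations VALUES ('TB01', 27.76, -82.54)")
con.execute("INSERT INTO samples VALUES ('TB01', '2023-06-01', 8.2)")

# The key lets us join location and measurement information
row = con.execute(
    "SELECT s.date, s.chla, st.lat FROM samples s "
    "JOIN stations st ON s.station_id = st.station_id"
).fetchone()
print(row)  # ('2023-06-01', 8.2, 27.76)
```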
Data Dictionary: An informal description of the information included in a dataset, often used to place boundaries on expected values or data types. This includes column names, the type of data stored in each column, and the expected values for each column.
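For example, a data dictionary can be written as plain text or, as sketched here in Python, as a small structure listing each column, its type, and its expected values. The column names and ranges below are hypothetical:

```python
# Hypothetical data dictionary for a water-quality monitoring table
data_dictionary = {
    "station":  {"type": "text",  "description": "Unique station identifier"},
    "date":     {"type": "date",  "description": "Sample date, ISO 8601 (YYYY-MM-DD)"},
    "chla_ugl": {"type": "float", "description": "Chlorophyll-a (ug/L)",
                 "expected_range": (0, 200)},
    "sal_psu":  {"type": "float", "description": "Salinity (PSU)",
                 "expected_range": (0, 40)},
}

# A data dictionary places boundaries on expected values, e.g., a range check
def in_range(column, value):
    lo, hi = data_dictionary[column]["expected_range"]
    return lo <= value <= hi

print(in_range("chla_ugl", 8.2))   # True
print(in_range("sal_psu", 55.0))   # False
```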
DOI: Digital Object Identifier, a unique and permanent name (persistent identifier) for a dataset or other resource, used for archiving and discovery through online queries. Services like Zenodo can be linked to GitHub to create a DOI for a repository.
FAIR: Findable, Accessible, Interoperable, and Reusable, describes general guidelines for creating open data products or assessing the openness of existing products (Wilkinson et al. 2016).
Flat File: The simplest form of tabular data, often as a rectangular grid of information stored as ASCII text in a non-proprietary file format. There is no information stored in each cell other than the observational values.
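For instance, a flat file can be written and read with Python's standard csv module; each cell holds only an observed value, with no formulas or formatting. The station and values below are made up:

```python
import csv, io

# A flat file: plain rows and columns, no embedded formulas or formatting
text = io.StringIO()
writer = csv.writer(text)
writer.writerow(["station", "date", "chla_ugl"])  # header row
writer.writerow(["TB01", "2023-06-01", "8.2"])    # one observation per row

text.seek(0)
rows = list(csv.reader(text))
print(rows[1])  # ['TB01', '2023-06-01', '8.2']
```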
Keys: Identifiers that can be used to link data between tables. They are often used to identify unique rows of data, such as a station name or station name/date combination.
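As a sketch, a station name/date combination can act as a composite key linking two tables; here it joins chlorophyll and salinity records (all values hypothetical):

```python
# Two tables keyed by (station, date); the tuple acts as a composite key
chlorophyll = {("TB01", "2023-06-01"): 8.2, ("TB02", "2023-06-01"): 5.1}
salinity    = {("TB01", "2023-06-01"): 24.3, ("TB02", "2023-06-01"): 26.0}

# Link the tables on the shared key to build combined observations
combined = {key: (chlorophyll[key], salinity[key]) for key in chlorophyll}
print(combined[("TB01", "2023-06-01")])  # (8.2, 24.3)
```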
Federated Repository: An online network of connected repositories that use similar standards to collectively store data for discovery and access. Uploading a dataset to one node of a repository will make it available through all other nodes.
Metadata: A suite of industry or disciplinary standards as well as additional internal and external documentation and other data necessary for the identification, representation, interoperability, technical management, performance, and use of data contained in an information system (Gilliland 2016). Simply put, the who, what, when, where, why, and how of data.
Model: A general term describing a theoretical representation of a real-world phenomenon. It can be as simple as a statistical linear regression model (i.e., y varies as a function of x in a linear fashion) or a more involved mechanistic model with linked equations that describe real-world processes occurring through space and time.
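For the simple case, a linear regression model (y = a + b·x) can be fit by ordinary least squares; a minimal sketch with made-up data, using only the standard library:

```python
# Ordinary least-squares fit of y = a + b*x to made-up data
x = [1.0, 2.0, 3.0, 4.0]
y = [2.1, 3.9, 6.1, 8.0]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# slope b = covariance(x, y) / variance(x); intercept a = mean_y - b * mean_x
b = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / \
    sum((xi - mean_x) ** 2 for xi in x)
a = mean_y - b * mean_x

print(round(b, 2), round(a, 2))  # 1.99 0.05
```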
Open Science: A philosophy and set of tools to make research reproducible and transparent, in addition to having long-term value through effective data preservation and sharing (Beck et al. 2020).
Open Source: Software with code that is freely available under a license that typically grants users the rights to modify or distribute to others for any purpose. The R statistical programming language is one example of open source software used in the environmental sciences.
Provenance: The history of a dataset, including its origin, purpose, and metadata. Formally, this can include the records of inputs, software, and steps of analysis used to create a dataset. The intent is to establish context and also allow reproducibility.
Synthesis: The collection and combination of datasets from different sources, often for research or to inform decision-making. The synthesis product may be considered a novel dataset in itself if the steps in its creation produce novel information not available from its source data.
Tidy Data: A set of simple rules for storing tabular data, including each variable in its own column, each observation in its own row, and each value in its own cell (Wickham 2014).
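For example, a "wide" table with one column per station violates the one-observation-per-row rule; a sketch of reshaping it to tidy (long) form in plain Python, with hypothetical station names and values:

```python
# Untidy ("wide") layout: one column per station, so each row mixes observations
wide = [
    {"date": "2023-06-01", "TB01": 8.2, "TB02": 5.1},
    {"date": "2023-06-02", "TB01": 7.9, "TB02": 4.8},
]

# Tidy layout: each variable its own column, each observation its own row
tidy = [
    {"date": row["date"], "station": station, "chla_ugl": row[station]}
    for row in wide
    for station in ("TB01", "TB02")
]
print(tidy[0])  # {'date': '2023-06-01', 'station': 'TB01', 'chla_ugl': 8.2}
```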
Version Control: A formal software or code development system that tracks and documents changes to create a record that describes the development history and that can be accessed at any time so that previous changes are never lost. Git is version control software, as compared to GitHub which is an online platform for hosting projects using Git.
The following includes standard language and links to SOP sections that can be used in competitive requests for proposals (RFPs) issued by the TBEP to support use of open science and best practices for data management as described in this SOP. Partners can use this language to encourage adoption of the descriptions and approaches in this document and as competitive criteria for assessing proposal qualifications.
Respondents should demonstrate adequate hardware and software capabilities and personnel trained to receive, store, disseminate, print, summarize, and transmit open standard GIS and statistical package files. All final datasets, statistical assessments, data reduction methods and syntheses should be delivered to the TBEP in open standard formats using open science frameworks, such as those developed through the R Markdown language, the R or RStudio statistical analysis platforms, Python scripts, or the QGIS environment. All data products and spatial coverages will include appropriate Federal Geographic Data Committee (FGDC) standards, Ecological Metadata Language (EML) compliant metadata, or other descriptive information appropriate for metadata (3.4).
Respondents must demonstrate an understanding of tidy data formats as applied to tabular data and an ability to deliver data products in non-proprietary file formats (e.g., csv flat files, shapefiles, or KML). Use of online sharing platforms that enable data products to be findable, accessible, interoperable, and reusable (3.2) is strongly encouraged. These can include formal data repositories that employ standardized metadata schema (e.g., DataOne). Small datasets (<100 Mb) can also be hosted on the TBEP GitHub website (https://github.com/tbep-tech) by contacting staff for access. View the guidance here for other acceptable alternatives.
The following includes some standard language and links to SOP sections that can be used in contracts for deliverables for data products. The text here describes general expectations of how data products should be delivered to TBEP in as open a format as possible following concepts and descriptions of approaches in this document. The intent is to contractually obligate our partners to adopt best practices for data management following an open science ethos.
All data deliverables are to be licensed under permissive terms that enable reuse in the public domain. This could include the MIT or CC0 licenses, provided as stand-alone text files included with the data deliverables. Exceptions to the permissive reuse terms in these licenses will be handled on a case-by-case basis, e.g., if data are sensitive or include information on human subjects. Additional guidance on choosing a license can be found here.
An approach for data management should be identified as soon as possible after project inception (4.2). This will include identifying appropriate data deliverables at the beginning of the project using guidance in the SOP (3.1). Plans for documenting and delivering identified data products to the TBEP should also be identified, which primarily includes appropriate data formats (3.3), metadata documentation (3.4), and where data are hosted (3.5).
All tabular data are to be provided in a tidy format to the extent possible. This includes following simple rules for delivering data as 1) each variable in its own column, 2) one observation per row, and 3) one value per cell (3.3). Tidy data must be delivered in a non-proprietary or open-source compatible format, such as .csv flat files. Datasets using .xlsx file formats or that require proprietary software to access (e.g., SAS) will not be accepted unless project managers can provide justification for not using open formats.
Geospatial data must conform to Federal Geographic Data Committee (FGDC) standards and other open source formats as compatible with the Geospatial Data Abstraction Library (GDAL) for spatial analysis (this includes most ESRI data products). Horizontal and vertical datums must be explicitly specified in any metadata, as well as any information describing attribute data accompanying spatial information. Appropriate datums and projections should limit spatial distortion within the region of interest (e.g., NAD83 / Florida West (ftUS) for Florida).
Source code used to generate a data product or analyze/evaluate a data product can be supplied as a text file to allow reproducibility of information and conclusions relevant for the project. To the extent possible, this will include identification of data inputs required for the analysis software, dependent software packages (and versions) used for analysis, and generated output from the source code. This information can be identified with inline comments in the source code. Use of version control software as hosted on GitHub is strongly encouraged for the delivery of source code. Project managers can work with TBEP staff to create a shared repository at https://github.com/tbep-tech.
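As a hedged sketch, inline comments at the top of a delivery script can identify the data inputs, software dependencies (with versions), and generated outputs; the file names and processing step below are hypothetical:

```python
# process_wq.py -- hypothetical example of a documented delivery script
#
# Data inputs:  data/raw/wq_2023.csv (tidy flat file of water-quality samples)
# Dependencies: Python >= 3.8 (standard library only: statistics, collections)
# Output:       data/processed/wq_2023_means.csv (mean chlorophyll-a per station)

import statistics
from collections import defaultdict

def station_means(rows):
    """Mean chlorophyll-a per station from tidy (station, chla_ugl) records."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["station"]].append(float(row["chla_ugl"]))
    return {station: statistics.mean(vals) for station, vals in groups.items()}

# Tiny made-up input illustrating the expected tidy structure
rows = [{"station": "TB01", "chla_ugl": "8.0"},
        {"station": "TB01", "chla_ugl": "6.0"}]
print(station_means(rows))  # {'TB01': 7.0}
```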
All data products are to be accompanied with appropriate metadata. Formal metadata schema can be used following the EML standard (preferred, 3.4.2) or as more general text files (acceptable but not preferred). A workflow for metadata generation as described in the SOP can be used (4.2.4). Plain text descriptions for metadata can include data dictionaries to aid in the interpretation of data deliverables (3.4.3).
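As a hedged sketch, a bare-bones EML record might look like the following; all element values are hypothetical, and real records require additional elements (consult the EML schema for the full list of required fields):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical minimal EML record; not a complete, valid document -->
<eml:eml xmlns:eml="https://eml.ecoinformatics.org/eml-2.2.0"
         packageId="tbep.1.1" system="https://github.com/tbep-tech">
  <dataset>
    <title>Example water-quality dataset</title>
    <creator>
      <organizationName>Tampa Bay Estuary Program</organizationName>
    </creator>
    <pubDate>2023</pubDate>
    <abstract>
      <para>Hypothetical record illustrating the basic structure of EML.</para>
    </abstract>
  </dataset>
</eml:eml>
```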
Data can be delivered through online sharing platforms that enable products to be findable, accessible, interoperable, and reusable (3.2). These can include formal data repositories that employ standardized metadata schema (e.g., DataOne). Small datasets (<100 Mb) can also be hosted on the TBEP GitHub website (https://github.com/tbep-tech) by contacting staff for access. View the guidance here for other acceptable alternatives. All data products should have an associated DOI as a globally unique and persistent identifier (through https://zenodo.org/). An appropriate license should also be attached to the data.