Digital Data 2024: Workshop Report

Data Cleaning for Maximum Impact: Tools and Workflows for Data Providers to Efficiently Find and Fix Data Quality Issues

Workshop Organizing Committee: Arctos (Teresa Mayfield-Meyer), iDigBio (Cat Chapman), NEON (Chandra Earl), Specify (Grant Fitzsimmons), Symbiota (Katie Pearson, Lindsay Walker), and TaxonWorks (Debbie Paul)

Katie Pearson guided workshop participants through strategies for cleaning their specimen data managed in Symbiota portals.

The Symbiota Support Hub led the coordination of one workshop at iDigBio’s 8th annual Digital Data in Biodiversity Research Conference held in Lawrence, KS, May 29-31, 2024. This 3-hour event took place on the third day of the meeting (Friday, May 31, 2024) and involved in-person and online participation, including a cohort of Biodiversity Data Capacity Fellows. The goal of this workshop was to equip participants with tools and workflows that they can use to easily find and fix common data quality issues. Participants gained hands-on experience with tools in GBIF, collections management systems, and other platforms, with the ultimate goal of improving data quality for publicly shared datasets. The Biodiversity Data Capacity Fellow grants broadened participation in the conference, increasing the capacity of a larger portion of the biodiversity data-providing community.

Summary & Rationale

While digitization of biocollections continues across the globe, many collections have already mobilized large volumes of biodiversity specimen data into the open data ecosystem. As with all datasets, biocollections data often suffer from data quality issues that arise from the complexities of the digitization and mobilization processes, and these issues can prevent the data from reaching their full potential for research and educational impact. Often, the people best positioned to clean these data at the source (e.g., collections and data managers) do not have the capacity to do so efficiently.

This workshop aimed to improve data providers’ capacity by providing training using a toolbox of data quality checks and fixes that can be easily deployed. Data mobilization and data quality experts, including representatives of iDigBio, GBIF, Specify, the Symbiota Support Hub, TaxonWorks, and Arctos, collaboratively created a series of “tool kits” and provided hands-on training for identifying and fixing common data quality issues using different tools (e.g., collection management systems). The workshop was conducted in a hybrid format: some organizers participated in person while others managed the online audience, and questions from both audiences were answered centrally.

Having the right participants at the workshop was essential to meeting the goal of increasing community capacity. However, many potential participants (such as students, collections managers, data managers, and digitization technicians) lack sufficient travel funding to attend conferences in person. For this reason, the Organizing Committee coordinated iDigBio’s sponsorship of travel to the workshop and the Digital Data Conference for 20 participants by disbursing $1,000 “Biodiversity Data Capacity Fellow” travel grants (total budget: $20,000). Workshop participation was only one of many benefits for these participants, who were expected to engage in other Digital Data activities and many of whom presented in other sessions at the conference.

These travel grants were advertised via relevant listservs, email lists, and other outlets. Selection for travel funding was based on (1) a statement of interest in the workshop and conference (including the applicant’s role at their institution), (2) a statement of existing financial support, and (3) whether the applicant had previously been funded by iDigBio to participate in an event. Preference was given to applicants with high interest, relevant positions and institutions (i.e., collection manager, data manager, technician, digitization student), low existing financial support, and little or no previous funding by iDigBio. Grant recipients were required to submit a short report summarizing their participation after the conference to encourage them to reflect on what they learned.

Workshop Agenda

Morning workshop agenda

Time | Workshop Activity
10:35 AM | Welcome, introduction, and orientation to the workshop
10:40 AM | A Data Quality Checklist: resources for finding and fixing data quality errors. [Wiki]
11:00 AM | Data quality from the aggregator perspective (data quality flags, existing tools): Presentations from iDigBio (Cat Chapman) and GBIF (John Waller)
11:20 AM | Explore your data (or another dataset) in GBIF and iDigBio; Q&A + Discussion
11:45 AM | Lunch Break & Networking

Afternoon workshop agenda

Time | Breakouts I (Room D) | Breakouts II (Room C)
12:45 PM | Finding and fixing data quality issues in a Symbiota instance (Katie Pearson) | Exploring Data Management and Curation in TaxonWorks (Debbie Paul & Tommy McElrath)
1:30 PM | Finding and fixing data quality issues in your Specify instance (Grant Fitzsimmons) | Arctos Data Quality Tools (Teresa Mayfield-Meyer)
2:15 PM | Wrap-up: What have we learned?

Materials Produced

Data quality “tool kits” were produced for the following platforms:

Platform | Resource
Arctos | https://www.idigbio.org/wiki/index.php/Arctos_Data_Quality_Toolkit
Excel | https://www.idigbio.org/wiki/index.php/Excel_Data_Quality_Toolkit
iDigBio Wiki | https://www.idigbio.org/wiki/index.php/Data_Quality_Toolkit_2024
Specify | https://www.idigbio.org/wiki/index.php/Specify_Data_Quality_Toolkit
Symbiota | https://biokic.github.io/symbiota-docs/editor/quality
TaxonWorks | https://docs.taxonworks.org/guide/data-quality.html
Data Cleaning Workshop (Morning Session)
Data Cleaning Workshop (Afternoon Session, Breakouts I/Room D)

Participants Engaged

The cohort of 20 Biodiversity Data Capacity Fellows was selected by the workshop’s organizing committee to represent individuals from a diversity of collections (including herbaria, invertebrate and vertebrate collections, and even diatoms) who had little to no prior funded engagement with the iDigBio community through attendance at workshops or TCN/PEN participation. iDigBio, the Symbiota Support Hub, and the workshop’s organizing committee sincerely thank the Fellows for their participation in the workshop and conference.

Digital Data 2024 Biodiversity Data Capacity Fellows:
Fellow | Affiliation
Aidan Houlihan | Academy of Natural Sciences at Drexel University
Audrey Kurz | Bishop Museum
Caitlin Bloomer | University of Illinois
Chelsea Smith | Academy of Natural Sciences at Drexel University
Christiana Mojica | San Diego Museum of Natural History
Csilla Czakó | South Carolina Heritage Trust
David Kunkel | Oklahoma State University Herbarium
Harpo Faust | University of New Mexico
Isabelle Hudson | University of Connecticut
Jeremy Cowan | Cheadle Center for Biodiversity and Ecological Restoration, UC Santa Barbara
Joseph Mohan | University of California, Irvine
Kelcie Brown | New York Botanical Garden
Lindsey Worcester | Morton Arboretum
Litsa Wooten | Museum of Southwestern Biology
Matthew Sheik | Denver Botanic Gardens
Michael Thomas | National Herbarium of Rwanda, National Herbarium of Bhutan (U.S.-based)
Sarah De Groot | Kansas State University
Stelios Chatzimanolis | University of Tennessee, Chattanooga
Tommy McElrath | Illinois Natural History Survey
Valerie Warhol | Carnegie Museum of Natural History

Participant Feedback

As a follow-up exercise to meeting attendance, each Biodiversity Data Capacity Fellow submitted a reflective summary of their experience at the conference to iDigBio and the Organizing Committee. Overall, these reports were very positive and included constructive feedback for consideration when planning future in-person events. Representative reflections on the Data Cleaning Workshop included:

“I found the toolkit and associated resources very helpful since the data at our museum lives in so many formats (the entomology collection is in Specify, malacology in Symbiota, and Vert Zoo in Excel to name a few). The toolkit is well organized and very detailed with plenty of links out, so I know I will find myself referencing it plenty as I work through our data.”

“Overall, this conference gave me a deeper appreciation for collections and made me think of how I can be a better steward of the data that I am using, so that it is useable by more than just myself.”

“I really enjoyed the data cleanup workshop for Symbiota, I was able to clean/correct close to 1000 records while attending the workshop and the hour after it.”

“The Data Cleaning workshop was my main priority for visiting this conference, and it did not disappoint. I left this workshop with countless tools and resources to improve the data cleaning process for both my herbarium’s digitizing process and any future data work I do… It was an honor to be chosen for the Digital Data Capacity Fellowship, which made it possible for me to attend in the first place.”

“The data cleaning workshop was enlightening, revealing a great variety of data management systems and methodologies for data cleanup far beyond my expectations. This variety, although initially overwhelming, is reassuring as it provides multiple approaches to data cleaning and management. The point you made—that our goal is not to perfect our data but to fix issues that limit discoverability—was especially helpful, as it put into perspective what is really important when it comes to managing data.”

Acknowledgements 

The Symbiota Support Hub extends its sincere appreciation to the workshop’s Organizing Committee, the Biodiversity Data Capacity Fellows, and the workshop’s presenters and facilitators (in-person, hybrid, and remote!) for their participation in this workshop, and to iDigBio (NSF #2027654) for providing the financial and logistical support that made it possible.