Data Cleaning for Maximum Impact: Tools and Workflows for Data Providers to Efficiently Find and Fix Data Quality Issues
Workshop Organizing Committee: Arctos (Teresa Mayfield-Meyer), iDigBio (Cat Chapman), NEON (Chandra Earl), Specify (Grant Fitzsimmons), Symbiota (Katie Pearson, Lindsay Walker), and TaxonWorks (Debbie Paul)
The Symbiota Support Hub led the coordination of one workshop at iDigBio’s 8th annual Digital Data in Biodiversity Research Conference held in Lawrence, KS, May 29-31, 2024. This 3-hour event took place on the third day of the meeting (Friday, May 31, 2024) and involved in-person and online participation, including a cohort of Biodiversity Data Capacity Fellows. The goal of this workshop was to equip participants with tools and workflows that they can use to easily find and fix common data quality issues. Participants gained hands-on experience with tools in GBIF, collections management systems, and other platforms with the ultimate goal of improving data quality for publicly shared datasets. The Biodiversity Data Capacity Fellow grants broadened participation in the conference, increasing the capacity of a larger portion of the biodiversity data-providing community.
Summary & Rationale
While digitization of biocollections continues across the globe, many collections have already mobilized large volumes of biodiversity specimen data into the open data ecosystem. As with all datasets, biocollections data often suffer from data quality issues that arise from the complexities of the digitization and mobilization processes, and these data quality issues can prevent the data from reaching their full potential for research and educational impact. Often, the people best positioned to clean these data at the source (e.g., collections and data managers) do not have the capacity to do so efficiently.
This workshop aimed to improve data providers’ capacity by providing training on a toolbox of easily deployed data quality checks and fixes. Data mobilization and data quality experts, including representatives of iDigBio, GBIF, Specify, the Symbiota Support Hub, TaxonWorks, and Arctos, collaboratively created a series of “tool kits” and provided hands-on training for identifying and fixing common data quality issues using different tools (e.g., collection management systems). The workshop was conducted in a hybrid format: some organizers participated in person while others managed the online audience, and questions from both audiences were answered centrally.
Having the right participants at the workshop was essential to meet the goal of increasing community capacity. However, many potential participants, such as students, collections managers, data managers, and digitization technicians, lack sufficient travel funding to attend conferences in person. For this reason, the Organizing Committee coordinated iDigBio’s sponsorship for the travel of 20 participants to the workshop and the Digital Data Conference by dispersing $1,000 “Biodiversity Data Capacity Fellow” grants (total budget: $20,000) to be used for travel. Workshop participation was only one of many benefits for these participants, who were expected to engage in other Digital Data activities and many of whom presented in other sessions at the conference.
These travel grants were advertised via relevant listservs, email lists, and other outlets (example). Selection for travel funding was based on (1) a statement of interest in the workshop and conference (including the applicant’s role at their institution), (2) a statement of existing financial support, and (3) whether the applicant had previously been funded by iDigBio to participate in an event. Preference was given to applicants with high interest, relevant positions and institutions (e.g., collection manager, data manager, technician, digitization student), low existing financial support, and little or no previous funding by iDigBio. Grant recipients were required to submit a short report summarizing their participation after the conference to encourage them to reflect on what they learned.
Workshop Agenda
Time | Workshop Activity | |
10:35 AM | Welcome, introduction, and orientation to the workshop | |
10:40 AM | A Data Quality Checklist: resources for finding and fixing data quality errors. [Wiki] | |
11:00 AM | Data quality from the aggregator perspective (data quality flags, existing tools): Presentations from iDigBio (Cat Chapman) and GBIF (John Waller) | |
11:20 AM | Explore your data (or another dataset) in GBIF and iDigBio; Q&A + Discussion | |
11:45 AM | Lunch Break & Networking |
Breakouts I (Room D) | Breakouts II (Room C) | |
12:45 PM | Finding and fixing data quality issues in a Symbiota instance (Katie Pearson) | Exploring Data Management and Curation in TaxonWorks (Debbie Paul & Tommy McElrath) |
1:30 PM | Finding and fixing data quality issues in your Specify instance (Grant Fitzsimmons) | Arctos Data Quality Tools (Teresa Mayfield-Meyer) |
2:15 PM | Wrap-up: What have we learned? |
Materials Produced
Data quality “tool kits” were produced for the following platforms:
Participants Engaged
The workshop’s organizing committee selected the cohort of 20 Biodiversity Data Capacity Fellows to represent individuals from a diversity of collections (including herbaria, invertebrate and vertebrate collections, and even diatoms) who had little to no prior funded engagement with the iDigBio community through attendance at workshops or TCN/PEN participation. iDigBio, the Symbiota Support Hub, and the workshop’s organizing committee sincerely thank the Fellows for their participation in the workshop and conference.
Digital Data 2024 Biodiversity Data Capacity Fellows:
Fellow | Affiliation |
Aidan Houlihan | Academy of Natural Sciences at Drexel University |
Audrey Kurz | Bishop Museum |
Caitlin Bloomer | University of Illinois |
Chelsea Smith | Academy of Natural Sciences at Drexel University |
Christiana Mojica | San Diego Museum of Natural History |
Csilla Czakó | South Carolina Heritage Trust |
David Kunkel | Oklahoma State University Herbarium |
Harpo Faust | University of New Mexico |
Isabelle Hudson | University of Connecticut |
Jeremy Cowan | Cheadle Center for Biodiversity and Ecological Restoration, UC Santa Barbara |
Joseph Mohan | University of California, Irvine |
Kelcie Brown | New York Botanical Garden |
Lindsey Worcester | Morton Arboretum |
Litsa Wooten | Museum of Southwestern Biology |
Matthew Sheik | Denver Botanic Gardens |
Michael Thomas | National Herbarium of Rwanda, National Herbarium of Bhutan (U.S.-based) |
Sarah De Groot | Kansas State University |
Stelios Chatzimanolis | University of Tennessee, Chattanooga |
Tommy McElrath | Illinois Natural History Survey |
Valerie Warhol | Carnegie Museum of Natural History |
Participant Feedback
As a follow-up exercise to meeting attendance, each Biodiversity Data Capacity Fellow submitted a reflective summary of their experience at the conference to iDigBio and the Organizing Committee. Overall, these reports were very positive and included constructive feedback for consideration when planning future in-person events. Representative reflections on the Data Cleaning Workshop included:
“I found the toolkit and associated resources very helpful since the data at our museum lives in so many formats (the entomology collection is in Specify, malacology in Symbiota, and Vert Zoo in Excel to name a few). The toolkit is well organized and very detailed with plenty of links out, so I know I will find myself referencing it plenty as I work through our data.”
“Overall, this conference gave me a deeper appreciation for collections and made me think of how I can be a better steward of the data that I am using, so that it is useable by more than just myself.”
“I really enjoyed the data cleanup workshop for Symbiota, I was able to clean/correct close to 1000 records while attending the workshop and the hour after it.”
“The Data Cleaning workshop was my main priority for visiting this conference, and it did not disappoint. I left this workshop with countless tools and resources to improve the data cleaning process for both my herbarium’s digitizing process and any future data work I do… It was an honor to be chosen for the Digital Data Capacity Fellowship, which made it possible for me to attend in the first place.”
“The data cleaning workshop was enlightening, revealing a great variety of data management systems and methodologies for data cleanup far beyond my expectations. This variety, although initially overwhelming, is reassuring as it provides multiple approaches to data cleaning and management. The point you made—that our goal is not to perfect our data but to fix issues that limit discoverability—was especially helpful, as it put into perspective what is really important when it comes to managing data.”
Acknowledgements
The Symbiota Support Hub extends its sincere appreciation to the workshop’s Organizing Committee, the Biodiversity Data Capacity Fellows, and the workshop’s presenters and facilitators (in-person, hybrid, and remote!) for their participation in this workshop, and to iDigBio (NSF #2027654) for providing the financial and logistical support that made it possible.