Addressing Roadblocks and Envisioning Solutions of the Digital Extended Specimen Concept
On day 2 of the Digital Data Conference (June 6, 2023), Symbiota community lead Dr. Jenn Yost and collaborators Jorrit Poelen, Lindsay Walker, and Katie Pearson led over 140 in-person and online participants in a hands-on workshop about the Digital Extended Specimen (DES). The DES concept has been the topic of dynamic discussions since 2020, primarily centered on the question: how do we facilitate a highly connected, collaborative, and searchable web of data and metadata regarding biodiversity specimens?
In the workshop, invited speakers from several key projects presented their varied frameworks and approaches to tackling this question: Larry Lannom of the Corporation for National Research Initiatives, Tim Robertson of the Global Biodiversity Information Facility (GBIF), Jorrit Poelen of multiple projects including Global Biotic Interactions, and Sharif Islam of the Distributed System of Scientific Collections. Following these introductions, we jumped into the hands-on portion. Tables of participants were given example specimens and asked to extend them: to create connections to other sources or types of data that they, as scientists, would find valuable, and to present those connections in the form of a bubble map (Figure 1). Participants were encouraged to document data sources, challenges, and potential solutions in a shared idea board (https://bit.ly/digital-data-des) and to post a picture of their bubble map in a shared Google Slides document (http://bit.ly/ES-photo-share).
The specimen extensions created across the groups mirrored the diversity of participants’ backgrounds. Many groups linked specimens to sequence data, noting that this process is rather difficult given current infrastructure and lacks robust linkages from the sequence data sources back to their originating samples. Others linked the institutions and people affiliated with specimens to existing resources such as GRSciColl and Bionomia, and some associated occurrences could be identified using GBIF clustering algorithms. Nearly all groups linked their specimen to climatic or environmental data. Images, traits, measurements, collecting protocols, and environmental DNA were also aspirational linkages. It was noted, however, that many of these connections are not currently apparent or automated (Figure 2).
In the next phase, participants were asked to focus on just one connection in their bubble map and document this linkage in a spreadsheet (https://bit.ly/digital-data-des-worksheet). The format of the linkage was relatively simple: one column for the unique identifier of the specimen, one column for the unique identifier or URL of the associated data source, and one column with a term that links these two columns (e.g., “isCitedIn”). This exercise quickly revealed the amount of work necessary to link specimens to all their potential connections. Several groups were able to find links to tissue samples, species distribution maps, and sequences, yet many other connections could not be robustly made because data types like field notes and citations in publications were unavailable or not searchable.
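As a rough illustration of the three-column linkage format described above, the following Python sketch builds a small link table and writes it out as CSV. It is only a sketch under stated assumptions: the column names (specimen_id, related_id, relationship), the identifiers, the example.org URLs, and the "hasSequence" term are hypothetical placeholders, not values from the workshop worksheet. Only "isCitedIn" comes from the exercise itself.

```python
import csv
import io
from typing import NamedTuple


class Link(NamedTuple):
    """One row of the linkage table (column names are assumptions)."""
    specimen_id: str   # unique identifier of the specimen
    related_id: str    # unique identifier or URL of the associated data source
    relationship: str  # term linking the two columns, e.g. "isCitedIn"


# Hypothetical placeholder rows; none of these identifiers are real.
links = [
    Link("urn:example:specimen:0001", "https://example.org/publication/123", "isCitedIn"),
    Link("urn:example:specimen:0001", "https://example.org/sequence/ABC", "hasSequence"),
    Link("urn:example:specimen:0002", "https://example.org/publication/456", "isCitedIn"),
]


def links_to_csv(rows):
    """Serialize the link rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(Link._fields)
    writer.writerows(rows)
    return buffer.getvalue()


def links_for(rows, specimen_id):
    """Return every link recorded for a single specimen."""
    return [row for row in rows if row.specimen_id == specimen_id]


csv_text = links_to_csv(links)
print(csv_text)
```

Keeping each link as its own row means a specimen with many connections simply appears on many rows, so new kinds of linkage can be added without changing the table's shape.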
Lastly, participants were asked to envision what a search interface for accessing extended specimen data would look like. A central result of this discussion was that users wanted a simple search interface that lets them specify the types of data they are searching on (including, potentially, defining spatial variables by drawing polygons on a map) and the connections they want to focus on. They also wanted to be able to visualize the results in a knowledge graph and download specific results in a common file format (e.g., CSV). When asked whether they would expect to see this search interface within a web browser or a standalone application, nearly all participants preferred a web browser, though it was noted that users without a stable internet connection are disadvantaged or excluded by such an approach.
The session ended with a Q&A and discussion with presenters and additional invited panelists: Jutta Buschboom, representing the International Partners Group for the DES; Conrad Schoch of the U.S. National Center for Biotechnology Information; and Laura Russell of GBIF. Key topics included whether the DES is a central repository of data or a central repository of connections between data sources, as described in recent literature (e.g., Hardisty et al. 2022, https://doi.org/10.1093/biosci/biac060); whether extended specimen data can be expected to be re-integrated into collection management systems (CMSs) at the data source; and who is responsible for minting and maintaining the persistent identifiers that are critically necessary to create robust linkages between data sources. The question of “who” remains key in every aspect of this discussion and in future work arising from this session. Conversations between people from multiple stakeholder communities, support for the data architects creating tools and connections, and constant feedback will all be critical to the future of the DES.
The original version of this content can be found on iDigBio.org.