Emergence of Symbiota and Current Needs

Symbiota Overview: Biodiversity informatics is a rapidly developing field that uses large amounts of data on the current and historical distribution of species to address important research questions such as the impact of climate change on biological diversity. Symbiota [28] has been a key player in providing the necessary data infrastructure. It is built on the premise that a collaborative partnership of biodiversity informaticians, collection managers, and biodiversity research communities will be most effective in creating high-quality biodiversity research resources with publicly useful portals [27]. Symbiota has two fundamental and overlapping functions: (1) it serves as a data management system for entering, annotating, and cleaning biodiversity occurrence data and associated specimen data (e.g., genetic sequences, images, publications) (Figure 1, left panel); (2) it acts as a primary aggregator/publisher (Figure 1, center panel) for data providers, primarily museums and herbaria, regardless of the software they use to enter and annotate data. Symbiota then provides these data to global aggregators and the public through Darwin Core (DwC) archives.

Symbiota Emerges as a Key Platform for Management of Biological Specimen Data: Symbiota emerged in the late 1990s primarily as an online collections management system to meet the needs of herbaria. With the onset of NSF’s Advancing Digitization of Biodiversity Collections (ADBC) program in 2011, the use of Symbiota as an online biodiversity data management platform expanded significantly. It is now used by 766 collections (412 live, 354 snapshot; see description below) to share data via 40 portals structured by taxa, regions, or institutional holdings. Symbiota is a core component for 13 Thematic Collection Networks and 8 Partners to Existing Networks (PEN) projects funded by NSF’s ADBC program [65], which together comprise 74 percent of NSF-ADBC projects. This has led to an exponential increase in Symbiota data since 2011: Symbiota portals have grown to offer over 37 million occurrence records (11 million from live collections, 26 million from snapshot collections), encompassing diverse taxonomic groups such as vascular plants, lichens, bryophytes, algae, fungi, invertebrates, and vertebrates. Symbiota data, or the software itself, have been cited in 1,993 publications (Google Scholar, 24 August 2017). Fourteen of the largest Symbiota portals collectively averaged 1,601 sessions per day (Google Analytics, 1 October 2015 to 30 September 2016).

Portal Organization and Symbiota Workflow: A Symbiota portal is structured around data collections, usually grouped by data provider, with each collection maintaining its own dataset. Data may be generated directly in the Symbiota portal (Figure 1, left panel) or integrated from other sources (Figure 1, center panel). Symbiota portals then utilize a suite of tools to annotate data (e.g., georeferencing, computer-aided identification), package datasets (e.g., checklists), and visualize data for research, conservation, and education (e.g., a spatial module). We are constantly adding functionality to Symbiota in response to user requests via the Symbiota GitHub repository [62]. Data and images are publicly available via Darwin Core (DwC) archives [21][57] with the ability to link to the Global Biodiversity Information Facility (GBIF) [72] and Integrated Digitized Biocollections (iDigBio) [2]. GBIF is the recognized global aggregator for all biodiversity information (607 million records served) and iDigBio is the National Resource for the ADBC program (106 million records served) (Figure 1, right panel). We expect that most Symbiota data will be accessed at aggregator portals; however, we know that many users, especially data providers, use data directly from Symbiota portals.
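
As a rough illustration of what a consumer of a DwC archive works with, the sketch below reads a small occurrence table and keeps only the georeferenced records. The column names are standard Darwin Core terms; the two sample rows, the identifiers, and the helper function names are invented for illustration and do not come from any Symbiota portal. A real archive is a zip file that also carries a meta.xml descriptor, which this sketch does not handle.

```python
import csv
import io

# Made-up sample rows using standard Darwin Core column names.
SAMPLE = """occurrenceID,basisOfRecord,scientificName,eventDate,decimalLatitude,decimalLongitude
urn:example:1,PreservedSpecimen,Quercus alba,1998-05-14,35.91,-79.05
urn:example:2,PreservedSpecimen,Cladonia rangiferina,2004-07-02,,
"""


def read_occurrences(text):
    """Parse delimited occurrence text into a list of dicts keyed by DwC term."""
    return list(csv.DictReader(io.StringIO(text)))


def georeferenced(records):
    """Keep records that have both a latitude and a longitude value."""
    return [
        r for r in records
        if r["decimalLatitude"] and r["decimalLongitude"]
    ]


records = read_occurrences(SAMPLE)
mapped = georeferenced(records)
print(len(records), "records,", len(mapped), "georeferenced")
```

A filter of this kind is one simple way a collection might spot gaps in its georeferencing effort before data are passed on to an aggregator.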

The power of Symbiota comes from its ability to seamlessly integrate occurrence data with images and taxonomic information pages, and provide multifunctional checklist tools used by government agencies [24], educators [8][75], and researchers [58]. Symbiota also provides tools that allow researchers to compile and screen datasets for further analysis [27][58], educators to integrate the use of biodiversity data in classroom activities, and land managers to compile data on species for management planning. Symbiota connects to global resources such as GenBank [29] and Encyclopedia of Life [73], and can be used by collections to supply data to aggregators, such as iDigBio and GBIF.

Most of the 40+ Symbiota portals focus on a taxonomic group of interest (e.g., plants, arthropods) and a geographic area, typically within North America. A Symbiota portal is very decentralized: each data provider has control over their own data, although everyone operates within shared data standards and practices. There are two primary categories of “collections” in a Symbiota portal: (1) “live collections” that enter, clean (i.e., edit), and annotate data directly in the portal via a browser, and (2) “snapshot collections” that enter, edit, and annotate data with various other software programs (Figure 1) but then provide their data to one or more Symbiota portals. All collections benefit from the shared data in terms of cleaning/annotating data, identifying gaps in their digitization effort, batch georeferencing, and quickly assessing data available for research. By offering support for a variety of data sources, Symbiota portals maximize the ability for everyone to clean and annotate their data by allowing collection managers to choose the approach that best fits the needs and abilities of their institution while maintaining (‘behind the scenes’) data standards that allow for open data sharing.

There are four general categories of people connected to Symbiota portals.

  • Core developers who provide the development and software support for all the portals.
  • Portal managers and power users who offer front-end support and added-value functions such as checklists and taxonomic tables. Data providers and end users interact mostly with portal managers. Power users are particularly important for identifying bugs and helping developers create new functionalities.
  • Data providers, the largest group of Symbiota users, are primarily undergraduate students and volunteers managed by collection curators. There are currently over 1,000 data providers for the 40+ portals. Data providers also offer critical feedback that has helped create efficient workflows for transcribing labels and providing images. Input from core developers, portal managers, power users, and data providers has led to a 50 percent reduction in digitization cost per specimen.
  • End users include researchers and educators who integrate the data and resources provided into their research and teaching, and land managers who use the data to guide restoration efforts, plan species conservation, and track invasive species.

Organizing the Symbiota Community: In 2016 the Symbiota Working Group [63] was created to develop a more formal long-term structure that would better address community needs and create sustainability pathways for Symbiota. As part of that effort, conversations with developers of other software platforms that overlap in function with Symbiota have been ongoing, focusing on the possibility for collaboration and on identifying “collective impact” activities that promote interoperability among competing software platforms. The consensus is that the programs differ enough in their base code that integrating platforms is not an option, whereas sharing modules and functionality through Application Programming Interfaces (APIs) is a fruitful avenue. For example, the Symbiota development team is collaborating with TaxonWorks [64] to add support for their taxonomic services and, in turn, is adapting the Symbiota spatial module to be compatible with TaxonWorks. As Symbiota development is enhanced through greater modularity, future collaborations are expected.

Need for Restructuring and Enhancing Symbiota: Symbiota supports a rapidly expanding user community, but its code complexity and design are impeding further development. Symbiota is a large, complex software system. It is written in PHP and JavaScript and, as of August 2017, consisted of 160,603 lines of code spanning 168 classes (typically one class per file) and 3,386 methods (functions) within the core code base, excluding external libraries. The classes are highly interrelated, as depicted in Figure 2. Each class is listed around the outside of the circle, and an edge connects each class to every other class whose methods it invokes.
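
The kind of graph shown in Figure 2 can be sketched in a few lines. The example below is a minimal illustration, not the analysis used to produce the figure: the class names and call relationships are hypothetical stand-ins rather than actual Symbiota classes. It lists the edges and counts, for each class, how many other classes depend on it (its fan-in).

```python
# Hypothetical call data: class name -> classes whose methods it invokes.
# These names are illustrative only and are not taken from the Symbiota code.
CALLS = {
    "OccurrenceForm": {"Database", "TaxonLookup", "ImageStore"},
    "ChecklistPage": {"Database", "TaxonLookup"},
    "ImageStore": {"Database"},
    "TaxonLookup": {"Database"},
    "Database": set(),
}


def edges(calls):
    """Return a sorted list of (caller, callee) pairs, one per graph edge."""
    return sorted(
        (src, dst) for src, targets in calls.items() for dst in targets
    )


def fan_in(calls):
    """Count how many classes invoke methods of each class."""
    counts = {name: 0 for name in calls}
    for targets in calls.values():
        for dst in targets:
            counts[dst] = counts.get(dst, 0) + 1
    return counts


print(len(edges(CALLS)), "edges")
print(fan_in(CALLS))
```

A class with high fan-in, such as the hypothetical Database class here, is one where a change is likely to ripple out to many callers, which is part of why a densely connected graph signals costly maintenance.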

Software metrics [31] can be used to estimate the complexity of software and the difficulty of extending, maintaining, and fixing bugs [36]. We computed the software metrics for Symbiota using PHPMetrics [51]. Symbiota scores poorly on several metrics; for example, Figure 3 depicts its cyclomatic complexity (the number of independent paths through the code). Large red circles are classes with high complexity, which impairs maintainability and code correctness. Symbiota also scores poorly in PHPMetrics’ summary evaluation along five categories relative to other PHP projects: maintainability (19 out of 100), accessibility for new developers (0 out of 100), simplicity of algorithms (0 out of 100), volume (0 out of 100), and reducing the probability of bugs (0 out of 100). Symbiota’s low scores imply that it is costly to develop, extend, and maintain in terms of programming time and effort [10]. One reason Symbiota has low scores is that its functionality is highly integrated: a single class will combine code for the user interface, database queries and updates, and other functionality. As a result, changing the user interface to create a responsive layout adapted to a mobile platform, for example, may require reprogramming hundreds of files.
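
To make the complexity metric concrete, the following sketch approximates cyclomatic complexity as one plus the number of decision points in a piece of code. It is an illustration only: it analyzes Python rather than PHP, it is not how PHPMetrics computes its scores, and the function being measured is a made-up example.

```python
import ast

# Node types that each add one decision point (a common approximation).
_BRANCH_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler, ast.IfExp)


def cyclomatic_complexity(source):
    """Approximate McCabe complexity: 1 + number of decision points."""
    complexity = 1
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, _BRANCH_NODES):
            complexity += 1
        elif isinstance(node, ast.BoolOp):
            # Each extra operand of and/or adds one short-circuit branch.
            complexity += len(node.values) - 1
    return complexity


# Made-up function used only as input to the metric.
EXAMPLE = '''
def classify(record):
    if record.get("lat") is None or record.get("lon") is None:
        return "ungeoreferenced"
    for image in record.get("images", []):
        if image.get("url"):
            return "imaged"
    return "georeferenced"
'''

print(cyclomatic_complexity(EXAMPLE))
```

Each added branch is another path that must be understood and tested, so a class that mixes interface, database, and other logic tends to accumulate a high score.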

Symbiota is typical of biodiversity and bioinformatics data management applications in that it has a long development cycle, usually lasting years. The development investment makes the projects valuable, but over time new needs arise, and users want new features [40]. The Symbiota user community has posted over ninety desired changes to date on the Symbiota Issues list on GitHub, ranging from narrow (e.g., “add a field for host organism to the database”) to expansive (e.g., “make it so that data usage statistics for every record are available”) [62]. Integrating these changes into Symbiota would be much easier with a better software structure.

In summary, Symbiota has succeeded in creating many online biodiversity collection communities and growing a dedicated community of users. But Symbiota needs to be restructured to expand the developer pool, incorporate new functionalities, and enable even more collections to share data.