A Guide to Establishing Symbiota Collections for Pakistan Herbaria
With one exception, you have either received barcodes and a barcode scanner from me or we have had other contact re digitization. Dr. Perveen, you are included because you are in charge of KUH. I hope you will jump on this opportunity to start digitizing KUH. The emphasis on “start” applies to all but the smallest herbaria. GBIF’s goal is to push natural history collections, including herbaria, to share data. Speaking from experience, as students become aware of how data can be used, they tend to record more, higher quality information and are more likely to use the resources that are becoming available.
Enough preliminary work has been done to enable you to work together in developing a truly competitive proposal. The purpose of this document is to outline what exists and how it can be built on. BUT to be truly competitive, some work is needed. A successful project should meet GBIF’s priorities:
- Digitizing and publishing georeferenced species occurrence data based on specimens held in Asian collections
- Compiling inventories of biodiversity data holdings (for example, by implementing metadata catalogues)
- Preparing data papers to improve the reusability of the mobilized biodiversity data
Highlighting existing resources and capacity will be a major strength. What follows is a summary of what exists and, in a general way, what could be done with the funding available. The budget will have to be worked out very shortly. There are basically two parts in what follows: existing resources and how they can be built on to address each of GBIF’s priorities, and a synopsis of the existing capabilities of Symbiota, the program that runs both OpenHerbarium and OpenZooMuseum.
Building on existing resources
Digitizing and publishing georeferenced species occurrence data based on specimens held in Asian collections
Specimen data: Because the Flora of Pakistan (FoPK) cites specimens and inmates here have captured those data, OpenHerbarium contains lots of records for Pakistani herbaria: KUH 8825; RAW 9702; PPFI & PPFI-B 506; ISL 295. To find them, go to OpenHerbarium (http://OPenHerbarium.org) and then:
- Click Search collections. This will bring up a list of all the collections.
- Click on THE NAME Flora of Pakistan. It is near the bottom of the list, with a bright yellow icon. This will bring up a page describing the “collection”. You should each see a pencil in the top right corner. If you do not, send me an email to let me know.
- Click the pencil. This brings up the page that lets you do things in the collection (if you are an administrator).
- In the top box, click “Edit existing occurrence records”. This will bring up a search box like the one below.
- I filled out the above to find all records from KUH by selecting “Other Catalog Numbers” for “Custom field 1” and “contains” and “KUH” for the criterion. I could also have set the Sort by: to “scientific name”. That will be useful for what I am proposing.
The records are georeferenced – but only to the grids. GBIF wants better resolution than that. Certainly, users do. First thing is to get the existing records, and others, in under the appropriate collection. To do this:
- Someone in the appropriate herbarium pulls the specimens (to find them, follow the steps above for the herbarium but add sorting by scientific name or family – unfortunately the system only allows one sort level)
- Add a preprinted barcode to them (I am prepared to order some, plus a barcode scanner, for KUH and either PUP or PPFI).
- Follow steps 1-5 above. In the top box of the “things you can do page”, click “Add a new record”
- Click “Auto check” dupes
- Scan in the barcode, then enter the collector’s name, number, and collection date. This should tell you there is a duplicate record in the system and allow you to view it. It may show you several, particularly if there was no collection number. The inmates entered names as last name first, followed by other names, and may have misspelled names. The check only uses last names. For us it is a pain if that is “Jones”.
- Once you can see the duplicate record, copy over data into the empty fields. If the duplicate record is not found, enter the data and save it. There is no point refiling it without doing that.
- If the specimen has since been annotated, it is best to add the current name on the front page and then add previous determinations using the “Determination History” tab.
- Then save the record. Repeat steps 5-7 until all records entered from the Flora of Pakistan have been dealt with.
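To make the duplicate check in the steps above more concrete, here is a minimal sketch of the matching logic as described: compare last names only, require the same collection date, and let a missing collection number match anything. This is an illustration, not Symbiota's actual code. The `Record` class, the field names, and the assumption that collector names are stored as "Last, Other names" are all made up for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Record:
    collector: str          # assumed to be stored as "Last, Other names"
    number: Optional[str]   # collector number; None if the label had none
    date: str               # collection date as text, e.g. "1965-04-12"


def last_name(collector: str) -> str:
    """Return the lower-cased last name from a 'Last, Other names' string."""
    return collector.split(",")[0].strip().lower()


def find_duplicates(new: Record, existing: List[Record]) -> List[Record]:
    """Return existing records that could be duplicates of the new one."""
    matches = []
    for rec in existing:
        if last_name(rec.collector) != last_name(new.collector):
            continue
        if rec.date != new.date:
            continue
        # A missing number on either side cannot rule a record out,
        # which is why several candidates may be shown.
        if new.number and rec.number and rec.number != new.number:
            continue
        matches.append(rec)
    return matches
```

Because only the last name is compared, a common name such as “Jones” with no collection number will return every Jones record for that date, which is the nuisance mentioned above.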
For those of you with no records that have been cited in Flora of Pakistan (SWAT, QUETTA, BGH, PUP), please put some records up. THEY MUST HAVE CATALOG NUMBERS. There are help documents available at Symbiota.org.
A major use of funding will, we hope, be to pay for entering additional records. For these a common emphasis should be adopted. How about focusing on families with lots of edible species, to address a major concern in some UN listing: food security? Plus perhaps the dominant species in major plant associations in Pakistan? For those with substantial holdings from Balochistan, it might be good to give them priority because it seems data poor (see attached poster).
I shall write to the most frequently cited non-Pakistani herbaria to ask permission to display their records in OpenHerbarium. The advantage of connecting directly to the herbaria rather than through GBIF is that any annotations (e.g., georeferencing, correction of a spelling) can be sent directly to the relevant collection manager. Once the connection is established, they do not have to do anything – and the connection is easy to establish – but it is part of why I shall be asking for all the IT money. I am going to ask that directions for accomplishing this be prepared so that we do not have to ask an IT person for assistance except when essential.
That gets records in. Georeferencing is another priority: We need to have a session on georeferencing at the Pakistan Botanical Meetings. It means adding lat, long, uncertainty, and datum to a record, not just lat and long. There is a very useful tool in OpenHerbarium for georeferencing: GEOLocate. Even better, it enables batch georeferencing, that is, looking for all records that have, for example, Ziarat in their locality information, sorting them alphabetically, selecting those that are the same, then using GEOLocate to georeference them all at once. I suggest that we divide the task up so that people can focus on the region of the country they know best. One reason the inmates did not attempt georeferencing is that they do not know Pakistan (they did become interested in it and Shafqat Farooq visited the jail to talk with them about it when she visited me last year). Another reason was that they do not have internet access.
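The batch idea can be sketched in a few lines: find the records whose locality mentions a keyword, group those with the same locality text, and give every record in a group the same four values. This is only an illustration of the workflow, not how the OpenHerbarium tool is implemented. The record layout and function names are made up. The four field names are the standard Darwin Core terms, and the coordinates in the example are placeholders, not a real georeference.

```python
from collections import defaultdict


def group_by_locality(records, keyword):
    """Group records whose locality mentions `keyword` by normalized locality text."""
    groups = defaultdict(list)
    for rec in records:
        locality = rec["locality"]
        if keyword.lower() in locality.lower():
            # Ignore differences in case and spacing when comparing localities.
            key = " ".join(locality.lower().split())
            groups[key].append(rec)
    # Sort alphabetically so similar localities sit next to each other.
    return dict(sorted(groups.items()))


def apply_georeference(group, lat, lon, uncertainty_m, datum="WGS84"):
    """Write the same lat, long, uncertainty, and datum to every record in a group."""
    for rec in group:
        rec["decimalLatitude"] = lat
        rec["decimalLongitude"] = lon
        rec["coordinateUncertaintyInMeters"] = uncertainty_m
        rec["geodeticDatum"] = datum


# Made-up example records.
example = [
    {"id": 1, "locality": "Ziarat, juniper forest"},
    {"id": 2, "locality": "ziarat,  juniper forest"},
    {"id": 3, "locality": "Quetta"},
]
example_groups = group_by_locality(example, "Ziarat")
# Placeholder values only; a real georeference would come from a map or gazetteer.
apply_georeference(example_groups["ziarat, juniper forest"], 0.0, 0.0, 5000)
```

Grouping on the exact locality text keeps the person doing the work in control: only records that clearly describe the same place get the same coordinates.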
If we used some money to purchase imaging equipment for KUH and/or RAW, it would be possible for qualified individuals elsewhere to assist in completing data capture for them. Within the herbarium, the focus would be on capturing skeletal data: taxon name, country, province. This would provide an overview of the holdings. Then qualified individuals could work from a computer lab to complete the data entry. There is also software that I have not used as yet to enable people to focus on a particular part of a label. I will try it out and share my reaction with you.
Compiling inventories of biodiversity data holdings (for example, by implementing metadata catalogues)
This would be relatively simple. There is a set of fields that has been recommended for describing a natural history collection. Let’s add to this task ensuring that each collection is registered in either Index Herbariorum or GRBio, and extend it to ALL natural history collections in Pakistan. I shall create a standard form (probably a spreadsheet) in the hope that we can simply upload the data all at once. It should not take any of them more than 1 hour and will make their existence more evident.
At present, Index Herbariorum lists 17 herbaria in Pakistan. There are 6 additional records in GRBio (one is FCBP). Probably neither list is complete. Faisalabad has a herbarium but it is not in Index Herbariorum.
Preparing data papers to improve the reusability of the mobilized biodiversity data
Today Pensoft published a data paper: https://phytokeys.pensoft.net/articles.php?id=20531. Take a look at it to see what is meant by a data paper (and the link in the header, which comes from GBIF).
Data for an even better article on the Flora of Pakistan is already available – inmates in a Utah county jail captured it. The data are in OpenHerbarium. With some effort, we could prepare a joint data paper on the Flora of Pakistan that has as its foundation the data they obtained – which enables discussion of data density across Pakistan, a topic that interests GBIF (see attachment). The one problem that I know of is that because of poor design (my fault), there may be species and/or specimen records not included. Part of the collaborative work would be to split up the families among you so that we can determine which species are missing and at least add them in as species with 1 specimen record.
Being able to make such a solid promise in terms of a data paper will strengthen the proposal. It would also be a way of publishing the comprehensive checklists that have been or are being proposed. Let’s start work on it right away. Perhaps different aspects could be presented at the meetings in Peshawar. By then we could at least have checked the species list.
It would be worth considering the extent to which family position has changed since publication (e.g., inclusion of Flacourtiaceae in Salicaceae – revised key at http://keybase.rbg.vic.gov.au/projects/show/34), possibly the number of changes at generic and species level (this gets more difficult at lower levels), and also new additions based on records from elsewhere (I have copied some into OpenHerbarium, but the procedure mentioned earlier will yield many more records). Along with this activity, we could also establish synonymy relationships within OpenHerbarium.
Symbiota, the program that runs OpenHerbarium and OpenZooMuseum
Symbiota is well-established, powerful software that has three key abilities: it enables direct data entry or uploading from some other database or a spreadsheet; it aggregates data from different collections and draws on all records (unless the user specifies otherwise) to answer questions; it can make records available, at a collection manager’s discretion, to other aggregators such as GBIF or the Himalayan Uplands Plant database.
There are ways in which Symbiota could be improved. In funding such improvements for this project, we would benefit other projects just as we shall be benefiting from the huge investment that other projects have made, and are making, in the program. Some of the improvements I have in mind:
- Create a downloadable, self-installing version that would include the nomenclatural and geographic files and that, when a “synchronize button” was clicked, would:
- Upload new and modified records to the network;
- Download new and modified nomenclatural, geographic, and specimen records – with the ability to specify the country for the latter. This would minimize the impact of slow and/or intermittent internet access.
- Enable the taxonomy viewer to show only family, genus, and species (for grasses, I often like to look at them sorted by subfamily and tribe but other times it is easier if the genera are listed alphabetically; people who work with composites would probably prefer never to see grasses organized by subfamily and tribe but like to do so for composites).
- Enable uploading multiple images for a taxon or record via a csv file.
- Have the duplicate discovery feature search the General Observations records for a duplicate catalog number. This would enable a more efficient workflow.
- When searching within a collection, allow at least two sorting levels.
- Enable entering location information from closest to most general, in other words from “2.3 miles up US 89 from Logan” to “United States” rather than the other way around – with automated selection of upper levels. This would increase efficiency because the lowest level will probably need to be entered, and the next higher level, District in Pakistan, might consist of almost unique (if not unique) names, so once one starts to enter them, the province and country could be filled in automatically.
- Add a field, continent, to the locality fields. To be honest, this would be most useful when looking for records from Africa. Perhaps too one would want to sort out south Asia, west Asia. COMMENTS WELCOME.
- Develop standardized names for taxon pages in Symbiota and names in KeyBase to make linking the two more efficient.
- Just a note: GIS tools are now being added to Symbiota and will be available in OpenHerbarium and OpenZooMuseum once we migrate the two databases to a different server – currently scheduled for early December. If the proposal we submitted is funded, it will also be possible to show the distribution of phylogenetic density across Pakistan without leaving Symbiota.
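The suggestion in the list above about automated selection of upper locality levels could work roughly as follows: once a district is entered, look it up, and fill in province and country only when the name is unambiguous. This is a sketch of the idea, not an existing Symbiota feature. The lookup table and function name are hypothetical; a real table would be built from the portal's geographic files. “Exampleville” is a made-up entry to show the ambiguous case.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical lookup: district name -> possible (province, country) pairs.
DISTRICTS: Dict[str, List[Tuple[str, str]]] = {
    "ziarat": [("Balochistan", "Pakistan")],
    "swat": [("Khyber Pakhtunkhwa", "Pakistan")],
    # Made-up name that occurs in two places, so it cannot be auto-filled.
    "exampleville": [("Province A", "Country X"), ("Province B", "Country Y")],
}


def autofill_upper_levels(district: str) -> Optional[Tuple[str, str]]:
    """Return (province, country) if the district name is unique, else None."""
    candidates = DISTRICTS.get(district.strip().lower(), [])
    if len(candidates) == 1:
        return candidates[0]
    # Unknown or ambiguous: leave the upper levels for the person to enter.
    return None
```

Returning nothing for ambiguous or unknown names keeps the feature safe: it saves typing where it can and never guesses.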
We can also create a URL that would show the Pakistani collections in the top box and, by default, display maps with Pakistan at the center, and have a blurb tailored to Pakistan. It could also list funders of Pakistan’s herbaria, particularly their digitization efforts – for example, HEC and any agency you could persuade to help with finding matching funds. One can match with time, but GBIF would be delighted to see that there had been success in persuading local institutions (including your own), agencies, or people to make a significant donation to help with digitization.
Another thing to think about: Symbiota is also the software behind OpenZooMuseum – so perhaps it would be worth promoting its use to colleagues – and then we could make a portal that covers both plants and animals. One has to maintain two databases underneath because some names are used for both plants and animals. Almost all accepted fungal names are in OpenHerbarium, as are all the names in the Flora of Pakistan plus several others – but not all those in the Plant List, because it accepts some that I do not and does not accept some that I do. We shall have to discuss how to handle taxonomy, but that would be easier at the PBS meeting, I suspect.
A standard cost estimate for data capture, excluding equipment, is $1 per specimen, which for most digitizers works out to 1 record completed/reviewed per 6 minutes. That is at US pay scale but it is still a guideline.
Let’s hope that bringing existing records over, checking and completing them will take about half the time of entering from scratch.
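The two estimates above can be turned into a rough budget calculation. The sketch below uses the record counts given earlier for the herbaria already in OpenHerbarium, the $1 and 6 minutes per new record, and the hoped-for “about half the time” for existing records. It assumes that cost scales in proportion to time, which is my simplification, and it keeps the US pay scale, so treat the output as a guide only.

```python
COST_PER_NEW_RECORD_USD = 1.00   # standard estimate, no equipment
MINUTES_PER_NEW_RECORD = 6       # 1 record completed/reviewed per 6 minutes
EXISTING_RECORD_FACTOR = 0.5     # hoped-for saving when a record already exists

# Records already in OpenHerbarium from Flora of Pakistan citations.
existing_records = {"KUH": 8825, "RAW": 9702, "PPFI & PPFI-B": 506, "ISL": 295}


def estimate(n_records, time_factor=1.0):
    """Return (hours, cost in USD) for n_records at the given time factor."""
    hours = n_records * MINUTES_PER_NEW_RECORD * time_factor / 60
    # Assumption: cost is proportional to the time spent.
    cost = n_records * COST_PER_NEW_RECORD_USD * time_factor
    return hours, cost


for herbarium, count in existing_records.items():
    hours, cost = estimate(count, EXISTING_RECORD_FACTOR)
    print(f"{herbarium}: {count} records, about {hours:.0f} hours, about ${cost:.0f}")
```

Calling `estimate(n)` with the default factor gives the figure for entering `n` new records from scratch, so the same function can be used when working out how many additional records the budget can cover.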
We need to mobilize as many records as possible, but we also want digitization to be seen as an integral part of running a herbarium. This requires committed leadership, not just reliance on new staff positions. GBIF can get things started in large collections. We use undergraduate students. All graduate students should be able to enter specimen data into OpenHerbarium and make use of its tools.