Specimen Upload Procedure

Creating a New Collection Profile
If this is the very first time data is to be loaded, a system administrator will need to first create a metadata record for the target collection. The easiest way to create a new collection profile is to log in as a system administrator, navigate to the Collection Profile page (/collections/misc/collprofiles.php), and click on the Add New Collection Profile symbol located at the upper right of the page. A link to this page can also be found in the sitemap for the portal (/sitemap.php). Although most of the fields are self-explanatory, a few details are explained below.

  • Selecting the “allow public edits” option will allow any logged-in user to edit your collection data. However, edits from users without explicit editing permissions will not be applied until an administrator for the collection reviews and approves them. The edit reviewer is available within the administrator control panel for each collection.
  • Collection type: General observation is typically used for personal specimen management. This allows field researchers to enter and manage their collection data as “observations” before they are submitted to a collection institution. Specimen labels for these collections can be printed within the portal. After submitting specimens to a physical collection, the receiving institution can transfer these records without having to retype the label data. Typically, there is only one General Observation project per data portal.
  • Management: Snapshot is for collections that maintain their central database within the home institution and only feature a snapshot of their data within the portal. Live Data means that the collection manages their data directly within the portal. The portal dataset is their central database.
  • The “icon” field contains a url for a small icon used to identify the collection. This icon is usually placed in the /images/collicon/ folder.
  • When available, the “IndividualUrl” field can be used to create a dynamic link to the record page at the source institution. This link is displayed on the Symbiota Specimen Detail Page. Enter a url with “–PK–” placed where the source primary key (dbpk) should be inserted (e.g. http://sourceinstitution.edu/herbarium/db/specimendetails.php?id=–PK–).

Field Mapping
File and direct upload procedures require source and target (Symbiota portal) fields to be mapped to one another. A user interface (see the screen snapshot below) is supplied to aid the mapping process. For collections that use the portal to feature a data “snapshot” of their central database, the first step is to select the required field that will serve as the primary specimen identifier (primary key) linking the records in the source database with those in the portal. This field can be alphanumeric but must be non-null, unique, and persistent over time. Any records with null or duplicate values will be filtered out and fail to load (a simple pre-upload check is sketched after the screenshot below). Collections that are “live” datasets managed directly within the portal do not require a primary key field, since new records are only appended to the existing dataset and not permanently mapped. The majority of Symbiota fields are Darwin Core compliant in naming and data type; however, the upload scripts are capable of converting many incompatible data types to match the Darwin Core definitions. The Automap option within the interface will match source columns using the most recent Darwin Core naming syntax. Field mappings can be saved, so maintaining a static output data format will aid future uploads.

[Screenshot: field mapping interface]
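Before running an upload for a snapshot collection, it can be worth confirming that the chosen primary key really is non-null and unique in the source data. The sketch below is only an illustration: the table name source_specimens and the column catalog_number are hypothetical placeholders for whichever field you select as the primary key (dbpk).

  -- Hypothetical source table and primary key column; substitute your own names.
  -- Records with a null primary key will be filtered out during upload:
  SELECT COUNT(*) AS null_keys
  FROM source_specimens
  WHERE catalog_number IS NULL OR catalog_number = '';

  -- Duplicate primary key values will also fail to load:
  SELECT catalog_number, COUNT(*) AS duplicates
  FROM source_specimens
  GROUP BY catalog_number
  HAVING COUNT(*) > 1;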

Stored Procedures for Custom Cleaning and Mapping
A stored procedure can be created and linked to one or several upload profiles. These stored procedures typically contain a series of SQL statements to perform data cleaning and transformations that are custom to the particular upload procedure. The combination of field mapping and custom stored procedures typically allows data managers to establish upload procedures that require no pre-manipulation of the source data. If the collection manager is capable of maintaining a consistent output format, they can usually perform their own data updates without the assistance of the portal data manager.
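The contents of such a procedure are entirely collection specific. The sketch below only illustrates the general shape: it assumes the Symbiota staging table uploadspectemp described later in this guide, a hypothetical collection id of 5, and approximate column names; the actual statements would be whatever cleanup your source data needs.

  -- Illustrative cleaning procedure; the collid value and column names are approximate.
  DELIMITER //
  CREATE PROCEDURE cleanup_myCollection()
  BEGIN
    -- Trim stray whitespace from catalog numbers
    UPDATE uploadspectemp
    SET catalogNumber = TRIM(catalogNumber)
    WHERE collid = 5;

    -- Normalize a local country abbreviation used by the source database
    UPDATE uploadspectemp
    SET country = 'United States'
    WHERE country = 'USA' AND collid = 5;
  END //
  DELIMITER ;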

Specimen Upload Procedure
System administrators can establish one or several upload procedures for the collection using the Specimen Upload Module (/collections/admin/specimenupload.php). A collection may want a quick upload procedure that only grabs the latest modified records as well as a more extensive procedure that performs a full refresh of the dataset. Details on specific upload procedures are listed below:

Direct Upload – Database to database transfer from source to the Symbiota node.

  1. Create new profile with the following fields: profile title, database platform, server name or IP address, port, login name, password, schema (database) name, stored procedure with data cleaning instructions for this specific upload, and the SQL statement used to query the data. Note that the SQL can be written to return only a subset of the data, such as records modified or added within the last month (see the example query after this list).
  2. If the collection type is a data “snapshot” of a specimen database managed within the home institution, select the primary key for the source specimen record. Snapshot datasets must contain a field that will serve as the primary record identifier. This field will serve as the permanent link between the source database and the portal records. It must be populated for every record with unique values, and the values need to be stable over time. The field can be a numeric or text type. The typical data used for this field is the catalog number (accession number), barcode, or database primary key from the source database specimen table.
  3. Map source fields to the Symbiota fields. Source column names do not have to match those of Symbiota, but data types and definitions must comply with the Darwin Core standard. The Automap button will automatically map and save all matching field names for future uploads. If the source SQL statement is changed, the mapping will have to be adjusted to match the new upload definition.
  4. Upload data. During upload, data is placed in a temporary specimen table (uploadspectemp) so that data cleaning and integrity checks can be performed by the collection-specific stored procedure.
  5. Perform final transfer.
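For example, the SQL statement stored in the profile might resemble the sketch below. The source table, column names, and aliases are hypothetical; the aliases simply need to stay consistent with the field mapping saved in step 3.

  -- Hypothetical source query returning records modified within the last month.
  SELECT s.specimen_id   AS dbpk,
         s.catalog_no    AS catalogNumber,
         s.full_name     AS scientificName,
         s.collector     AS recordedBy,
         s.coll_date     AS eventDate,
         s.country,
         s.state_prov    AS stateProvince
  FROM specimens s
  WHERE s.date_modified >= DATE_SUB(CURDATE(), INTERVAL 1 MONTH);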

DiGIR Upload – one of the most widely endorsed protocols for data exchange of natural history collections

  1. Create new data upload profile with the following fields:
    • Profile title
    • Server (host) – domain name where the DiGIR provider has been installed (e.g. http://ourserver.edu).
    • Port – usually 80
    • Code – code for the collection that was assigned when the DiGIR provider was established. Visiting the target DiGIR page (e.g. http://ourserver.edu/digir/DiGIR.php) will display the resources available along with their resource codes.
    • Path – url path to the DiGIR provider on the target server (e.g. /digir/DiGIR.php).
    • DiGIR Primary Key – field that can be used as the primary key for that collection (e.g. catalogNumber or occurrenceId).
    • Schema – schema used to define which fields should be returned (e.g. http://swbiodiversity.org/seinet/collections/admin/util/darwinsymbiota.xsd)
    • Query String – XML fragment defining which records should be returned.
      • If left blank, the following default query will be used to retrieve all records: <like><darwin:collectioncode></darwin:collectioncode></like>
      • For specific records: <and><equals><darwin:country>Canada</darwin:country></equals><like><darwin:family>Asteraceae</darwin:family></like></and>
    • Stored Procedure used for data cleaning and integrity checks.
  2. Upload data. During upload, data is placed in the temporary specimen table (uploadspectemp) so that data cleaning and integrity checks can be performed.
  3. Perform final transfer.

*Note that DiGIR harvesting is not the fastest method for grabbing data. Sometimes it makes sense to establish several DiGIR harvesting profiles to upload records in sets and append them to existing records. For example, one profile using <equals><darwin:country>Costa Rica</darwin:country></equals>, another using <equals><darwin:country>Panama</darwin:country></equals>, etc. Each record set would be downloaded separately and appended to previously obtained records.

CSV File Upload – Upload flat CSV files that have been extracted from the source database (an example export query appears after the steps below).

  1. Create new profile with the following fields: profile title, stored procedure for data cleaning and integrity checks.
  2. Select the primary key for the source specimen record. The field must be a non-null, unique ID for the specimen record and can be a numeric or text type. This field will be used to update modified records during future uploads; it is therefore required and must remain the same for all uploads.
  3. Map source fields to the Symbiota fields. Source column names do not have to match those of Symbiota, but data types and definitions must comply with the Darwin Core standard. The Automap button will automatically map and save all matching field names for future uploads. If the source field definition changes, the mapping will have to be adjusted to match the new upload definition.
  4. Upload data. During upload, data is placed in the temporary specimen table (uploadspectemp) so that data cleaning and integrity checks can be performed. The cleaning stored procedure specific to this collection is run on this table.
  5. Perform final transfer.
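If the CSV file is being extracted from a relational source database, a query along the lines of the sketch below can produce a suitable flat file. SELECT ... INTO OUTFILE is standard MySQL, but the table name, column names, and output path here are hypothetical, and a header row containing the column names used for field mapping must be added separately (INTO OUTFILE does not write one).

  -- Hypothetical flat-file export; adjust names and the output path.
  -- Note: add a header row with column names before uploading.
  SELECT specimen_id, catalog_no, full_name, collector, coll_date, country
  FROM specimens
  INTO OUTFILE '/tmp/specimen_export.csv'
  FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
  LINES TERMINATED BY '\n';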

System Script – MySQL source to a Symbiota database that is located on a different server

  • Write the script used to transfer records. A sample Linux script is located here: SampleSystemUpload.sh
  • Collection cleanup scripts can be put in a central stored procedure or kept separate
  • Set up the script to run as a regular cron job

SQL Stored Procedure – transfer from source schema to Symbiota database located on the same MySQL database server

  • Write the stored procedure used to transfer records (a sketch follows this list)
  • Collection cleanup scripts can be put in a central stored procedure or kept separate
  • Set up a script that calls the procedure to run as a regular cron job
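A minimal sketch of such a transfer procedure is shown below. The schema name sourcedb, the collection id, and the source column names are hypothetical, and the staging-table columns are approximate; check your portal's uploadspectemp definition and your saved field mapping for the actual names.

  -- Illustrative same-server transfer; names other than uploadspectemp are hypothetical.
  DELIMITER //
  CREATE PROCEDURE transfer_myCollection()
  BEGIN
    -- Clear any previous staging data for this collection
    DELETE FROM uploadspectemp WHERE collid = 5;

    -- Copy the source records into the staging table
    INSERT INTO uploadspectemp (collid, dbpk, catalogNumber, sciname, country)
    SELECT 5, s.specimen_id, s.catalog_no, s.full_name, s.country
    FROM sourcedb.specimens s;
  END //
  DELIMITER ;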

Uploading Tips

  • Collection dates mapped to eventDate will be evaluated and validated. Illegal dates will be placed in the verbatimEventDate field. The majority of the standard date formats are accepted, including Gregorian dates and Excel numeric date format (US only).
  • eventDate will be generated from separate year, month, and day field values. If the month or day fields are left null, ’00’ values will be used (ex: 1954-03-00, 1965-00-00). Month field values can be numeric or text (English or Spanish). A sketch of this transformation appears at the end of this list.
  • Scripts attempt to extract valid date values from the verbatimEventDate field when the eventDate field is null. Values of ’00’ are used for missing month or day (ex: 1954-03-00, 1965-00-00).
  • Coordinate values
    • Scripts will attempt to extract lat/long coordinate values from the verbatimCoordinates field. The field is evaluated for DMS and UTM formats, which are converted to decimal latitude and longitude.
    • Map verbatim lat/long values that exist in a single field to decimalLatitude and leave decimalLongitude null in order to direct scripts to parse using only the lat/long parser.
    • Mapping separate UTM fields (northing, easting, zone) to their matching UTM fields will direct scripts to convert UTM to decimal latitude and longitude. UTM values will be placed in the verbatimCoordinates field.
    • Map verbatim UTM values that exist in a single field to utmNorthing and leave the other UTM fields null in order to direct scripts to parse using only the UTM parser.
  • Scripts attempt to extract elevation values from the verbatimElevation field. Elevations in foot units will be converted to meters.
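The transformations above are performed automatically by the portal's upload scripts. If you prefer to pre-build eventDate yourself, for example inside a collection cleaning stored procedure, the date handling described above can be approximated as follows. The year/month/day column names are hypothetical, and the statement assumes the target column tolerates ’00’ month and day parts, as the portal's parsed dates do.

  -- Illustrative only: build eventDate from separate year/month/day columns,
  -- substituting '00' where month or day is missing (e.g. 1954-03-00).
  UPDATE uploadspectemp
  SET eventDate = CONCAT(`year`, '-',
                         LPAD(COALESCE(`month`, 0), 2, '0'), '-',
                         LPAD(COALESCE(`day`, 0), 2, '0'))
  WHERE eventDate IS NULL AND `year` IS NOT NULL;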