Batch Loading Specimen Images

Several workflows are available for batch linking specimen images. Batch processing typically consists of two separate steps: 1) batch loading images onto a web server, which simply makes the images available to a web browser, and 2) mapping image URLs to specimen records residing within the Symbiota portal. Technically, images are not stored within a Symbiota database. Instead, the URLs of the external images are mapped to the metadata within the specimen tables. Contact the portal manager to ask which workflows a given portal supports.

Common Workflows

  • Image Drop Folder – Using a mapped network drive, FTP, or SFTP, a drop folder is made available so that the imaging team can regularly deposit new images. This drop folder will be referred to as the source folder in the text below. Images should be JPGs or another web-friendly format. Since specimen images typically contain large amounts of white space, applying a small amount of JPG compression can significantly reduce the size of the image file without noticeably affecting the image quality. Archive images (TIF, DNG, RAW, etc.) are not addressed here since they typically are not stored on a web server associated with the portal.
    • Image Processing – Once images are dropped in the source folder, processing scripts create web-ready image derivatives and place them on a web server. Three copies are typically created: a basic web version, a thumbnail, and a large version. Image derivatives are mapped to the specimen records through the image URLs.
  • Local Storage – If images are stored on a local server that is write accessible to the portal code, image processing can be performed directly through the portal interface. URLs of locally stored images can be mapped in the database using relative links without the domain name. In this case, the web server user (e.g. Apache user) must have write access to both the source and target folders.
  • Remote Image Storage – Images can also be processed and stored on a remote server and mapped to the specimen record through the full image URL. Standalone image processing scripts will be needed to process images and map image URLs to the portal database. Scripts can be configured to write the image URL directly to the database, or image metadata can be written to a log file that is loaded into the database afterwards. Remote images must be mapped in the database using the full image URL with the domain name. Standalone scripts can be found in the following Symbiota directory: /trunk/collections/specprocessor/standalone_scripts/
    • If only a large image is made available from the remote server, the image URL can be mapped to the urloriginal field, and the portal will then create local web and thumbnail derivatives. If the images are named using the catalog number, and the web server is configured to display directory contents, the Processing Toolbox within the collection’s management menu contains a method to harvest image links from the directory listing. Alternatively, one can perform a “skeletal record import” that contains a column with the catalogNumber and another associatedMedia column with the image URL.
  • Image Servers
    • iDigBio Media Server – This workflow is a two-step process that involves 1) using the iDigBio Media Ingestion Appliance to load images onto the iDigBio image server, and 2) mapping the images to one or more Symbiota instances by uploading the iDigBio ingestion report. For detailed instructions, see the iDigBio Media Ingestion Application page.
    • iPlant Storage – This workflow is a two-step process that involves 1) using the iPlant iDrop appliance to load the images onto the Bisque image server, and 2) mapping the image URLs to specimen records within the Symbiota portal using the iPlant image mapping tools available from the Processing Toolbox.
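
The “skeletal record import” option described under Remote Image Storage can be prepared with a short script. The following is a minimal sketch, not part of Symbiota itself: it assumes that each remote image file is named with exactly the catalog number plus an extension, and the function name build_skeletal_import and the example.org URL are hypothetical. It produces CSV text with the catalogNumber and associatedMedia columns mentioned above.

```python
import csv
import io
import posixpath
from urllib.parse import urlparse


def build_skeletal_import(image_urls):
    """Return CSV text with catalogNumber and associatedMedia columns.

    Assumption: the file name without its extension is the catalog number.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["catalogNumber", "associatedMedia"])
    for url in image_urls:
        # Take the file name from the URL path and drop the extension.
        file_name = posixpath.basename(urlparse(url).path)
        catalog_number, _extension = posixpath.splitext(file_name)
        writer.writerow([catalog_number, url])
    return buffer.getvalue()


if __name__ == "__main__":
    # Hypothetical remote image URL, used only for illustration.
    print(build_skeletal_import(
        ["https://images.example.org/herbarium/ASU-L-0001234.jpg"]
    ))
```

If file names carry suffixes such as _a or _b, the catalog number would need to be extracted with a regular expression instead of by stripping the extension.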

 

Variations

  • Mapping Image to Specimen Record – Coordinating the image with the specimen record is typically done using a unique specimen identifier. The catalog number (e.g. barcode, herbarium accession number, etc.) typically serves this purpose. However, this is only possible if catalog numbers are populated in the database with truly unique values. If the unique identifier is the barcode, the identifier can be obtained from the specimen image using OCR. An alternative method that is sometimes more reliable is to place the identifier in the image file name. In either case, a regular expression is typically used to extract the identifier. For example, the regular expression /^(ASU-L-\d{7})\D*/ will extract ASU-L-0001234 from an image file named ASU-L-0001234_a.jpg.
  • Unprocessed Records – If a record matching the identifier is found in the portal, the web images will be linked to the existing record. If the record does not yet exist, scripts can be configured either to leave the image in the source folder unprocessed, or to load the image linked to a new skeletal record containing only the identifier within the catalog number field and “unprocessed” within the processingstatus field. The “unprocessed” records are available for future processing through crowdsourcing, OCR/NLP, duplicate batch processing, or regular data entry through the portal using the specimen image. These records can be retrieved from the data entry form by searching for “unprocessed” records.
  • Skeletal Data Files – Certain specimen metadata (e.g. country, state, filed by taxon name, etc.) can be efficiently recorded in a skeletal data file by the imaging team. Symbiota image processing scripts will process the CSV skeletal file and seed unprocessed records with that data. The skeletal data file must be a CSV file with column names that exactly match the Symbiota field names within the database (e.g. no spaces). A Java desktop application has been developed by the LBCC project to aid imaging teams in both renaming images using the barcode and collecting skeletal metadata.
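
The identifier extraction and the handling of unmatched images described above can be sketched as follows. This is an illustration under stated assumptions, not the logic of the Symbiota standalone scripts: the regular expression is the one quoted in the text, while the function names, the action labels, and the create_skeletal option are hypothetical stand-ins for whatever a given script's configuration provides.

```python
import re

# Pattern quoted in the text: a fixed "ASU-L-" prefix followed by seven digits.
CATALOG_PATTERN = re.compile(r"^(ASU-L-\d{7})\D*")


def extract_identifier(file_name):
    """Return the catalog number embedded in file_name, or None if absent."""
    match = CATALOG_PATTERN.match(file_name)
    return match.group(1) if match else None


def plan_action(file_name, existing_catalog_numbers, create_skeletal=True):
    """Decide what a processing script would do with one image file.

    The action labels returned here are hypothetical names for illustration.
    """
    identifier = extract_identifier(file_name)
    if identifier is None:
        # The file name does not contain a recognizable identifier.
        return ("skip", None)
    if identifier in existing_catalog_numbers:
        # A specimen record already exists, so the image is linked to it.
        return ("link_to_existing", identifier)
    if create_skeletal:
        # No record yet: create a skeletal record flagged as "unprocessed".
        return ("create_unprocessed", identifier)
    # No record yet and skeletal creation is off: leave the image in place.
    return ("leave_in_source", identifier)


if __name__ == "__main__":
    known = {"ASU-L-0001234"}
    print(plan_action("ASU-L-0001234_a.jpg", known))
    print(plan_action("ASU-L-0009999.jpg", known))
    print(plan_action("ASU-L-0009999.jpg", known, create_skeletal=False))
```

Note that the quoted pattern is anchored only at the start of the file name, so a name with more than seven digits after the prefix would still match on its first seven digits. A stricter pattern may be preferable where such names can occur.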