The document discusses batch management strategies for mass digitization projects. It describes how the U.S. National Herbarium digitization project handles digitization in batches, from scanning specimens in batches of 1,000-3,000 images to transcribing specimen labels and folder labels in batches of up to 4,000 records. Each batch is assigned an identification number to keep track of the records. Batches then need to be combined to create individual records. Issues can arise if batches are incomplete or not properly imported, so batch tracking is an important management tool.
Batch Management: A Conveyor Mindset for Mass Digitization
3. 2.7 million digital descriptive records
1.6 million specimen images
[Chart: specimens inventoried and imaged by year (Pre-2014, 2014, 2015, 2016, 2017); vertical axis 0 to 3,000,000]
U.S. National Herbarium Digitization
8. Batch Management
A batch is a quantity of the material produced during a given time
period or production run.
In mass production settings, production is usually executed
in batches.
Keeping track of records individually can be inefficient
Conveyor digitization creates several batches of different material
each day. Batches are identified by id number.
All batches need to meet at some future point to create individual
records.
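The point that separately produced batches must eventually meet to form individual records can be sketched in Python. This is only an illustration, not the project's actual tooling: the join key (`barcode`), the field names `scan_batch_id` and `transcription_batch_id`, and the batch id values are all assumptions made for the example.

```python
def combine_batches(images, transcriptions):
    """Join image rows and transcription rows on barcode.

    Returns (records, unmatched): merged records for barcodes present in both
    inputs, and the barcodes that appear in only one of them.
    """
    images_by_barcode = {row["barcode"]: row for row in images}
    transcriptions_by_barcode = {row["barcode"]: row for row in transcriptions}

    shared = images_by_barcode.keys() & transcriptions_by_barcode.keys()
    records = [
        {**images_by_barcode[barcode], **transcriptions_by_barcode[barcode]}
        for barcode in sorted(shared)
    ]

    # Barcodes found in exactly one batch signal an incomplete batch.
    unmatched = sorted(images_by_barcode.keys() ^ transcriptions_by_barcode.keys())
    return records, unmatched


# Hypothetical sample rows; the batch ids are invented for illustration.
images = [
    {"barcode": "01842474", "scan_batch_id": "SCAN_A"},
    {"barcode": "01842475", "scan_batch_id": "SCAN_A"},
]
transcriptions = [
    {"barcode": "01842475", "transcription_batch_id": "TRANS_A"},
    {"barcode": "01842476", "transcription_batch_id": "TRANS_A"},
]
records, unmatched = combine_batches(images, transcriptions)
```

Because every merged record carries both batch ids, a problem found in one record can be traced back to the batches that produced it.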
10. Botany Conveyor Project: Transcription Workflow
Alembo transcribes specimen labels
Picturae batches label transcriptions in sets of 4000 and does preliminary review
NMNH Botany reviews label transcriptions at 2.5% check
Accepted label transcription sets added to master transcription SQL db
Rejected sets returned to Picturae for correction
Batches of 30,000-40,000 transcriptions created from master SQL db; all records in batch reviewed for import to EMu
Alembo transcribes cover taxonomic names; EMu taxonomic irns added if in picklist
Picturae batches cover transcriptions in sets of 100-4000 and does preliminary review
NMNH Botany reviews label transcriptions and adds missing EMu irns
Taxonomic irns added to import batches
Import scripts run on import batch
Import to EMu
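The 2.5% review check in the workflow above could be drawn as a simple random sample. The slides do not say how the sample is actually selected, so random selection, the function name `sample_for_review`, and the `seed` parameter are assumptions of this sketch.

```python
import random


def sample_for_review(record_ids, fraction=0.025, seed=None):
    """Pick a random subset of record ids for manual review.

    At least one record is chosen for any non-empty batch.
    """
    record_ids = list(record_ids)
    if not record_ids:
        return []
    sample_size = max(1, round(len(record_ids) * fraction))
    rng = random.Random(seed)  # a fixed seed makes the sample repeatable
    return sorted(rng.sample(record_ids, sample_size))


# A transcription set of 4000 records yields 100 records to check.
batch = list(range(1, 4001))
to_review = sample_for_review(batch, seed=0)
```

Recording the seed alongside the batch id would let a reviewer reproduce exactly which records were checked.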
11. Conveyor Batches
Scanning batch
Specimen image set (1-3K scans) batched off the conveyor
Images batched from conveyor server to DAMS
batch id follows image from conveyor to EMu multimedia record
Transcriptions batch of specimen labels
Transcription sets of specimen labels (1-4K records) batched from Alembo (1st id)
Transcription set batched for import to EMu (2nd id)
Scanning batch id kept internally in transcription records as well as included in multimedia record
Alembo transcription batch id kept internally in transcription records
Import id included in EMu catalog record
Transcription batches of folder labels/ taxonomy
Transcription sets of folder labels (400-2000 records) batched from Alembo
Folder label records are assigned EMu taxonomy irns
IRNs assigned to individual records in transcription import batch
Scanning batch id kept internally in transcription records
17. Transcriptions batch of specimen labels
Transcription sets of specimen labels (1-4K records) batched from Alembo (1st id)
Transcription set batched by NMNH for import to EMu (2nd id)
Alembo batch id kept internally in transcription records as well as included in multimedia record
Alembo transcription batch id kept internally in transcription records
NMNH Import id included in EMu catalog record
20. Why does this matter?
Important management tool
Important for tracing errors and issues
21. Example 1
Hi Sylvia,
I just finished reviewing TSI_20160825_BATCH_01_MS.
Overall, I would probably accept this batch, but there was another issue I noticed. Several
chunks of records do not have working JPG links, and I could not locate the barcodes in the
correctly dated folders or just in the JPG file in general. So I am not sure where the images went
for these records. There are complete transcriptions recorded for them, but I'm not sure how
to check them with the images. Here were the records with issues:
ID # 138-403 (Folder dates: 02/05, 07/15, 01/29)
ID # 1154-1330 (Folder dates: 01/29, 07/22, 02/05)
ID # 1395-1458 (Folder dates: 02/05)
ID # 1483-1533 (Folder dates: 02/05, 06/16, 07/15)
So it looks like the problematic dates are: 01/29, 02/05, 06/16, 07/15, and 07/22.
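The reviewer reports problem records as contiguous ID ranges, which is what makes the batch-level pattern visible. A minimal sketch of collapsing a list of affected record ids into such ranges follows; the function name `collapse_ranges` and the sample ids are hypothetical.

```python
def collapse_ranges(ids):
    """Collapse integer ids into sorted (start, end) tuples of consecutive runs."""
    ranges = []
    for n in sorted(set(ids)):
        if ranges and n == ranges[-1][1] + 1:
            ranges[-1][1] = n  # extend the current run
        else:
            ranges.append([n, n])  # start a new run
    return [(start, end) for start, end in ranges]


# Hypothetical ids of records whose JPG links do not resolve.
broken = [138, 139, 140, 1395, 1396, 403]
runs = collapse_ranges(broken)
```

Long runs suggest that a whole group of images went missing together, while isolated ids point to individual record errors.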
22. Everyone,
We have multiple image groups that are not in DAMS and the VFCU reports. All the dates
are Fridays with one exception.
It looks like the problematic dates are: 01/29, 02/05, 06/16, 07/15, and 07/22.
For example on 7/15 we are missing the Tiff/Iiq 01842476
From the VFCU for 7/15 I can see the last image on that day was 01842475 but the sequence
does not pick up the following day of production on 7/18.
However there is a transcribed image that we can't see nor can we find a deliverable on the
Picturae server.
- Could this be a permission issue?
- A batching error?
- How are the jpgs created from the IIQ or TIF or at the point of capture?
We are working on creating a list of everything we don't have deliverables for from
TSI_20160825_BATCH_01_MS
Our concern is that there are many other dates besides the ones I outlined above which we
only discovered due to transcription checking.
25. Example 2
Hi Stephanie. I have a few pockets of missing Multimedia records in EMu, and want to make
sure that these images all exist in the DAMS before I send a request to NMNH IT to import
these images to EMu. I will give you a list of the missing images. Sylvia
Sylvia,
Most were picked up by VFCU but not delivered to EMu. These are the directories
we are focusing on:
The files from this list that DID go through VFCU were in these directories:
nmnh-botany-20160630
nmnh-botany-20161202
nmnh-botany-20170130
nmnh-botany-20161031-reprocessed_tifs
nmnh-botany-reshoots-2016-sep-part2-reprocessed_tifs
26. Example 2
Reasons for non-delivery to EMu
Batches not marked for pickup by EMu
Batches had errors in them and not imported
Partial loading of batch, but process failed
All missing images were affected by batch errors, not individual record errors.
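One way to check whether missing images cluster by batch is to count them per batch directory. The sketch below assumes a lookup from barcode to batch directory is available; that lookup, the function name `missing_by_batch`, and the barcodes are hypothetical, while the directory names are two of those quoted in the email above.

```python
from collections import Counter


def missing_by_batch(missing_barcodes, batch_of):
    """Count missing images per batch, largest count first.

    batch_of maps barcode -> batch directory name. Barcodes with no known
    batch are counted under "unknown".
    """
    counts = Counter(batch_of.get(barcode, "unknown") for barcode in missing_barcodes)
    return counts.most_common()


# Hypothetical barcodes for illustration only.
batch_of = {
    "00000001": "nmnh-botany-20160630",
    "00000002": "nmnh-botany-20160630",
    "00000003": "nmnh-botany-20160630",
    "00000004": "nmnh-botany-20161202",
}
missing = ["00000001", "00000002", "00000003", "00000004", "00000009"]
summary = missing_by_batch(missing, batch_of)
```

If a handful of batches account for nearly all missing images, the fix belongs at the batch level (re-marking or re-importing the batch) rather than record by record.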
31. In conclusion
Mass digitization is mass production, and should be managed as such in batches
Patterns are constant in large amounts of data
Always look at the forest when thinking about the trees
Editor's Notes
#4: We significantly increased our rate of digitization and now have over 2.398 million digital descriptive records and 1.4 million specimen images. The conveyor belt has imaged over 1 million specimens, with 900,000 records being transcribed to create digital descriptive records. The remaining 100,000 were specimens that had previously been inventoried and were simply imaged on the conveyor belt. This has been a great way for us to significantly increase our rate of digitization, and with an estimated 5 million pressed specimens in our collection, it is moving us towards our ultimate goal of a completely databased and imaged herbarium collection.