BATCH
MANAGEMENT
A Conveyor Mindset for Mass Digitization
Sylvia Orli
US Herbarium / NMNH
Smithsonian Institution
Batch Management: A Conveyor Mindset for Mass Digitization
2.7 million digital descriptive records
1.6 million specimen images
[Chart: U.S. National Herbarium Digitization. Series: Inventoried, Imaged. Horizontal axis: Pre-2014, 2014, 2015, 2016, 2017. Vertical axis: 0 to 3,000,000.]
Batch Management
- A batch is a quantity of material produced during a given time period or production run.
- In mass production settings, production is usually executed in batches.
- Keeping track of records individually can be inefficient.
- Conveyor digitization creates several batches of different material each day. Batches are identified by id number.
- All batches need to meet at some future point to create individual records.
Botany Conveyor Project: Transcription Workflow
Specimen labels
- Alembo transcribes specimen labels.
- Picturae batches label transcriptions in sets of 4000 and does a preliminary review.
- NMNH Botany reviews label transcriptions with a 2.5% check.
- Accepted label transcription sets are added to the master transcription SQL db.
- Rejected sets are returned to Picturae for correction.
- Batches of 30,000-40,000 transcriptions are created from the master SQL db; all records in the batch are reviewed for import to EMu.
Cover taxonomic names
- Alembo transcribes cover taxonomic names; EMu taxonomic irns are added if in the picklist.
- Picturae batches cover transcriptions in sets of 100-4000 and does a preliminary review.
- NMNH Botany reviews label transcriptions and adds missing EMu irns.
- Taxonomic irns are added to import batches.
Import
- Import scripts are run on the import batch.
- Import to EMu.
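
The 2.5% check in the workflow above can be pictured as a random sample drawn from each transcription set. This is a minimal sketch, assuming the check is a simple random sample; the function name `review_sample`, the fixed seed, and the round-up rule are illustrative assumptions, not the project's documented procedure.

```python
import random


def review_sample(record_ids, per_thousand=25, seed=0):
    """Return a sorted random sample of record ids to check by hand.

    Assumption: the 2.5% check (25 per thousand) is a simple random
    sample, rounded up so that a small set still gets one record checked.
    """
    record_ids = list(record_ids)
    if not record_ids:
        return []
    # Integer ceiling of len * per_thousand / 1000, avoiding float rounding.
    sample_size = max(1, (len(record_ids) * per_thousand + 999) // 1000)
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    return sorted(rng.sample(record_ids, sample_size))


# A set of 4000 label transcriptions yields 100 records to review.
sample = review_sample(range(1, 4001))
print(len(sample))
```

At this rate a full set of 4000 transcriptions means reviewing 100 records, so accepting or rejecting a set is a batch-level decision based on a small sample.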
Conveyor Batches
Scanning batch
- Specimen image set (1-3K scans) batched off the conveyor
- Images batched from conveyor server to DAMS
- Batch id follows image from conveyor to EMu multimedia record
Transcription batch of specimen labels
- Transcription sets of specimen labels (1-4K records) batched from Alembo (1st id)
- Transcription set batched by NMNH for import to EMu (2nd id)
- Scanning batch id kept internally in transcription records as well as included in multimedia record
- Alembo transcription batch id kept internally in transcription records
- NMNH import id included in EMu catalog record
Transcription batches of folder labels / taxonomy
- Transcription sets of folder labels (400-2000 records) batched from Alembo
- Folder label records are assigned EMu taxonomy irns
- IRNs assigned to individual records in transcription import batch
- Scanning batch id kept internally in transcription records
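
One way to picture how these batch ids travel with a specimen is a small record structure. This is a minimal sketch: the class, field names, and sample values are hypothetical and do not reflect the EMu schema or the project's SQL tables.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class SpecimenRecord:
    """Hypothetical record carrying the three batch ids described above."""
    barcode: str
    scanning_batch_id: str                        # follows the image to the multimedia record
    transcription_batch_id: Optional[str] = None  # 1st id, set when a transcription set is delivered
    import_batch_id: Optional[str] = None         # 2nd id, set when an import batch is built


def group_by_batch(records, field):
    """Group barcodes by the value of one batch id field."""
    groups = defaultdict(list)
    for record in records:
        groups[getattr(record, field)].append(record.barcode)
    return dict(groups)


# Invented placeholder values, for illustration only.
records = [
    SpecimenRecord("00000001", "scan-A", "trans-1", "import-1"),
    SpecimenRecord("00000002", "scan-A", "trans-1", "import-1"),
    SpecimenRecord("00000003", "scan-B", "trans-1"),
]
print(group_by_batch(records, "scanning_batch_id"))
print(group_by_batch(records, "import_batch_id"))
```

Grouping by any one of the ids shows which records moved together at that stage; a record with no import id has not yet reached an import batch.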
Why does this matter?
- Important management tool
- Important for tracing errors and issues
Example 1
Hi Sylvia,
I just finished reviewing TSI_20160825_BATCH_01_MS.
Overall, I would probably accept this batch, but there was another issue I noticed. Several
chunks of records do not have working JPG links, and I could not locate the barcodes in the
correctly dated folders or just in the JPG file in general. So I am not sure where the images went
for these records. There are complete transcriptions recorded for them, but I'm not sure how
to check them with the images. Here were the records with issues:
ID # 138-403 (Folder dates: 02/05, 07/15, 01/29)
ID # 1154-1330 (Folder dates: 01/29, 07/22, 02/05)
ID # 1395-1458 (Folder dates: 02/05)
ID # 1483-1533 (Folder dates: 02/05, 06/16, 07/15)
So it looks like the problematic dates are: 01/29, 02/05, 06/16, 07/15, and 07/22.
Everyone,
We have multiple image groups that are not in DAMS and the VFCU reports. All the dates
are Fridays with one exception.
It looks like the problematic dates are: 01/29, 02/05, 06/16, 07/15, and 07/22.
For example, on 7/15 we are missing the Tiff/Iiq 01842476.
From the VFCU for 7/15 I can see the last image on that day was 01842475, but the sequence
does not pick up the following day of production on 7/18.
However, there is a transcribed image that we can't see, nor can we find a deliverable on the
Picturae server.
- Could this be a permission issue?
- A batching error?
- How are the jpgs created: from the IIQ or TIF, or at the point of capture?
We are working on creating a list of everything we don't have deliverables for from
TSI_20160825_BATCH_01_MS.
Our concern is that there are many other dates besides the ones I outlined above, which we
only discovered due to transcription checking.
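
The remark that the problem dates are Fridays with one exception is the kind of batch-level pattern that is quick to check. This is a minimal sketch, assuming the folder dates fall in 2016; the year is inferred from the batch name TSI_20160825_BATCH_01_MS and is not stated in the emails.

```python
from datetime import date

# Assumption: the year is taken from the batch name, not from the emails.
ASSUMED_YEAR = 2016
DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

# (month, day) pairs for the problematic folder dates listed above.
problem_dates = [(1, 29), (2, 5), (6, 16), (7, 15), (7, 22)]

weekdays = {
    f"{month:02d}/{day:02d}": DAY_NAMES[date(ASSUMED_YEAR, month, day).weekday()]
    for month, day in problem_dates
}
for folder_date, day_name in weekdays.items():
    print(folder_date, day_name)
```

Under that assumption, four of the five dates fall on a Friday and 06/16 falls on a Thursday, which is consistent with the email and points toward something in the end-of-week production run rather than toward individual records.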
Example 2
Hi Stephanie. I have a few pockets of missing Multimedia records in EMu, and want to make
sure that these images all exist in the DAMS before I send a request to NMNH IT to import
these images to EMu. I will give you a list of the missing images. Sylvia
Sylvia,
Most were picked up by VFCU but not delivered to EMu. These are the directories
we are focusing on:
The files from this list that DID go through VFCU were in these directories:
nmnh-botany-20160630
nmnh-botany-20161202
nmnh-botany-20170130
nmnh-botany-20161031-reprocessed_tifs
nmnh-botany-reshoots-2016-sep-part2-reprocessed_tifs
Example 2
Reasons for non-delivery to EMu
- Batches were not marked for pickup by EMu
- Batches had errors in them and were not imported
- Batch was partially loaded, but the process failed
All missing images were affected by batch errors, not individual record errors.
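
Because the missing images cluster by batch, a useful first check is to count gaps per delivery directory rather than chase records one at a time. This is a minimal sketch, assuming two exported lists are available (images known to DAMS with their delivery directory, and barcodes that already have an EMu multimedia record). The function name, data shapes, and barcodes are hypothetical, and no DAMS or EMu API is used.

```python
from collections import Counter


def missing_by_directory(dams_images, emu_barcodes):
    """Count images present in DAMS but absent from EMu, per directory.

    dams_images: iterable of (barcode, directory) pairs.
    emu_barcodes: iterable of barcodes that have a multimedia record.
    """
    in_emu = set(emu_barcodes)
    return Counter(
        directory for barcode, directory in dams_images
        if barcode not in in_emu
    )


# Directory names come from the email above; barcodes are invented.
dams_images = [
    ("B0001", "nmnh-botany-20160630"),
    ("B0002", "nmnh-botany-20160630"),
    ("B0003", "nmnh-botany-20161202"),
    ("B0004", "nmnh-botany-20170130"),
]
emu_barcodes = ["B0003"]

gaps = missing_by_directory(dams_images, emu_barcodes)
for directory, count in gaps.most_common():
    print(directory, count)
```

A directory whose images are all missing suggests the batch was never picked up or failed outright; a directory with only some missing suggests a partial load.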
Example 3
Original Transcribed Data
Import set 6 Data
N is often mistranscribed as U
H is often mistranscribed as L
Jallu → Jahn
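
Recurring mistranscriptions like these only become visible when corrections are tallied across a whole batch. This is a minimal sketch using Python's standard `difflib` to count which character runs were replaced between a transcribed value and its corrected value. The function name and the second pair are invented for illustration; only "Jallu" and "Jahn" come from the slide.

```python
from collections import Counter
from difflib import SequenceMatcher


def substitution_counts(pairs):
    """Tally (transcribed_run, corrected_run) replacements across pairs."""
    counts = Counter()
    for transcribed, corrected in pairs:
        matcher = SequenceMatcher(None, transcribed, corrected)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag == "replace":
                counts[(transcribed[i1:i2], corrected[j1:j2])] += 1
    return counts


pairs = [
    ("Jallu", "Jahn"),   # from the slide
    ("Smitl", "Smith"),  # invented pair, for illustration only
]
for (wrong, right), n in substitution_counts(pairs).items():
    print(f"{wrong!r} -> {right!r}: {n}")
```

Run over an entire import set, a tally like this would surface the most frequent confusions so they can be corrected as a batch instead of record by record.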
In conclusion
- Mass digitization is mass production, and should be managed as such, in batches
- Patterns are constant in large amounts of data
- Always look at the forest when thinking about the trees

Editor's Notes

  • #4: We significantly increased our rate of digitization and now have over 2.398 million digital descriptive records and 1.4 million specimen images. The conveyor belt has imaged over 1 million specimens, with 900,000 records being transcribed to create digital descriptive records. The remaining 100,000 were specimens that had previously been inventoried and were simply imaged on the conveyor belt. This has been a great way for us to significantly increase our rate of digitization, and with an estimated 5 million pressed specimens in our collection, it is moving us towards our ultimate goal of a completely databased and imaged herbarium collection.