Towards a Higher Accuracy of Optical Character Recognition of Chinese Rare Books in Making Use of Text Model
Hsiang-An Wang
Academia Sinica
Center for Digital Cultures

Ink Bleed and Poor Quality

Limitation (Missing and Extra Words)
[Figure: OCR output shown beside the original text, two examples]

Experiment: Data Collection
• Training dataset: 187 ancient medicine books from the Scripta Sinica Database (about 40 million words)
• Testing dataset: 1 related ancient medicine book named   with a total of 185,000 words
• The OCR output contains about 180,000 correct words and about 5,000 incorrect words, which gives an accuracy of about 97.3%

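The accuracy figure follows directly from the rounded counts quoted on this slide. A quick check, using only those approximate numbers (the variable names are illustrative):

```python
# Approximate counts quoted on the slide (rounded by the authors).
correct_words = 180_000
incorrect_words = 5_000

total_words = correct_words + incorrect_words   # 185,000 words in the test book
accuracy = correct_words / total_words          # share of words the OCR got right

print(f"total words: {total_words}")
print(f"OCR accuracy: {accuracy:.1%}")
```
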
Experiment: Building an N-gram Model
• Relies on the sequence of words in the training dataset: for a given input, we pick the output with the highest frequency.
• " "
• 2-gram: input to predict " "
• 3-gram: input to predict " "
• 4-gram: input to predict " "
• ...

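A minimal sketch of the frequency-based prediction described on this slide, assuming that an N-gram model uses the previous N-1 words as input and returns the word that most often followed them in the training text. The function names and the toy training string are hypothetical, and each letter stands in for one Chinese word.

```python
from collections import Counter, defaultdict

def build_ngram_model(text, n):
    """Count which word follows each context of n-1 words."""
    counts = defaultdict(Counter)
    for i in range(len(text) - n + 1):
        context = text[i:i + n - 1]
        target = text[i + n - 1]
        counts[context][target] += 1
    return counts

def predict(model, context):
    """Return the most frequent follower of `context`, or None if unseen."""
    followers = model.get(context)
    if not followers:
        return None
    return followers.most_common(1)[0][0]

# Toy training text (hypothetical); the real model was trained on about 40 million words.
training_text = "abcdabceabcd"
model = build_ngram_model(training_text, n=4)

# "abc" is followed by "d" twice and by "e" once, so "d" is predicted.
print(predict(model, "abc"))
```

Returning None for an unseen context is a choice made for this sketch; the slides do not say how unseen contexts were handled.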
Experiment: Building a Backward and Forward N-gram Model
• Relies on the sequences of backward and forward words in the training dataset; we pick the output with the highest frequency.
• Because the backward and forward N-grams are kept as two separate sets of N-grams, the model can be used when the same word is found afterwards.
• " "
• Backward 4-gram: input to predict " "
• Forward 4-gram: input to predict " "

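One way to picture two separate N-gram sets is sketched below. It assumes that one table is keyed by the N-1 words before a position and the other by the N-1 words after it; which of these the slides call "backward" and which "forward" cannot be recovered from this text. Summing the counts from both tables is also an assumption, since the slides do not say how the two sets are combined. All names and the toy text are hypothetical.

```python
from collections import Counter, defaultdict

def build_tables(text, n):
    """Build two separate count tables for the word at each position:
    one keyed by the n-1 preceding words, one by the n-1 following words."""
    k = n - 1
    by_preceding = defaultdict(Counter)
    by_following = defaultdict(Counter)
    for i, word in enumerate(text):
        if i >= k:
            by_preceding[text[i - k:i]][word] += 1
        if i + k < len(text):
            by_following[text[i + 1:i + 1 + k]][word] += 1
    return by_preceding, by_following

def predict(tables, preceding, following):
    """Add up the counts from both tables and return the top candidate.
    Summing is an assumption made for this sketch."""
    by_preceding, by_following = tables
    votes = Counter()
    votes.update(by_preceding.get(preceding, {}))
    votes.update(by_following.get(following, {}))
    if not votes:
        return None
    return votes.most_common(1)[0][0]

# Toy text (hypothetical): each letter stands in for one Chinese word.
tables = build_tables("abcdefgabcdefgabcxefg", n=4)

print(predict(tables, "abc", "efg"))   # both contexts known
print(predict(tables, "zzz", "efg"))   # only the following context is known
```

Because the two tables are independent in this sketch, a prediction is still possible when only one of the two contexts has been seen in training.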
Experiment: Building an LSTM Model
• Used Word2vec to project the text into a 200-dimensional vector space
• Used an LSTM neural network with three layers
• Picked the word with the highest score in the softmax layer as the prediction
• " "
• LSTM 2-gram: input to predict " "
• LSTM 3-gram: input to predict " "
• LSTM 4-gram: input to predict " "

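The shape of this pipeline (word vectors, three stacked LSTM layers, softmax, pick the top score) can be sketched as a forward pass in plain NumPy. This is not the authors' implementation: the weights are random and untrained, the random embeddings merely stand in for Word2vec vectors, the hidden size of 32 and the toy vocabulary are assumptions, and reading "LSTM 4-gram" as "three input words predict the fourth" is an interpretation by analogy with the N-gram slide.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = list("abcdefgh")   # toy vocabulary (hypothetical)
EMBED_DIM = 200            # vector size stated on the slide
HIDDEN_DIM = 32            # assumed; not given on the slides
NUM_LAYERS = 3             # three layers, as on the slide

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Random stand-in for trained Word2vec vectors.
embeddings = rng.normal(scale=0.1, size=(len(VOCAB), EMBED_DIM))

def init_layer(input_dim):
    """Weights for one LSTM layer; the four gates are stacked in one matrix."""
    return {
        "W": rng.normal(scale=0.1, size=(4 * HIDDEN_DIM, input_dim + HIDDEN_DIM)),
        "b": np.zeros(4 * HIDDEN_DIM),
    }

layers = [init_layer(EMBED_DIM if i == 0 else HIDDEN_DIM) for i in range(NUM_LAYERS)]
W_out = rng.normal(scale=0.1, size=(len(VOCAB), HIDDEN_DIM))
b_out = np.zeros(len(VOCAB))

def lstm_step(layer, x, h, c):
    """One time step of a standard LSTM cell."""
    z = layer["W"] @ np.concatenate([x, h]) + layer["b"]
    i, f, g, o = np.split(z, 4)
    c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
    h = sigmoid(o) * np.tanh(c)
    return h, c

def predict_next(context):
    """Run the context through the stacked LSTM and return softmax scores."""
    states = [(np.zeros(HIDDEN_DIM), np.zeros(HIDDEN_DIM)) for _ in layers]
    for word in context:
        x = embeddings[VOCAB.index(word)]
        for k, layer in enumerate(layers):
            h, c = lstm_step(layer, x, *states[k])
            states[k] = (h, c)
            x = h                          # output of this layer feeds the next
    logits = W_out @ states[-1][0] + b_out
    exp = np.exp(logits - logits.max())    # numerically stable softmax
    return exp / exp.sum()

probs = predict_next("abc")                # three input words
best = VOCAB[int(np.argmax(probs))]        # word with the highest softmax score
print(best, float(probs.max()))
```

With untrained weights the predicted word is meaningless; the sketch only shows how the three components named on the slide connect.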
The Modification of Correctness Rate in the N-gram Model
• The 7-gram achieves the best correction rate

The Modification of Correctness Rate in the Backward and Forward N-gram Model
• The Backward and Forward 4-gram achieves the best correction rate

The Modification of Correctness Rate in the LSTM Model
• The LSTM 6-gram achieves the best correction rate

Comparison of 7-gram, LSTM 6-gram and BF 4-gram Text Models

Model         Correct OCR results    Incorrect OCR results   Accuracy of OCR
              changed to wrong       changed to right        plus text model
OCR           X                      X                       97.30%
7-gram        0.35%                  13.06%                  97.49%
LSTM 6-gram   0.1%                   7.33%                   97.5%
BF 4-gram     0.08%                  9.54%                   97.57%

• The Backward and Forward (BF) 4-gram has the best performance, with the lowest rate of changing correct OCR results to wrong ones (0.08%) and the highest overall accuracy (97.57%)

Three Text Models with OCR Top 5 Candidate Words
• The OCR software we use is a Convolutional Neural Network model that calculates the classification probability through a softmax function
• When the probability of the OCR Top 1 candidate is lower than 95%, the word is judged to be possibly wrong and the mixed model is used
• The mixed model picks the word with the highest text-model score that also appears among the OCR Top 5 candidate words

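The selection rule on this slide can be sketched as follows. The data layout (a list of candidate/probability pairs and a dictionary of text-model scores), the function name, and the fallback to the OCR Top 1 word when no Top 5 candidate has a text-model score are assumptions made for illustration.

```python
THRESHOLD = 0.95   # OCR Top 1 probability below this triggers the mixed model

def choose_word(ocr_candidates, text_model_scores):
    """ocr_candidates: list of (word, probability) pairs, best first.
    text_model_scores: dict mapping word -> score from the text model."""
    top_word, top_prob = ocr_candidates[0]
    if top_prob >= THRESHOLD:
        return top_word                      # OCR is confident: keep Top 1
    top5 = [word for word, _ in ocr_candidates[:5]]
    scored = [word for word in top5 if word in text_model_scores]
    if not scored:
        return top_word                      # assumed fallback: keep OCR Top 1
    return max(scored, key=lambda word: text_model_scores[word])

# Toy example (hypothetical values): the text model prefers "z", but "z" is not
# among the OCR Top 5, so the best-scoring Top 5 candidate "b" is chosen.
ocr = [("a", 0.60), ("b", 0.30), ("c", 0.05), ("d", 0.03), ("e", 0.02)]
scores = {"z": 0.9, "b": 0.5, "a": 0.1}
print(choose_word(ocr, scores))

# A confident OCR result is left untouched.
print(choose_word([("a", 0.97), ("b", 0.03)], scores))
```

Restricting the text model to the OCR Top 5 keeps it from replacing a word with one that the image does not support at all.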
Comparison of Three Text Models Mixed with the Probability of OCR

Model         Correct OCR results    Incorrect OCR results   Accuracy of OCR
              changed to wrong       changed to right        plus text model
OCR           X                      X                       97.30%
7-gram        0.012%                 9%                      97.63%
LSTM 6-gram   0.13%                  16%                     97.71%
BF 4-gram     0.009%                 5.92%                   97.55%

• The LSTM 6-gram mixed with the probability of OCR has the best performance, with the highest overall accuracy (97.71%)

Conclusion: Using Text Model
• An N-gram, backward and forward N-gram, or LSTM N-gram text model can increase the accuracy of OCR
• The Backward and Forward 4-gram model has the lowest rate of wrong modifications and the highest overall accuracy

Conclusion: Mixing Text Models with the Probability of OCR
• Mixing the rules for the OCR Top 5 candidate words and the Top 1 probability with a text model can achieve better results than using a text model only
• Mixing the LSTM 6-gram with the probability of the OCR model gives the highest accuracy

Thank you for listening