An Empirical Study to Use Large Language Models to Extract Named Entities from Repetitive Texts
Angelica Lo Duca
Researcher @ Institute of Informatics and Telematics
National Research Council, Italy
The Problem
Extract named entities from a text with a repetitive structure (e.g., a registry).
5510 Sunday first day of Rosh Chodesh Cheshvan (Heb.) which corresponds to October 12, 1749. A daughter was born to the Lord Salamon Sezzi and was named Ribqa bemazal tov (Heb.)
Night of Wednesday 23rd of the month Kislev (Heb.) 5510 which corresponds to December 3, 1749. A son was born to the Lord Samuel Cardoso and on the day of the mila he named him David Haim besiman tov (Heb.)
Tuesday night of the 2nd of the month of Adar Sheni (Heb.) which corresponds to March 10, 1750. A daughter was born to the Lord Angiolo Leucci and she was named Ester bemazal tov (Heb.)
Name        Date        Sex  Father
Ribqa       1749/10/12  F    Salamon Sezzi
David Haim  1749/12/03  M    Samuel Cardoso
Ester       1750/03/10  F    Angiolo Leucci
Traditional Methods
Rule-based
Rely on predefined rules and patterns crafted based on the language's linguistic properties. In a previous paper, we used this approach for the same use case described in this paper.
Learning-based
Learning-based approaches utilize machine learning algorithms to learn from annotated datasets. These approaches can range from traditional machine learning methods to more advanced deep learning techniques.
Hybrid
Hybrid approaches combine the strengths of both rule-based and learning-based methods. They often employ machine learning models to capture complex patterns and use rule-based systems to handle well-defined entities.
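To make the rule-based idea concrete, here is a minimal sketch of a hand-written pattern for the three translated sample records shown earlier. It is an illustration only, not the method of the previous paper: the regular expression, the `MONTHS` list, and the `extract` function are all hypothetical and were written just for these samples.

```python
import re
from typing import Optional

# English month names, used to turn "October" into "10" without relying on the locale.
MONTHS = ["January", "February", "March", "April", "May", "June",
          "July", "August", "September", "October", "November", "December"]

# Hypothetical rule fitted to the translated sample records only.
RECORD_RE = re.compile(
    r"corresponds to (?P<month>[A-Z][a-z]+) (?P<day>\d{1,2}), (?P<year>\d{4})\. "
    r"A (?P<kind>son|daughter) was born to the Lord (?P<father>.+?) and "
    r".*?named (?:him |her )?(?P<child>.+?) (?:bemazal|besiman) tov"
)

def extract(record: str) -> Optional[dict]:
    """Return the four fields of one record, or None if the rule does not match."""
    m = RECORD_RE.search(record)
    if m is None or m["month"] not in MONTHS:
        return None
    month = MONTHS.index(m["month"]) + 1
    return {
        "name": m["child"],
        "date": f"{m['year']}/{month:02d}/{int(m['day']):02d}",
        "sex": "M" if m["kind"] == "son" else "F",
        "father": m["father"],
    }

sample = ("Night of Wednesday 23rd of the month Kislev (Heb.) 5510 which corresponds to "
          "December 3, 1749. A son was born to the Lord Samuel Cardoso and on the day of "
          "the mila he named him David Haim besiman tov (Heb.)")
print(extract(sample))
# {'name': 'David Haim', 'date': '1749/12/03', 'sex': 'M', 'father': 'Samuel Cardoso'}
```

A rule like this breaks as soon as a record deviates from the expected wording, which is the usual weakness of the rule-based approach.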
In this paper
Use Large Language Models (LLMs)
to extract Named Entities
[Diagram components: Repetitive Text, Instructions Template, LLM-based App, Record]
Birth Registry of Jewish Community in Pisa
- 262 records related to the members of the Pisa Jewish community.
- Each record contains a member's name, date of birth, sex, and father's name.
- Date range: 1749-1809.
Name        Date        Sex  Father
Ribqa       1749/10/12  F    Salamon Sezzi
David Haim  1749/12/03  M    Samuel Cardoso
Ester       1750/03/10  F    Angiolo Leucci
5510 Sunday first day of Rosh Chodesh Cheshvan (Heb.)
which corresponds to October 12, 1749. A daughter was
born to the Lord Salamon Sezzi and was named Ribqa
bemazal tov (Heb.)
Night of Wednesday 23rd of the month Kislev (Heb.) 5510
which corresponds to December 3, 1749. A son was born to
the Lord Samuel Cardoso and on the day of the mila he
named him David Haim besiman tov (Heb.)
Tuesday night of the 2nd of the month of Adar Sheni (Heb.)
which corresponds to March 10, 1750. A daughter was born
to the Lord Angiolo Leucci and she was named Ester
bemazal tov (Heb.)
Compare the performance of two LLMs:
- GPT 3.5 Turbo
- GPT 4
Three levels of instruction templates, based on the level of detail they describe:
- simple
- medium
- detailed
2 models (GPT 3.5 Turbo, GPT 4) x 3 templates (simple, medium, detailed) = 6 experiments
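The experiment grid is the Cartesian product of the two models and the three templates. A minimal sketch follows; the strings are the labels used on the slides, not API model identifiers.

```python
from itertools import product

MODELS = ["GPT 3.5 Turbo", "GPT 4"]          # labels from the slides
TEMPLATES = ["simple", "medium", "detailed"]  # labels from the slides

# Every (model, template) pair is one experiment.
experiments = list(product(MODELS, TEMPLATES))

for model, template in experiments:
    print(f"{model} + {template} template")

print(len(experiments))  # 6
```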
Simple template
For each line extract:
- child name, father name, sex, date of birth and format as CSV
Instructions:
- If you find a son, set sex to M
- If you find a daughter, sex is F
- Do not include besiman tov in the child's name
Answer by formatting the output in CSV.
Zero-shot prompting: giving the model a task without providing examples.
Medium template
For each line extract:
- extract child name, father name, sex, date of birth and format as CSV
Instructions:
- If you find a son, set sex to M
- If you find a daughter, sex is F
- Do not include besiman tov in the child's name
Follow this example:
Input:
<5510 Giorno di domenica primo giorno di Rosh Chodesh Cheshvan che corresponde a 12 ottobre 1749. Naque una figlia al Signore Salamon Sezzi e si pose nome Ribqa bemazal tov>
Output:
Ribqa,Salamon Sezzi,F,1749-10-17
One-shot prompting: provide a specific example to tell the model how to respond.
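The difference between zero-shot and one-shot prompting can be sketched as a small prompt builder. This is an assumption about how such a prompt might be assembled, not the code used in the study: `build_prompt` and `INSTRUCTIONS` are hypothetical names, and the closing "Answer by formatting the output in CSV." line is always appended here for simplicity.

```python
from typing import List, Optional, Tuple

# Instruction text taken from the slides.
INSTRUCTIONS = (
    "For each line extract:\n"
    "- child name, father name, sex, date of birth and format as CSV\n"
    "Instructions:\n"
    "- If you find a son, set sex to M\n"
    "- If you find a daughter, sex is F\n"
    "- Do not include besiman tov in the child's name\n"
)

def build_prompt(records: List[str],
                 example: Optional[Tuple[str, str]] = None) -> str:
    """Zero-shot when example is None, one-shot when an (input, output) pair is given."""
    parts = [INSTRUCTIONS]
    if example is not None:
        example_input, example_output = example
        parts.append(
            f"Follow this example:\nInput:\n<{example_input}>\nOutput:\n{example_output}\n"
        )
    parts.append("Answer by formatting the output in CSV.\n")
    parts.append("\n".join(records))  # one registry record per line
    return "".join(parts)

zero_shot = build_prompt(["<record 1>", "<record 2>"])
one_shot = build_prompt(["<record 1>"], example=("<example input>", "<example output>"))
```

The resulting string would then be sent to the chosen model; the call itself is left out because it depends on the provider's client library.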
Detailed template
For each line extract:
- extract child name, father name, sex, date of birth and format as CSV
Instructions:
- If you find a son, set sex to M
- If you find a daughter, sex is F
- Do not include besiman tov in the child's name
- If the child's name is not present, add only a comma
- If the father's name is not present, add only a comma
- If the date of birth is not present, add only a comma
Follow this example:
Input:
<5510 Giorno di domenica primo giorno di Rosh Chodesh Cheshvan che corresponde a 12 ottobre 1749. Naque una figlia al Signore Salamon Sezzi e si pose nome Ribqa bemazal tov>
Output:
Ribqa,Salamon Sezzi,F,1749-10-17
Answer by formatting the output in CSV.
One-shot prompting: provide a specific example to tell the model how to respond.
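The "add only a comma" rule means that a missing value shows up as an empty CSV field. A minimal sketch of reading such output is given below. It assumes the field order of the example output (child, father, sex, date); `parse_response` and `FIELDS` are hypothetical names, and the second row of the sample response is invented to show a missing child name.

```python
import csv
import io

# Field order assumed from the example output "Ribqa,Salamon Sezzi,F,1749-10-17".
FIELDS = ["child", "father", "sex", "date"]

def parse_response(text: str) -> list:
    """Parse the model's CSV answer; empty fields become None."""
    rows = []
    for values in csv.reader(io.StringIO(text)):
        if not values:
            continue  # skip blank lines
        # Pad or trim so that every record has exactly four fields.
        values = (values + [""] * len(FIELDS))[: len(FIELDS)]
        rows.append({k: (v.strip() or None) for k, v in zip(FIELDS, values)})
    return rows

response = "Ribqa,Salamon Sezzi,F,1749-10-17\n,Samuel Cardoso,M,1749-12-03\n"
rows = parse_response(response)
print(rows[1]["child"])  # None
```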
Metrics
Father Ratio: the ratio between the number of entities recognized correctly as a father and the total number of records
Child Ratio
Date Ratio
Sex Ratio
Total Ratio: the sum of four counts (entities recognized correctly as a father, entities recognized correctly as a child, records where the child's sex is correctly identified, and records where the child's date of birth is correctly identified) divided by the total number of records multiplied by four.
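The metrics can be sketched as follows. The sketch assumes that predicted and expected records are aligned one-to-one, that a field counts as correct only on an exact string match, and that the child, date, and sex ratios follow the same pattern as the father ratio. The function name `ratios` and the dictionary keys are hypothetical.

```python
FIELDS = ["father", "child", "date", "sex"]

def ratios(predicted: list, expected: list) -> dict:
    """Per-field ratios plus the total ratio over all four fields."""
    n = len(expected)  # total number of records
    # Count, for each field, the records where prediction and ground truth agree.
    correct = {
        f: sum(p.get(f) == e[f] for p, e in zip(predicted, expected))
        for f in FIELDS
    }
    result = {f"{f}_ratio": correct[f] / n for f in FIELDS}
    # Total Ratio: all correct fields divided by (4 x number of records).
    result["total_ratio"] = sum(correct.values()) / (4 * n)
    return result

expected = [
    {"father": "Salamon Sezzi", "child": "Ribqa", "date": "1749-10-12", "sex": "F"},
    {"father": "Samuel Cardoso", "child": "David Haim", "date": "1749-12-03", "sex": "M"},
]
# Toy prediction with one wrong father name.
predicted = [dict(expected[0]), dict(expected[1], father="Samuel")]

print(ratios(predicted, expected)["total_ratio"])  # 0.875
```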
The easiest to recognize
The most difficult to recognize
GPT 4 with the medium and detailed templates and GPT 3.5 with the detailed template are the best configurations
[Figure: Cost vs. Total Ratio chart with four quadrants: high performance/high cost, high performance/low cost, low performance/low cost, low performance/high cost]
Lessons from the experiments
- GPT 3.5 Turbo: Ideal for cost-sensitive tasks.
- GPT 4: Best for applications requiring high accuracy.
- Clear instructions optimize results.
- Precision requires clarity.
Conclusions and Future Work
- This paper explored the application of LLMs, specifically GPT 3.5 Turbo and GPT 4, for extracting named entities from repetitive texts.
- This paper has demonstrated that all the tested LLMs reach a total ratio greater than 0.75.
- In all cases, costs should also be considered while choosing the best model.
- Future work could include comparing them with models released by different providers, such as Google and Meta.
Thanks for your attention!
angelica.loduca@iit.cnr.it
https://www.linkedin.com/in/angelicaloduca/
https://alod83.medium.com/
Questions?